
Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (Google Inc., Mountain View, CA), 2013

Paper Information

  • arXiv ID: 1301.3781
  • Venue: International Conference on Learning Representations
  • Domain: Natural Language Processing
  • SOTA Claim: Yes
  • Code: Not specified
  • Reproducibility: 9/10

Abstract

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

The test set is available at www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt

Summary

The paper proposes two novel model architectures, Continuous Bag-of-Words (CBOW) and Skip-gram, for efficiently learning continuous vector representations of words from large corpora (up to 1.6 billion words), achieving better results on a word similarity task than previous neural-network-based techniques. The authors evaluate the learned vectors on a comprehensive test set of 8869 semantic and 10675 syntactic analogy questions, reporting state-of-the-art accuracy on syntactic and semantic word relationships together with large gains in computational efficiency over prior models, including the Feedforward Neural Net Language Model (NNLM) and the Recurrent Neural Net Language Model (RNNLM). The paper also analyzes how training complexity can be reduced while preserving accuracy, and argues that the resulting word vectors can benefit many NLP applications such as machine translation and information retrieval.
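
As a rough illustration of how the two proposed architectures are used in practice, the sketch below trains CBOW and Skip-gram embeddings with the gensim library (an assumption made for illustration; the authors' original implementation was a standalone C tool). The toy corpus and hyperparameters are placeholders, not the paper's settings.

```python
# Minimal sketch: training CBOW and Skip-gram word vectors with gensim (>= 4.0 assumed).
# The corpus and hyperparameters are illustrative only, not the paper's setup.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

# sg=0 selects CBOW (predict the current word from its averaged context);
# sg=1 selects Skip-gram (predict surrounding words from the current word).
cbow = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

print(cbow.wv["king"].shape)                    # (100,) dense vector for "king"
print(skipgram.wv.similarity("king", "queen"))  # cosine similarity between two word vectors
```

On real data the paper finds that Skip-gram gives stronger semantic accuracy while CBOW is faster to train and slightly better on syntactic questions; its comparisons use corpora of hundreds of millions to billions of words rather than a toy corpus like the one above.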

Methods

This paper employs the following methods:

  • Continuous Bag-of-Words (CBOW)
  • Skip-gram

Models Used

  • Continuous Bag-of-Words (CBOW)
  • Skip-gram
  • Feedforward Neural Net Language Model (NNLM)
  • Recurrent Neural Net Language Model (RNNLM)
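
For context on how these architectures trade representation quality against training cost, the per-training-example complexity terms the paper compares are roughly the following (a reconstruction for reference; the log2 V terms assume a hierarchical softmax output layer):

```latex
% Approximate per-training-example complexity terms compared in the paper (sketch).
% N = number of context words, D = embedding dimensionality, H = hidden-layer size,
% V = vocabulary size, C = maximum context distance for Skip-gram.
\begin{align*}
Q_{\text{NNLM}}      &= N \cdot D + N \cdot D \cdot H + H \cdot V \\
Q_{\text{RNNLM}}     &= H \cdot H + H \cdot V \\
Q_{\text{CBOW}}      &= N \cdot D + D \cdot \log_2 V \\
Q_{\text{Skip-gram}} &= C \cdot \left( D + D \cdot \log_2 V \right)
\end{align*}
```

The log-linear CBOW and Skip-gram models drop the nonlinear hidden layer, which removes the dominant H terms and is what makes training on billion-word corpora feasible.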

Datasets

The following datasets were used in this research:

  • Google News
  • 1.6 billion words dataset

Evaluation Metrics

  • Accuracy
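
Accuracy here is exact-match accuracy on the word-analogy test: a question such as "man is to woman as king is to ?" counts as correct only if the word whose vector is closest (by cosine similarity) to vec("woman") - vec("man") + vec("king") is exactly "queen". Below is a minimal sketch of that evaluation, assuming a trained gensim model such as the `skipgram` model from the earlier example; the question list is a hypothetical stand-in for the published word-test.v1.txt file.

```python
# Sketch of the vector-offset analogy evaluation (exact-match accuracy).
# Assumes `model` is a trained gensim Word2Vec model; the questions are illustrative.
def analogy_accuracy(model, questions):
    """questions: iterable of (a, b, c, expected), read as 'a is to b as c is to expected'."""
    correct = total = 0
    for a, b, c, expected in questions:
        try:
            # Nearest word to vec(b) - vec(a) + vec(c); gensim excludes the input words.
            predicted = model.wv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        except KeyError:
            continue  # skip questions containing out-of-vocabulary words
        total += 1
        correct += int(predicted == expected)
    return correct / total if total else 0.0

questions = [("man", "woman", "king", "queen")]  # one semantic question as an example
print(analogy_accuracy(skipgram, questions))
```

On the toy corpus above the prediction will be noisy; the accuracies reported in the paper come from vectors trained on corpora of hundreds of millions to billions of words.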

Results

  • Achieved state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
  • Training word vectors from a 1.6 billion words dataset takes less than a day.
  • Proposed models outperform existing architectures in both training efficiency and accuracy.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

word vectors, vector space, neural networks, distributed representations, semantic relationships, syntactic regularities

Papers Using Similar Methods

External Resources