
Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (Google Inc., Mountain View, CA), 2013

Paper Information

  • arXiv ID: 1301.3781
  • Venue: International Conference on Learning Representations
  • Domain: Natural Language Processing
  • SOTA Claim: Yes
  • Code: Not specified
  • Reproducibility: 9/10

Abstract

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

The test set is available at www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt

Summary

The paper proposes two novel model architectures, Continuous Bag-of-Words (CBOW) and Skip-gram, for efficiently learning continuous vector representations of words from large corpora (up to 1.6 billion words), achieving better results on a word similarity task than previous neural-network-based techniques. The authors evaluate the learned vectors on a comprehensive test set of 8869 semantic and 10675 syntactic analogy questions, reporting state-of-the-art accuracy on syntactic and semantic word relationships together with large gains in computational efficiency over prior models, including the Feedforward Neural Net Language Model (NNLM) and the Recurrent Neural Net Language Model (RNNLM). The paper also analyzes how training complexity can be reduced while preserving accuracy, and argues that the resulting word vectors can benefit many NLP applications such as machine translation and information retrieval.
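
As a rough illustration of how the two proposed architectures are used in practice, the sketch below trains CBOW and Skip-gram embeddings with the gensim library (an assumption made for illustration; the authors' original implementation was a standalone C tool). The toy corpus and hyperparameters are placeholders, not the paper's settings.

```python
# Minimal sketch: training CBOW and Skip-gram word vectors with gensim (>= 4.0 assumed).
# The corpus and hyperparameters are illustrative only, not the paper's setup.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

# sg=0 selects CBOW (predict the current word from its averaged context);
# sg=1 selects Skip-gram (predict surrounding words from the current word).
cbow = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

print(cbow.wv["king"].shape)                    # (100,) dense vector for "king"
print(skipgram.wv.similarity("king", "queen"))  # cosine similarity between two word vectors
```

On real data the paper finds that Skip-gram gives stronger semantic accuracy while CBOW is faster to train and slightly better on syntactic questions; its comparisons use corpora of hundreds of millions to billions of words rather than a toy corpus like the one above.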

Methods

This paper employs the following methods:

  • Continuous Bag-of-Words (CBOW)
  • Skip-gram

Models Used

  • Continuous Bag-of-Words (CBOW)
  • Skip-gram
  • Feedforward Neural Net Language Model (NNLM)
  • Recurrent Neural Net Language Model (RNNLM)
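
For context on how these architectures trade representation quality against training cost, the per-training-example complexity terms the paper compares are roughly the following (a reconstruction for reference; the log2 V terms assume a hierarchical softmax output layer):

```latex
% Approximate per-training-example complexity terms compared in the paper (sketch).
% N = number of context words, D = embedding dimensionality, H = hidden-layer size,
% V = vocabulary size, C = maximum context distance for Skip-gram.
\begin{align*}
Q_{\text{NNLM}}      &= N \cdot D + N \cdot D \cdot H + H \cdot V \\
Q_{\text{RNNLM}}     &= H \cdot H + H \cdot V \\
Q_{\text{CBOW}}      &= N \cdot D + D \cdot \log_2 V \\
Q_{\text{Skip-gram}} &= C \cdot \left( D + D \cdot \log_2 V \right)
\end{align*}
```

The log-linear CBOW and Skip-gram models drop the nonlinear hidden layer, which removes the dominant H terms and is what makes training on billion-word corpora feasible.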

Datasets

The following datasets were used in this research:

  • Google News
  • 1.6 billion words dataset

Evaluation Metrics

  • Accuracy
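
Accuracy here is exact-match accuracy on the word-analogy test: a question such as "man is to woman as king is to ?" counts as correct only if the word whose vector is closest (by cosine similarity) to vec("woman") - vec("man") + vec("king") is exactly "queen". Below is a minimal sketch of that evaluation, assuming a trained gensim model such as the `skipgram` model from the earlier example; the question list is a hypothetical stand-in for the published word-test.v1.txt file.

```python
# Sketch of the vector-offset analogy evaluation (exact-match accuracy).
# Assumes `model` is a trained gensim Word2Vec model; the questions are illustrative.
def analogy_accuracy(model, questions):
    """questions: iterable of (a, b, c, expected), read as 'a is to b as c is to expected'."""
    correct = total = 0
    for a, b, c, expected in questions:
        try:
            # Nearest word to vec(b) - vec(a) + vec(c); gensim excludes the input words.
            predicted = model.wv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        except KeyError:
            continue  # skip questions containing out-of-vocabulary words
        total += 1
        correct += int(predicted == expected)
    return correct / total if total else 0.0

questions = [("man", "woman", "king", "queen")]  # one semantic question as an example
print(analogy_accuracy(skipgram, questions))
```

On the toy corpus above the prediction will be noisy; the accuracies reported in the paper come from vectors trained on corpora of hundreds of millions to billions of words.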

Results

  • Achieved state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
  • Training word vectors from a 1.6 billion words dataset takes less than a day.
  • Proposed models outperform existing architectures in both training efficiency and accuracy.

Limitations

The authors identified the following limitations:

  • Not specified

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

word vectors, vector space, neural networks, distributed representations, semantic relationships, syntactic regularities

Papers Using Similar Methods

External Resources