
Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Google Inc., Mountain View (2013)

Paper Information
arXiv ID
1310.4546
Venue
Neural Information Processing Systems
Domain
natural language processing
SOTA Claim
Yes
Code
Available
Reproducibility
9/10

Abstract

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
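
A minimal sketch of the frequent-word subsampling rule mentioned above, using the discard probability 1 − sqrt(t / f(w)) reported in the paper with the commonly cited threshold t = 1e-5; the function name, seed, and toy token stream are illustrative, not from the paper.

```python
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Randomly discard frequent tokens.

    Each occurrence of word w is dropped with probability
    1 - sqrt(t / f(w)), where f(w) is w's relative frequency,
    so very frequent words ("the", "a", ...) are aggressively
    thinned while rare words are essentially always kept.
    """
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (t / f) ** 0.5)
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

# Illustrative toy token stream, not the paper's training corpus.
toy = ["the", "cat", "sat", "on", "the", "mat"] * 1000
print(len(subsample(toy)), "of", len(toy), "tokens kept")
```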

Summary

This paper presents the continuous Skip-gram model for learning high-quality distributed vector representations of words and phrases, together with several extensions that improve both representation quality and training speed. The key enhancements are subsampling of frequent words, which accelerates training and improves the representations of rarer words, and negative sampling, a simpler alternative to hierarchical softmax. The paper also shows that idiomatic phrases can be treated as single tokens, making the learned vocabulary more expressive. Empirical results on analogical reasoning tasks indicate that negative sampling outperforms hierarchical softmax, particularly when combined with subsampling of frequent words. The authors further highlight the approximately linear compositionality of the learned vectors, which allows analogies to be solved with simple vector arithmetic. Finally, a comparison with other word representation models shows that the Skip-gram model, trained on a much larger corpus, produces representations of significantly higher quality.
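
The vector-arithmetic analogy solving mentioned above can be sketched as follows; the tiny hand-made embedding table, the analogy function, and the 3-dimensional vectors are illustrative placeholders for trained Skip-gram embeddings.

```python
import numpy as np

# Tiny hand-made vectors purely for illustration; trained Skip-gram
# embeddings would be learned from a large corpus and have hundreds
# of dimensions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def analogy(a, b, c, emb):
    """Answer 'a is to b as c is to ?' by returning the word whose
    vector is most cosine-similar to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, float("-inf")
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = float(v @ target) / (np.linalg.norm(v) * np.linalg.norm(target) + 1e-9)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman", emb))  # prints: queen
```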

Methods

This paper employs the following methods:

  • Skip-gram
  • Hierarchical Softmax
  • Negative Sampling (a sketch of its objective follows this list)
  • Noise Contrastive Estimation
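
The negative-sampling objective replaces the full softmax with a few binary classifications per training pair: the model is pushed to score the observed context word high and k sampled noise words low (the paper reports drawing noise words from the unigram distribution raised to the 3/4 power). A minimal sketch of the per-pair loss, with random toy vectors and a hypothetical function name, is below.

```python
import numpy as np

def negative_sampling_loss(v_c, u_o, u_neg):
    """Loss for one (input word, context word) training pair.

    v_c   : vector of the input (center) word
    u_o   : output vector of the observed context word
    u_neg : (k, d) matrix of output vectors for k sampled noise words

    loss = -log sigma(u_o . v_c) - sum_i log sigma(-u_neg[i] . v_c)
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    positive = np.log(sigmoid(u_o @ v_c))
    negative = np.sum(np.log(sigmoid(-(u_neg @ v_c))))
    return -(positive + negative)

# Toy usage: random 10-dimensional vectors and k = 5 noise words.
rng = np.random.default_rng(0)
d, k = 10, 5
print(negative_sampling_loss(rng.normal(size=d),
                             rng.normal(size=d),
                             rng.normal(size=(k, d))))
```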

Models Used

  • Skip-gram

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • Accuracy on analogical reasoning tasks (see Results)

Results

  • Negative Sampling outperformed Hierarchical Softmax on analogical reasoning tasks.
  • Subsampling of frequent words improved training speed and accuracy of word representations.

Limitations

The authors identified the following limitations:

  • Inability of word representations to account for word order and idiomatic phrases.
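
The paper's remedy for the idiomatic-phrase limitation above is a simple data-driven bigram score, (count(w1 w2) − δ) / (count(w1) · count(w2)), with high-scoring pairs merged into single tokens. A minimal sketch in that spirit follows; the discount δ, the threshold, and the toy corpus are chosen for illustration rather than taken from the paper.

```python
from collections import Counter

def find_phrases(tokens, delta=5.0, threshold=1e-4):
    """Score each adjacent word pair with
    (count(w1 w2) - delta) / (count(w1) * count(w2))
    and return the pairs whose score exceeds the threshold; such
    pairs can then be merged into single tokens like "new_york".
    The discount delta suppresses rare, coincidental bigrams.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    phrases = {}
    for (w1, w2), c in bigrams.items():
        score = (c - delta) / (unigrams[w1] * unigrams[w2])
        if score > threshold:
            phrases[(w1, w2)] = score
    return phrases

# Toy stream for illustration only; on such a small, repetitive corpus
# most pairs clear the threshold, whereas on billions of tokens a tuned
# threshold isolates genuine phrases such as ("new", "york").
toy = "new york is a big city and new york is busy".split() * 200
top = sorted(find_phrases(toy).items(), key=lambda kv: -kv[1])[:3]
print(top)
```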

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

word embeddings, Skip-gram, negative sampling, phrases, distributed representations

Papers Using Similar Methods

External Resources