
Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean. Google Inc., Mountain View (2013)

Paper Information
arXiv ID
1310.4546
Venue
Neural Information Processing Systems
Domain
natural language processing
SOTA Claim
Yes
Code
Available
Reproducibility
9/10

Abstract

The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
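
A minimal sketch of the frequent-word subsampling rule mentioned above, using the discard probability 1 − sqrt(t / f(w)) reported in the paper with the commonly cited threshold t = 1e-5; the function name, seed, and toy token stream are illustrative, not from the paper.

```python
import random
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Randomly discard frequent tokens.

    Each occurrence of word w is dropped with probability
    1 - sqrt(t / f(w)), where f(w) is w's relative frequency,
    so very frequent words ("the", "a", ...) are aggressively
    thinned while rare words are essentially always kept.
    """
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        p_discard = max(0.0, 1.0 - (t / f) ** 0.5)
        if rng.random() >= p_discard:
            kept.append(w)
    return kept

# Illustrative toy token stream, not the paper's training corpus.
toy = ["the", "cat", "sat", "on", "the", "mat"] * 1000
print(len(subsample(toy)), "of", len(toy), "tokens kept")
```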

Summary

This paper presents the continuous Skip-gram model for learning high-quality distributed vector representations of words and phrases, together with several extensions that improve both representation quality and training speed. The key enhancements are subsampling of frequent words, which accelerates training and improves the representations of rarer words, and negative sampling, a simpler alternative to hierarchical softmax. The paper also shows that idiomatic phrases can be treated as single tokens, making the learned vocabulary more expressive. Empirical results on analogical reasoning tasks indicate that negative sampling outperforms hierarchical softmax, particularly when combined with subsampling of frequent words. The authors further highlight the approximately linear compositionality of the learned vectors, which allows analogies to be solved with simple vector arithmetic. Finally, a comparison with other word representation models shows that the Skip-gram model, trained on a much larger corpus, produces representations of significantly higher quality.
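
The vector-arithmetic analogy solving mentioned above can be sketched as follows; the tiny hand-made embedding table, the analogy function, and the 3-dimensional vectors are illustrative placeholders for trained Skip-gram embeddings.

```python
import numpy as np

# Tiny hand-made vectors purely for illustration; trained Skip-gram
# embeddings would be learned from a large corpus and have hundreds
# of dimensions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def analogy(a, b, c, emb):
    """Answer 'a is to b as c is to ?' by returning the word whose
    vector is most cosine-similar to vec(b) - vec(a) + vec(c),
    excluding the three query words themselves."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, float("-inf")
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = float(v @ target) / (np.linalg.norm(v) * np.linalg.norm(target) + 1e-9)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(analogy("man", "king", "woman", emb))  # prints: queen
```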

Methods

This paper employs the following methods:

  • Skip-gram
  • Hierarchical Softmax
  • Negative Sampling (a sketch of its objective follows this list)
  • Noise Contrastive Estimation
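
The negative-sampling objective replaces the full softmax with a few binary classifications per training pair: the model is pushed to score the observed context word high and k sampled noise words low (the paper reports drawing noise words from the unigram distribution raised to the 3/4 power). A minimal sketch of the per-pair loss, with random toy vectors and a hypothetical function name, is below.

```python
import numpy as np

def negative_sampling_loss(v_c, u_o, u_neg):
    """Loss for one (input word, context word) training pair.

    v_c   : vector of the input (center) word
    u_o   : output vector of the observed context word
    u_neg : (k, d) matrix of output vectors for k sampled noise words

    loss = -log sigma(u_o . v_c) - sum_i log sigma(-u_neg[i] . v_c)
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    positive = np.log(sigmoid(u_o @ v_c))
    negative = np.sum(np.log(sigmoid(-(u_neg @ v_c))))
    return -(positive + negative)

# Toy usage: random 10-dimensional vectors and k = 5 noise words.
rng = np.random.default_rng(0)
d, k = 10, 5
print(negative_sampling_loss(rng.normal(size=d),
                             rng.normal(size=d),
                             rng.normal(size=(k, d))))
```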

Models Used

  • Skip-gram

Datasets

The following datasets were used in this research:

  • None specified

Evaluation Metrics

  • Accuracy on analogical reasoning tasks (see Results)

Results

  • Negative Sampling outperformed Hierarchical Softmax on analogical reasoning tasks.
  • Subsampling of frequent words improved training speed and accuracy of word representations.

Limitations

The authors identified the following limitations:

  • Inability of word representations to account for word order and idiomatic phrases.
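
The paper's remedy for the idiomatic-phrase limitation above is a simple data-driven bigram score, (count(w1 w2) − δ) / (count(w1) · count(w2)), with high-scoring pairs merged into single tokens. A minimal sketch in that spirit follows; the discount δ, the threshold, and the toy corpus are chosen for illustration rather than taken from the paper.

```python
from collections import Counter

def find_phrases(tokens, delta=5.0, threshold=1e-4):
    """Score each adjacent word pair with
    (count(w1 w2) - delta) / (count(w1) * count(w2))
    and return the pairs whose score exceeds the threshold; such
    pairs can then be merged into single tokens like "new_york".
    The discount delta suppresses rare, coincidental bigrams.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    phrases = {}
    for (w1, w2), c in bigrams.items():
        score = (c - delta) / (unigrams[w1] * unigrams[w2])
        if score > threshold:
            phrases[(w1, w2)] = score
    return phrases

# Toy stream for illustration only; on such a small, repetitive corpus
# most pairs clear the threshold, whereas on billions of tokens a tuned
# threshold isolates genuine phrases such as ("new", "york").
toy = "new york is a big city and new york is busy".split() * 200
top = sorted(find_phrases(toy).items(), key=lambda kv: -kv[1])[:3]
print(top)
```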

Technical Requirements

  • Number of GPUs: None specified
  • GPU Type: None specified

Keywords

word embeddings, Skip-gram, negative sampling, phrases, distributed representations

Papers Using Similar Methods

External Resources