
Mikolov et al. 2013 - Distributed Representations of Words and Phrases and their Compositionality

These are notes for mikolov2013distributed.

This paper introduces word2vec, an approach for learning word embeddings. Compare with bag-of-words, term-document matrices, and other vector space models of semantics (turney2010frequency).

1 Training

1.1 Skip-gram model

The model learns vector representations for words. The representations are used as parameters of a distribution that gives the likelihood of the words appearing in a center word's context. For a sequence of training words \(w_1, w_2, \ldots, w_T\), the objective of the skip-gram model is to maximize the average log probability \[ \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_{t}) \]

where \(c\) is the size of the training context (the number of words on each side of the center word). The probability \(p(w_{t+j} \mid w_t)\) is defined with a softmax over the vocabulary: \[ p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\intercal} v_{w_I})}{\sum_{w \in W} \exp({v'_{w}}^{\intercal} v_{w_I})} \] where \(v_w\) and \(v'_w\) are the input and output vector representations of \(w\), and \(W\) is the vocabulary.
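
As a minimal sketch (not the paper's code), the full-softmax probability can be computed directly with numpy; the embedding matrices V_in and V_out here are toy placeholders:

    import numpy as np

    def softmax_prob(v_in, V_out, w_out):
        """p(w_out | w_in) under the full softmax.

        v_in:  input vector v_{w_I} of the center word, shape (d,)
        V_out: output embedding matrix, one row v'_w per word, shape (|W|, d)
        w_out: vocabulary index of the context word w_O
        """
        scores = V_out @ v_in            # one dot-product score per word
        scores -= scores.max()           # stabilize the exponentials
        probs = np.exp(scores) / np.exp(scores).sum()
        return probs[w_out]

    # Toy usage: vocabulary of 5 words, 3-dimensional vectors.
    rng = np.random.default_rng(0)
    V_in, V_out = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
    print(softmax_prob(V_in[2], V_out, w_out=4))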

1.2 Hierarchical Softmax

Because the softmax normalization sums over the whole vocabulary, it is expensive to compute when \(|W|\) is large. Hierarchical softmax is used instead: the vocabulary words become the leaves of a binary tree (the paper uses a Huffman tree), and \(p(w \mid w_I)\) is the product of binary decisions along the path from the root to \(w\)'s leaf, so each probability needs only about \(\log_2 |W|\) vector products instead of \(|W|\).
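
A minimal sketch of the per-word computation, assuming each word's tree path has been precomputed as (inner-node index, sign) pairs; building the Huffman tree itself is omitted:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hs_log_prob(v_in, node_vecs, path):
        """log p(w | w_I) under hierarchical softmax.

        v_in:      input vector of the center word, shape (d,)
        node_vecs: vectors for the tree's inner nodes, shape (n_inner, d)
        path:      [(node_index, sign), ...] from the root to w's leaf,
                   with sign = +1 for one child and -1 for the other
        """
        # One binary decision per inner node; the decisions along the
        # path multiply to a normalized distribution over the leaves.
        return sum(np.log(sigmoid(sign * (node_vecs[n] @ v_in)))
                   for n, sign in path)

For a balanced tree the path length is about \(\log_2 |W|\), which is where the speedup over the full softmax comes from.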

1.3 Negative Sampling

The model should be able to distinguish words drawn from the context from random noise words. Let \(v_w\) and \(v'_w\) be the input and output vector representations of \(w\). Negative sampling replaces each \(\log p(w_O \mid w_I)\) term in the objective above with \[\log \sigma ({v'_{w_O}}^{\intercal} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \log(1-\sigma({v'_{w_i}}^{\intercal} v_{w_I}))\] The first term is the log probability that the true context word \(w_O\) is classified as a context of the input word \(w_I\). The second term is the part introduced by negative sampling: \(P_n\) is the noise distribution (the paper uses the unigram distribution raised to the 3/4 power), from which \(k\) words outside the context are drawn, and the term encourages \(\sigma({v'_{w_i}}^{\intercal} v_{w_I})\) to be low for these noise words.
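
A minimal numpy sketch of this objective for a single (input word, context word) pair, assuming the \(k\) noise-word indices have already been drawn from \(P_n\):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_objective(v_in, V_out, ctx, negs):
        """Negative-sampling objective for one training pair.

        v_in:  input vector v_{w_I} of the center word, shape (d,)
        V_out: output embedding matrix, one row v'_w per word, shape (|W|, d)
        ctx:   vocabulary index of the true context word w_O
        negs:  indices of the k noise words sampled from P_n(w)
        """
        pos = np.log(sigmoid(V_out[ctx] @ v_in))               # context word up
        neg = np.log(1 - sigmoid(V_out[negs] @ v_in)).sum()    # noise words down
        return pos + neg  # maximized during training

    # Toy usage: 5-word vocabulary, 3-dim vectors, k = 2 negatives.
    rng = np.random.default_rng(0)
    V_in, V_out = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
    print(neg_sampling_objective(V_in[2], V_out, ctx=4, negs=np.array([0, 3])))

Unlike the full softmax, this touches only \(k + 1\) output vectors per training pair, which is what makes it cheap.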

2 Helpful links

Bibliography

  • [mikolov2013distributed] Mikolov, Sutskever, Chen, Corrado & Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 3111-3119 (2013).
  • [turney2010frequency] Turney & Pantel, From frequency to meaning: Vector space models of semantics, Journal of artificial intelligence research, 37, 141-188 (2010).

Created: 2021-09-14 Tue 21:44