Introduction to word2vec

dl4ee: deep learning for electrical engineers

Pen C. Li
7 min read · Jan 4, 2024

Why another article for word2vec?

  1. There are many articles discussing word2vec’s techniques. This article starts with the bigger picture of how it serves as the first step into the Natural Language Processing world.
  2. The article presents word2vec analytically for people with a basic electrical engineering background. Math equations and software programs are avoided as much as possible; only the simplest concepts of Linear Algebra and Stochastic Processes are used to explain it.
  3. In my conversations with others, the concept of Negative Sampling seemed the most difficult to comprehend. Even Goldberg and Levy published a separate article on just that topic a year after word2vec was released. This article attempts a more straightforward explanation.

I found CS224N, taught by Manning at Stanford University, to be intuitive and instructive, so I've used material from his lectures as the baseline for this article.

Distributional Semantics

In order to process a language, the first thing we need to do is create a mathematical model of words. In daily life, when we encounter a new word, we use a dictionary to find its meaning. This approach is called denotational semantics, where the denotation is the linguistic expression pointing to our cognitive representation of the actual world.

So the easiest way of creating a model is to use a discrete symbol for each word. If we assign a vector to each symbol, we can use a so-called one-hot vector, in which exactly one element is "1" and all the others are "0". For example, "motel" and "hotel" can be represented as follows.

An example of one-hot encoding of the words "motel" and "hotel"

However, this model is difficult to use. For a vocabulary of 20,000 words, each vector has 20,000 dimensions, which is expensive to compute with. It also loses information: "motel" and "hotel" are obviously related to each other, yet their vectors are orthogonal, just like every other pair of words.
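As a minimal sketch (using a made-up five-word vocabulary instead of 20,000), the dot product of any two distinct one-hot vectors is zero, so "motel" and "hotel" look no more related than any other pair:

```python
import numpy as np

# Toy vocabulary: each word gets an index, and its one-hot vector
# has a single 1 at that index (hypothetical 5-word vocabulary).
vocab = ["the", "motel", "hotel", "is", "nice"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["motel"])                     # [0. 1. 0. 0. 0.]
print(one_hot["hotel"])                     # [0. 0. 1. 0. 0.]
print(one_hot["motel"] @ one_hot["hotel"])  # 0.0 -- orthogonal, no notion of similarity
```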

WordNet is a more sophisticated model. According to its website, WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet superficially resembles a thesaurus in that it groups words together based on their meanings.

The problem with WordNet is that the words in each synset can miss nuance. For example, "proficient" is listed as a synonym for "good", but that is only correct in some contexts. The database requires human labor to create and maintain, so it is subjective and hard to keep up to date with new meanings of words. It also cannot quantify how "similar" two words are, whether they are in the same synset or not.

In modern statistical NLP, a more popular way of representing words is by their context, also known as Distributional Semantics. A word's meaning is given by the words that frequently appear close by. When a word w appears in a text, its context is the set of words that appear nearby, within a fixed-size window. For example, "banking" can be represented by its context words as follows:

Understanding “banking” using its context words nearby
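As a minimal sketch of a fixed-size window (the sentence and window size below are made up for illustration), the context of "banking" is simply the few words to its left and right:

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of context words) pairs for a fixed-size window."""
    for t, center in enumerate(tokens):
        left = tokens[max(0, t - window):t]
        right = tokens[t + 1:t + 1 + window]
        yield center, left + right

sentence = "government debt problems turning into banking crises as happened in 2009".split()
for center, context in context_windows(sentence, window=2):
    if center == "banking":
        print(center, "->", context)
# banking -> ['turning', 'into', 'crises', 'as']
```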

word2vec successfully uses this concept to represent an English word with a vector of only around 300 dimensions. The word vector is also called a word embedding because the word is embedded in this 300-dimensional space.

word2vec

word2vec starts with a large corpus ("body") of text, where every word is represented by a vector. It goes through each position t in the text, which has a center word c and context ("outside") words o, and uses the similarity of the word vectors for c and o to calculate the probability of o given c. The goal is to keep adjusting the word vectors in order to maximize the probability (likelihood) as follows:
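The equation image from the original post is not reproduced here; in the standard skip-gram formulation used in CS224N (with θ standing for all the word vectors, T the number of positions in the corpus, and m the window size), the likelihood is:

```latex
L(\theta) = \prod_{t=1}^{T} \;\prod_{\substack{-m \le j \le m \\ j \neq 0}} P\left(w_{t+j} \mid w_t; \theta\right)
```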

To explain this with Stochastic Process theory, each word is a random variable that can take any value in the vocabulary. A corpus is simply a realization, a sample of this random process of the random variables in sequence. The goal is essentially to match the probability distribution of this realization by adjusting the vectors.

If you are not familiar with Stochastic Process theory, here is a simple example. You flip a coin 10 times to produce a sequence of Heads and Tails as follows:

H T T T H H T H T T

You then try to find the probabilities of Heads and Tails by analyzing the sequence. Note that the realization may not truly represent the true probability. In this case, there are 4 H's and 6 T's. Assuming Heads and Tails have the same probability, this realization clearly does not represent the true probability. However, when the realization's sample size is big enough, there is a much higher chance it will represent the true underlying probability distribution.
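Here is a minimal sketch of the same idea in code (the long run uses a simulated fair coin):

```python
import random

# The 10-flip realization from the text: 4 Heads, 6 Tails.
flips = list("HTTTHHTHTT")
print(flips.count("H") / len(flips))   # 0.4 -- far from the true 0.5

# With a much larger realization, the empirical frequency
# approaches the true underlying probability (law of large numbers).
random.seed(0)
big_sample = [random.choice("HT") for _ in range(100_000)]
print(big_sample.count("H") / len(big_sample))   # close to 0.5
```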

Therefore the corpus of text needs to be large enough in order to properly reveal the stochastic relationship of different word vectors.

To associate the vectors with the conditional probability of context words given the center word, three mathematical tools are introduced here: the dot product, the softmax function, and the log function.

The best way to visualize the dot product is that, geometrically, it represents how close two vectors are to each other. Quoting Wikipedia, it is the product of the scalar projection of one vector in the direction of the other vector: A·B = (|A| cos θ) × |B|.

Scalar projection of vector A onto vector B
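A minimal numeric sketch (with made-up vectors): the dot product is large when two vectors point in roughly the same direction and zero when they are orthogonal:

```python
import numpy as np

a = np.array([1.0, 2.0, 0.5])
b = np.array([0.9, 1.8, 0.4])   # roughly the same direction as a
c = np.array([-2.0, 1.0, 0.0])  # orthogonal to a

print(a @ b)   # 4.7  -- large positive value: similar directions
print(a @ c)   # 0.0  -- orthogonal, "unrelated"

# Cosine of the angle between a and b: close to 1 for nearly parallel vectors.
print((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```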

Softmax is a function frequently used to turn arbitrary scores into probabilities. Besides the numerical property of producing values between 0 and 1 that sum to one, it also ties nicely into Information Theory, where the information entropy can be analyzed theoretically. That is a fascinating subject but beyond the scope of this article.
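For reference, the softmax maps a vector of arbitrary scores x_1, …, x_n to a probability distribution:

```latex
\operatorname{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}, \qquad i = 1, \ldots, n
```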

By applying the log function to turn multiplication into addition and ease the computation, we arrive at the cost function for optimization. Note that this is a "cost" function to be minimized, so a negative sign is applied to the "objective" function of the likelihood.
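The equation image is likewise not reproduced here; in the standard CS224N form (with v_c the vector of the center word, u_o the vector of a context word, and V the vocabulary), the cost function and the softmax-based probability are:

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P\left(w_{t+j} \mid w_t; \theta\right),
\qquad
P(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w \in V} \exp\left(u_w^{\top} v_c\right)}
```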

Negative Sampling

So far, word2vec's thought process is reasonably rigorous mathematically. The corpus represents a realization of a stochastic process, so a likelihood function based on probability theory is proposed, with the inner product as the similarity measure and the softmax to turn it into a probability.

However, this step can be simplified to further reduce the computational complexity without losing fidelity in the deep learning computation that follows. The biggest cost obviously comes from the denominator, where the entire vocabulary needs to be processed in reference to the center word. So instead of a stochastic process, the problem is now treated as an optimization problem, in which only a deterministic cost function is needed, as long as it delivers the original objective of establishing a relationship between the center word and the context words.

Using the original objective function as guidance, the softmax is dropped while the inner product is kept as the similarity metric between two words. A sigmoid function is introduced instead to normalize the numerical range of the similarity to between 0 and 1. The new objective now maximizes the similarity of the true center-context pair while minimizing the similarity with a much smaller set of non-context words. These non-context ("negative") words are chosen based on their occurrence probability in the corpus, the unigram distribution U(w). Because the occurrence probabilities vary widely, an exponent of ¾ is applied to compress the range, i.e., making the less frequent words more likely to be chosen than their raw frequency would suggest, and vice versa. So P(w) = U(w)^(3/4) / Z, where Z simply normalizes the sum of U(w)^(3/4) over all w to one.
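Written out, this is the standard negative-sampling objective from Mikolov et al. (σ is the sigmoid, o the true context word, and the w_k are K negative words drawn from P(w) as defined above); the quantity to maximize for each center-context pair is:

```latex
J_{\text{neg}}(o, c) = \log \sigma\!\left(u_o^{\top} v_c\right)
  + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P(w)} \left[ \log \sigma\!\left(-u_{w_k}^{\top} v_c\right) \right]
```

Only K negative words (typically a handful) are touched per training pair, instead of the entire vocabulary in the softmax denominator, which is where the computational savings come from.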

Final Words

As stated at the beginning, this article intends to provide a quick analytical interpretation of word2vec as a first step towards NLP. If you are interested in detailed explanations and software implementations, you can easily find them in the original paper or elsewhere online.

References

“Efficient Estimation of Word Representations in Vector Space”, Mikolov et al., https://arxiv.org/abs/1301.3781

Stanford CS224N: NLP with Deep Learning, Chris Manning, https://www.youtube.com/watch?v=rmVRLeJRkl4&list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ

“word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method”, Yoav Goldberg, Omer Levy, https://arxiv.org/abs/1402.3722

WordNet, https://wordnet.princeton.edu/
