Embeddings in Machine Learning

The hidden secret that powers semantic search

Xin Cheng
14 min read · Jun 1, 2023

This is the third article in the series on building LLM-powered AI applications. From the previous article, we know that in order to provide context to an LLM, we need semantic search and complex queries to find relevant context (traditional keyword search and full-text search won't be enough). To enable semantic search, we need something called an embedding/vector/vector embedding.

Intuitively, when you want to compare whether two things are similar to each other, you represent the two things (text, image, video, audio, or ideally anything that can be digitized) as two points and then measure how close they are. Think about how you would decide whether two points are close to each other in a 2-dimensional space, as in this article. At a high level, you need 2 steps:

  1. Find a way to represent things as points in a high-dimensional space while preserving their semantic meaning (e.g. the “queen” point should somehow be closer to the “king” point than to the “window” point). Generally 2 dimensions are not enough to represent complex things; typical embeddings use roughly 100–1,000 dimensions, since far more dimensions bring diminishing returns. This is the embedding/vector/vector embedding this article is about.
  2. Use an algorithm to determine the closeness/similarity of points (a minimal similarity sketch follows this list). This is the semantic search we will cover in the next article.
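
To make this concrete, here is a minimal sketch (plain NumPy, with made-up 3-dimensional vectors) of how closeness between two embedding points is typically measured with cosine similarity; real embeddings have hundreds of dimensions, but the computation is the same.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means same direction, close to 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (real ones typically have 100-1000 dimensions)
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.2])
window = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))   # high  -> semantically close
print(cosine_similarity(king, window))  # low   -> semantically far
```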

Overview

Vector indexing: when you have millions or more vectors, searching through them by brute force becomes slow and expensive. Like a traditional database index, a vector index organizes the vectors into a data structure that makes it possible to navigate through them and find the ones that are closest in terms of semantic similarity.
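
As a rough illustration of the indexing idea, here is a minimal sketch assuming the FAISS library (not named in the article, just one common choice) and random stand-in vectors: an exact flat index versus an approximate IVF index that only probes a few clusters per query.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                                  # embedding dimension
vectors = np.random.rand(100_000, d).astype("float32")   # stand-in for real embeddings

# Exact (brute-force) index: accurate, but scans every vector on each query.
flat_index = faiss.IndexFlatL2(d)
flat_index.add(vectors)

# Approximate index (IVF): partitions vectors into clusters and probes only a few.
nlist = 1024                                             # number of clusters
quantizer = faiss.IndexFlatL2(d)
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf_index.train(vectors)                                 # learn cluster centroids
ivf_index.add(vectors)
ivf_index.nprobe = 16                                    # clusters probed per query (speed/recall knob)

query = np.random.rand(1, d).astype("float32")
exact_dist, exact_ids = flat_index.search(query, 5)      # exact 5 nearest neighbors
approx_dist, approx_ids = ivf_index.search(query, 5)     # approximate, much faster at scale
```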

The article uses geometric concepts to explain what a vector is and how raw data is transformed into an embedding by an embedding model.

The article uses a picture of a phrase vector to explain vector embeddings, and lists a few embedding approaches for different data types:

For text data, models such as Word2Vec, GloVe, and BERT transform words, sentences, or paragraphs into vector embeddings.

Images can be embedded using models such as convolutional neural networks (CNNs); examples include VGG and Inception.

Audio recordings can be transformed into vectors by applying image embedding techniques to a visual representation of the audio frequencies (e.g., its spectrogram).
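
A minimal sketch of the spectrogram step, assuming the librosa library and a hypothetical local file sample.wav: the resulting mel spectrogram is a 2-D array that can be treated as an image and passed to a CNN image embedder.

```python
import librosa
import numpy as np

# Load audio (sample.wav is a hypothetical local file)
waveform, sample_rate = librosa.load("sample.wav", sr=16000)

# Mel spectrogram: a 2-D "image" of frequency content over time
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # (128, num_frames) -- feed this into a CNN image embedder
```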

Embedding applications

  1. Recommendation systems (i.e. Netflix-style if-you-like-these-movies-you’ll-like-this-one-too)
  2. All kinds of search
     • Text search (like Google Search)
     • Image search (like Google Reverse Image Search)
  3. Chatbots and question-answering systems
  4. Data preprocessing (preparing data to be fed into a machine learning model)
  5. One-shot/zero-shot learning (i.e. machine learning models that learn from almost no training data)
  6. Fraud detection/outlier detection
  7. Typo detection and all manners of “fuzzy matching”
  8. Detecting when ML models go stale (drift)

An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

The individual dimensions in these vectors typically have no inherent meaning. Instead, it’s the overall patterns of location and distance between vectors that machine learning takes advantage of.

The article also creates a 3-dimensional embedding to show embeddings in a 3D plot.

An encoder-decoder effectively compresses the data into the latent vector z (you can still call this an embedding).

An embedding is a low-dimensional vector representation that captures relationships in higher dimensional input data. Distances between embedding vectors capture similarity between different datapoints, and can capture essential concepts in the original input.

Methods to create embeddings

  1. One-hot encoding
  2. Matrix Factorisation
  3. Word2Vec
  4. GloVe

The article explains one way to create a vector embedding for an image: in a ResNet architecture trained to classify images, the layer just before the final classification layer (a latent space with, say, 500–768 hidden units) provides a dense representation packed with information about the features present in the image, and it is computationally feasible to use for tasks like visual similarity search.
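
A minimal sketch of this pattern, assuming a recent torchvision and a hypothetical local image cat.jpg: the final classification layer of a pretrained ResNet-50 is replaced with an identity, so the forward pass returns the penultimate-layer features (2048-dimensional for ResNet-50; the exact width depends on the architecture).

```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# Pretrained ResNet-50; swap the final classification layer for an identity
# so the forward pass returns the penultimate-layer features.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("cat.jpg").convert("RGB")            # hypothetical local image
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0))   # shape: (1, 2048)
```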

An embedding is a way to create features. For example, with one-hot encoding you can create a feature for each token; however, if you have 10K distinct tokens, the dimensionality will be 10K, which leads to the curse of dimensionality. We need a way to represent most of the same information one-hot encoding carries, but in far fewer dimensions.
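
A minimal sketch of the contrast, assuming PyTorch: a one-hot feature for a 10K-token vocabulary versus a dense vector looked up from a trainable embedding table.

```python
import torch

vocab_size = 10_000       # 10K distinct tokens, as in the example above
embedding_dim = 128       # dense representation, roughly 80x smaller

token_id = torch.tensor([42])   # some token's index in the vocabulary

# One-hot: a 10,000-dimensional vector with a single 1
one_hot = torch.nn.functional.one_hot(token_id, num_classes=vocab_size).float()
print(one_hot.shape)      # torch.Size([1, 10000])

# Learned embedding: a dense 128-dimensional vector looked up from a trainable table
embedding_table = torch.nn.Embedding(vocab_size, embedding_dim)
dense = embedding_table(token_id)
print(dense.shape)        # torch.Size([1, 128])
```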

Other embedding models

  1. Principal Component Analysis (PCA)
  2. SVD
  3. BERT

Embedding models

Text embeddings

Word2Vec: a simple, shallow (3-layer) neural network with two modes for learning word representations from large unlabeled data. The two training modes are called Continuous Bag Of Words (CBOW) and Skip-gram. It is good at capturing syntactic relationships and analogies between words (e.g., “king” - “man” + “woman” ≈ “queen”).
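
A minimal sketch assuming gensim and a toy corpus; real analogies such as king - man + woman ≈ queen only emerge when training on a large corpus or when loading pretrained vectors.

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real models are trained on billions of tokens
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "window", "is", "open"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimension
    window=2,         # context window size
    min_count=1,
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)

print(model.wv["king"].shape)                 # (50,)
print(model.wv.similarity("king", "queen"))   # higher than king vs. window given enough data
```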

GloVe: The main differences between GloVe and Word2Vec are that a) unlike Word2Vec, which is a prediction-based model, GloVe is a count-based method, and b) Word2Vec only considers the local properties of the dataset, whereas GloVe considers global properties in addition to local ones. It leverages word co-occurrence information across the entire corpus and computes word vectors based on the probability of a word appearing near another word. GloVe captures both semantic and syntactic relationships by considering global word-to-word co-occurrence patterns.

FastText: built on top of the Skip-gram method, but mitigates the limitation of out-of-vocabulary words (words outside of the trained vocabulary). FastText breaks words down into smaller sequences of characters called n-grams. For example, for n = 3 the 3-grams of the word dog become “<do”, “dog”, “og>”, plus a special sequence “<dog>” denoting the entire word. This method is effective because it learns representations of subwords that are shared among different words; an unseen word is dissected into its constituent n-grams, which very likely have been seen during training. The final word embedding is computed as the sum of its constituent n-gram embeddings.
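
A minimal sketch of the out-of-vocabulary behavior, assuming gensim's FastText implementation and a toy corpus: a word never seen during training still receives an embedding composed from its character n-grams.

```python
from gensim.models import FastText

sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]

model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# "doggo" never appears in the corpus, but it shares character n-grams with "dog",
# so FastText can still compose an embedding for it.
print("doggo" in model.wv.key_to_index)  # False: not in the trained vocabulary
print(model.wv["doggo"].shape)           # (50,): still gets a vector from its n-grams
```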

Embeddings from Language Models (ELMo): incorporates LSTMs in order to capture more contextual information. However, it was not designed for transfer learning and needs to be trained for specific tasks using a separate model.

Bidirectional Encoder Representations from Transformers (BERT): like ELMo, BERT can generate different embeddings for the same word depending on its context, that is, how it is used within a sentence.

A practical implication of this difference is that we can use Word2Vec and GloVe vectors trained on a large corpus directly for downstream tasks. All we need is the vectors for the words; there is no need for the model that was used to train these vectors.

However, in the case of ELMo and BERT, since they are context-dependent, we need the model that was used to train the vectors even after training, because the model generates the vector for a word based on its context. We can still use context-independent vectors for a word if we choose to (just take the raw trained vector from the trained model), but that would defeat the very purpose/advantage of these models.
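
A minimal sketch of this context dependence, assuming Hugging Face transformers and the bert-base-uncased checkpoint (my choice, not the article's): the same word gets noticeably different vectors in two different sentences, which a static Word2Vec or GloVe lookup cannot do.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("i deposited cash at the bank", "bank")
v2 = word_vector("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0: context changed the vector
```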

Word embeddings: a way of representing individual words as high-dimensional vectors.

Sentence embeddings: a way of representing entire sentences as vectors.

Applications in NLP tasks (high-level intuition: if the task needs contextual information, as in machine translation, simple word embeddings won’t work well; use BERT or sentence embeddings instead)

Text classification: Word embeddings can be used to represent the words in a text document and then fed into a classification model, such as logistic regression or a support vector machine (SVM). The resulting model can then be used to classify new documents based on their content. Sentence embeddings can also be used in text classification by representing entire sentences as high-dimensional vectors and then feeding them into a classifier.
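
A minimal sketch of embeddings-as-features for classification, assuming sentence-transformers and scikit-learn with a tiny made-up labeled set (the model name and labels are illustrative only):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The battery dies within an hour",
    "Worst purchase I have ever made",
    "Absolutely love this phone",
    "Great value and fast shipping",
]
labels = [0, 0, 1, 1]  # 0 = negative, 1 = positive (toy labels)

features = encoder.encode(texts)                 # shape: (4, 384)
classifier = LogisticRegression().fit(features, labels)

print(classifier.predict(encoder.encode(["I really enjoy using it"])))  # likely [1]
```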

Named entity recognition: Word embeddings can be used to identify named entities in text, such as people, organizations, and locations. This can be done by training a named entity recognition model on a corpus of text that has been annotated with entity labels.

Sentiment analysis: Sentence embeddings can be used to analyze the sentiment of a piece of text, such as whether it is positive or negative. This can be done by training a sentiment analysis model on a corpus of text that has been labeled with sentiment scores.

Word embeddings

The article uses Spark NLP to compare GloVe, ELMo, and BERT on a classification task (classifying whether a tweet is talking about a disaster): you can see that GloVe embeddings lacked context; GloVe was unable to differentiate “Tsunami” the restaurant from the actual disaster.

Sentence embeddings

The article compares TF-IDF embeddings and sentence-transformer embeddings to explain why we need the latter: in a plot, data points with the same category stick closer together under sentence-transformer embeddings (which is the goal of embeddings: put related things closer together and unrelated things farther apart).

A simplified explanation of why Transformer-based models perform much better than RNNs on various NLP tasks (e.g. answering questions, writing articles): for many tasks, the latter parts of these models are the same as those in RNNs, often a couple of feedforward NNs that output model predictions; it is the input to these layers that changed. The dense embeddings created by transformer models are so much richer in information that we get massive performance benefits despite using the same final layers.

Sentence embeddings use cases:

  • Semantic textual similarity (STS) — comparison of sentence pairs. We may want to identify patterns in datasets, but this is most often used for benchmarking.
  • Semantic search — information retrieval (IR) using semantic meaning. Given a set of sentences, we can search using a ‘query’ sentence and identify the most similar records. Enables search to be performed on concepts (rather than specific words).
  • Clustering — we can cluster our sentences, useful for topic modeling.

There are two ways to get sentence embeddings:

  1. The first solution is to build a sentence embedding from word embeddings. Word embeddings have been around for a long time, from Word2Vec and GloVe to BERT, and those models generate one embedding per word. Take BERT as an example: after a sentence is fed to BERT, the most common way to generate a sentence embedding is to average all the word-level embeddings or to take the [CLS] token (see the pooling sketch after this list). However, this is computationally expensive, since it goes through each word one by one, which makes it hard to use in a demanding production environment.
  2. The second solution is to get a sentence embedding directly: train a model that produces an embedding for the whole sentence (for example, by training it on a text classification task) and use that embedding.
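
A minimal sketch of the first approach, assuming Hugging Face transformers and bert-base-uncased: word-level (token) embeddings are mean-pooled into one sentence vector, using the attention mask to ignore padding. SBERT-style models bake this pooling in and are additionally trained so that the pooled vectors compare well.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["I don't like crowded places", "I like one of the world's busiest cities"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state        # (batch, seq_len, 768)

# Mean pooling: average the token vectors, ignoring padding via the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()           # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embeddings.shape)                                # torch.Size([2, 768])
```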

There are several widely-used models listed below. Currently, the most popular one is SBERT.

  1. Doc2Vec
  2. SBERT
  3. InferSent
  4. Universal Sentence Encoder

The article gives an example where word embeddings would fall short: we come across a sentence like ‘I don’t like crowded places’, and a few sentences later, we read ‘However, I like one of the world’s busiest cities, New York’. How can we make the machine draw the inference between ‘crowded places’ and ‘busy cities’?

Doc2Vec: introduced in 2014, adds on to the Word2Vec model by introducing another ‘paragraph vector’.

SentenceBERT: currently the leader of the pack, SentenceBERT was introduced in 2019 and immediately took pole position for sentence embeddings. At the heart of this BERT-based model are 4 key concepts: Attention, Transformers, BERT, and Siamese networks.

InferSent: a supervised sentence embedding technique presented by Facebook AI Research in 2018

Universal Sentence Encoder: One of the most well-performing sentence embedding techniques right now, released by Google

SentenceBERT vs Universal Sentence Encoder

  • USE and SBERT both use transformer networks. For USE, it is sadly not clear how many layers are used (most technical details are not provided). USE was trained from scratch (as far as one can tell from the paper), while SBERT uses the BERT / RoBERTa pre-trained weights and just fine-tunes them to produce sentence embeddings.
  • The main difference is in the pre-training. USE uses a wide variety of datasets (exact details not provided), specifically targeted at generating sentence embeddings. BERT was pre-trained on a book corpus and on Wikipedia to produce a language model (see the BERT paper). SBERT then fine-tunes BERT to produce sensible sentence embeddings.
  • USE is in TensorFlow, and tuning it for your use case is not straightforward (the source code is not available; you only get the compiled model from tensorflow-hub). SBERT is based on PyTorch, and the goal of that repository is to make fine-tuning for your use case as simple as possible.

The example uses popular sentence embedding models: sentence-transformers/all-MiniLM-L6-v2 (384-dimensional) and sentence-transformers/all-mpnet-base-v2 (768-dimensional).
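
A minimal usage sketch of those two models, assuming the sentence-transformers package: it shows the different embedding sizes (384 vs. 768) and a similarity score between two sentences that share no keywords.

```python
from sentence_transformers import SentenceTransformer, util

mini = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

pair = ["I don't like crowded places", "I like one of the world's busiest cities"]

emb_mini = mini.encode(pair)      # shape: (2, 384)
emb_mpnet = mpnet.encode(pair)    # shape: (2, 768)

print(emb_mini.shape, emb_mpnet.shape)
print(util.cos_sim(emb_mini[0], emb_mini[1]))   # semantic similarity despite no shared keywords
```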

Image embeddings

The steps to find similar bean images with embeddings:

  1. Extract the embeddings from the candidate images (candidate_subset) and store them in a matrix (using a Vision Transformer model fine-tuned on the beans dataset).
  2. Take a query image and extract its embeddings.
  3. Iterate over the embedding matrix (computed in step 1) and compute the similarity score between the query embedding and each candidate embedding. We usually maintain a dictionary-like mapping between some identifier of the candidate image and its similarity score (what if there are lots of candidate images?).
  4. Sort the mapping structure with respect to the similarity scores and return the underlying identifiers; we use these identifiers to fetch the candidate samples (see the sketch after this list).
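
A minimal sketch of steps 3 and 4, assuming the candidate and query embeddings have already been extracted in steps 1 and 2 (the shapes and identifiers below are stand-ins):

```python
import numpy as np

# Stand-ins for what steps 1-2 would produce
candidate_embeddings = np.random.rand(5000, 768)   # one row per candidate image
candidate_ids = [f"img_{i}" for i in range(5000)]  # identifiers for each candidate
query_embedding = np.random.rand(768)

# Step 3: cosine similarity between the query and every candidate
norms = np.linalg.norm(candidate_embeddings, axis=1) * np.linalg.norm(query_embedding)
scores = candidate_embeddings @ query_embedding / norms

# Step 4: sort by similarity and return the top identifiers
top_k = 5
best = np.argsort(scores)[::-1][:top_k]
print([(candidate_ids[i], float(scores[i])) for i in best])
```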

Others

There are also embedding models for video (sample frames, then apply OpenAI CLIP) and audio (wav2vec, PANNs), but they are not as popular as text embeddings. There is also a general framework, data2vec, for transformers across different modalities, e.g. speech, vision, and language.

Embedding as a service

The OpenAI embedding service has the following model families:

  • Text similarity: captures semantic similarity between pieces of text. text-similarity-{ada, babbage, curie, davinci}-001. Use cases: clustering, regression, anomaly detection, visualization.
  • Text search: semantic information retrieval over documents. text-search-{ada, babbage, curie, davinci}-{query, doc}-001. Use cases: search, context relevance, information retrieval.
  • Code search: find relevant code with a query in natural language. code-search-{ada, babbage}-{code, text}-001. Use cases: code search and relevance.

In a December 2021 benchmark, OpenAI GPT-3 embedding models performed worse, and cost much more, than sentence-transformers models.

Use cases for OpenAI embeddings (released in December 2022)

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

Improvements

  1. One /embeddings endpoint merges the five separate models:
     • text-similarity
     • text-search-query
     • text-search-doc
     • code-search-text
     • code-search-code

  2. Features

Longer context: The context length of the new model is increased by a factor of four, from 2048 to 8192, making it more convenient to work with long documents.

Smaller embedding size. The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making the new embeddings more cost-effective in working with vector databases.

Reduced price. OpenAI reduced the price of the new embedding models by 90% compared to old models of the same size. The new model achieves better or similar performance to the old Davinci models at a 99.8% lower price.

In December 2022, OpenAI updated its embedding model to text-embedding-ada-002 (a minimal usage sketch follows the list below). The new model offers:

  • 90%-99.8% lower price
  • 1/8th the embedding dimension size, which reduces vector database costs
  • Endpoint unification for ease of use
  • State-of-the-Art performance for text search, code search, and sentence similarity
  • Context window increased from 2048 to 8192.
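
A minimal usage sketch, assuming the openai Python client (v1-style) and an OPENAI_API_KEY set in the environment; the model name comes from the text above.

```python
from openai import OpenAI  # openai >= 1.0; assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["I don't like crowded places", "I like one of the world's busiest cities"],
)

vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 dimensions each
```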

The article clusters the “Fine Food Reviews” dataset. The steps are:

  1. Generate embeddings using the OpenAI endpoint.
  2. Use KMeans to cluster the embeddings (see the sketch after this list).
  3. For reviews in the same cluster, retrieve the review text (the combined Summary & Text column) and send it to the OpenAI endpoint to summarize the theme (what is common) of the reviews in that cluster.
  4. The steps above do not use a vector database. The article then stores the embeddings in Pinecone and queries them there.
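
A minimal sketch of step 2, assuming scikit-learn and precomputed review embeddings (random stand-ins below); the number of clusters is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for embeddings returned by the OpenAI endpoint in step 1
review_embeddings = np.random.rand(1000, 1536)   # 1000 reviews, ada-002 dimensionality

n_clusters = 4                                   # hypothetical choice
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(review_embeddings)

# Reviews sharing a label can then be sampled and sent to the LLM for theme summarization (step 3)
for c in range(n_clusters):
    print(f"cluster {c}: {np.sum(cluster_labels == c)} reviews")
```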

Hugging Face hosts Massive Text Embedding Benchmark (MTEB) Leaderboard (MTEB Leaderboard — a Hugging Face Space by mteb) for measuring the performance of text embedding models on diverse embedding tasks.

Common pitfalls when using embedding models:

  1. Using pre-trained models without task-specific fine-tuning: using the direct vector representations of a model that has only been pre-trained will not produce a useful embedding representation for every task. Search ranking is an example of such a task; see details in How not to use BERT for search ranking.
  2. Using fine-tuned single-vector embedding models out-of-domain: when we take a single-vector representation model fine-tuned on MS MARCO labels, it does not beat BM25 in a different domain with slightly different types of documents and questions. A multi-vector representation model for search, like ColBERT, generalizes much better than single-vector representations.
  3. Lack of understanding of vector search tradeoffs: do we need to introduce approximate nearest neighbor search (ANNS) instead of exact nearest neighbor search? As in many aspects of life, this is a question of tradeoffs around query serving: latency Service Level Agreement (SLA), query throughput, and accuracy. Exact nearest-neighbor search brute-force computes the distance between the query and all eligible documents, which generally carries a high cost.

Word2Vec efficiently learns word embeddings by training a shallow neural network to predict the context of a word in a vocabulary, where the context is defined by a sliding window of a given width; the key idea is to preserve the semantics of the words.

Knowledge graph embedding algorithms have become a powerful tool for representing and reasoning about complex structured data. These algorithms learn low-dimensional embeddings of entities and relations in a knowledge graph, allowing for efficient computation of similarity and inference tasks.
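
As a small illustration of the idea behind one popular knowledge graph embedding algorithm (TransE), here is a sketch in plain NumPy: a triple (head, relation, tail) is considered plausible when head + relation lands close to tail. The entities, relation, and vectors below are made up; in practice, libraries such as PyKEEN learn them from a real knowledge graph.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Toy entity and relation embeddings; in practice these are learned from the graph
entities = {name: rng.normal(size=dim) for name in ["paris", "france", "tokyo", "japan"]}
relations = {"capital_of": rng.normal(size=dim)}

def transe_score(head: str, relation: str, tail: str) -> float:
    """TransE plausibility: smaller ||h + r - t|| means a more plausible triple."""
    h, r, t = entities[head], relations[relation], entities[tail]
    return float(np.linalg.norm(h + r - t))

print(transe_score("paris", "capital_of", "france"))  # would be low after training
print(transe_score("paris", "capital_of", "japan"))   # would be higher after training
```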

Appendix

https://learn.deeplearning.ai/google-cloud-vertex-ai
