Embeddings in Machine Learning

The hidden secret that powers semantic search

Xin Cheng
14 min read · Jun 1, 2023

This is the third article in the series on building LLM-powered AI applications. From the previous article, we know that in order to provide context to an LLM, we need semantic search and complex queries to find relevant context (traditional keyword search and full-text search won't be enough). To enable semantic search, we need something called an embedding/vector/vector embedding.

Intuitively, when you want to compare whether two things are similar to each other, you represent the two things (text, image, video, audio, or ideally anything that can be digitized) as two points and then measure how close they are. Think about how you would decide whether two points are close to each other in a 2-dimensional space, as in this article. At a high level, you need 2 steps:

  1. Find a way to represent things as points in a high-dimensional space while preserving their semantic meaning (e.g. the “queen” point should somehow be closer to the “king” point than to the “window” point). Generally 2 dimensions are not enough to represent complex things; typical embeddings use roughly 100–1,000 dimensions, since far more dimensions bring diminishing returns. This is the embedding/vector/vector embedding this article is about.
  2. Use an algorithm to determine the closeness/similarity of points (a minimal similarity sketch follows this list). This is the semantic search we will cover in the next article.
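
To make this concrete, here is a minimal sketch (plain NumPy, with made-up 3-dimensional vectors) of how closeness between two embedding points is typically measured with cosine similarity; real embeddings have hundreds of dimensions, but the computation is the same.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means same direction, close to 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (real ones typically have 100-1000 dimensions)
king = np.array([0.9, 0.8, 0.1])
queen = np.array([0.85, 0.75, 0.2])
window = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(king, queen))   # high  -> semantically close
print(cosine_similarity(king, window))  # low   -> semantically far
```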

Overview

Vector indexing: when you have millions or more vectors, searching through them by brute force becomes slow and expensive. Like a traditional database index, a vector index organizes the vectors into a data structure that makes it possible to navigate through them and find the ones that are closest in terms of semantic similarity.
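
As a rough illustration of the indexing idea, here is a minimal sketch assuming the FAISS library (not named in the article, just one common choice) and random stand-in vectors: an exact flat index versus an approximate IVF index that only probes a few clusters per query.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                                  # embedding dimension
vectors = np.random.rand(100_000, d).astype("float32")   # stand-in for real embeddings

# Exact (brute-force) index: accurate, but scans every vector on each query.
flat_index = faiss.IndexFlatL2(d)
flat_index.add(vectors)

# Approximate index (IVF): partitions vectors into clusters and probes only a few.
nlist = 1024                                             # number of clusters
quantizer = faiss.IndexFlatL2(d)
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf_index.train(vectors)                                 # learn cluster centroids
ivf_index.add(vectors)
ivf_index.nprobe = 16                                    # clusters probed per query (speed/recall knob)

query = np.random.rand(1, d).astype("float32")
exact_dist, exact_ids = flat_index.search(query, 5)      # exact 5 nearest neighbors
approx_dist, approx_ids = ivf_index.search(query, 5)     # approximate, much faster at scale
```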

The article uses geometric concepts to explain what a vector is and how raw data is transformed into an embedding by an embedding model.

The article uses a picture of a phrase vector to explain vector embeddings, and lists a few embedding approaches for different data types:

For text data, models such as Word2Vec, GloVe, and BERT transform words, sentences, or paragraphs into vector embeddings.

Images can be embedded using models such as convolutional neural networks (CNNs); examples include VGG and Inception.

Audio recordings can be transformed into vectors by applying image embedding techniques to a visual representation of the audio frequencies (e.g., its spectrogram).
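
A minimal sketch of the spectrogram step, assuming the librosa library and a hypothetical local file sample.wav: the resulting mel spectrogram is a 2-D array that can be treated as an image and passed to a CNN image embedder.

```python
import librosa
import numpy as np

# Load audio (sample.wav is a hypothetical local file)
waveform, sample_rate = librosa.load("sample.wav", sr=16000)

# Mel spectrogram: a 2-D "image" of frequency content over time
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # (128, num_frames) -- feed this into a CNN image embedder
```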

Embedding applications

  1. Recommendation systems (i.e. Netflix-style if-you-like-these-movies-you’ll-like-this-one-too)
  2. All kinds of search
     • Text search (like Google Search)
     • Image search (like Google Reverse Image Search)
  3. Chatbots and question-answering systems
  4. Data preprocessing (preparing data to be fed into a machine learning model)
  5. One-shot/zero-shot learning (i.e. machine learning models that learn from almost no training data)
  6. Fraud detection/outlier detection
  7. Typo detection and all manners of “fuzzy matching”
  8. Detecting when ML models go stale (drift)

An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

The individual dimensions in these vectors typically have no inherent meaning. Instead, it’s the overall patterns of location and distance between vectors that machine learning takes advantage of.

The article also creates a 3-dimensional embedding to show embeddings in a 3D plot.

An encoder-decoder effectively compresses the data into the latent vector z (you can still call this an embedding).

An embedding is a low-dimensional vector representation that captures relationships in higher dimensional input data. Distances between embedding vectors capture similarity between different datapoints, and can capture essential concepts in the original input.

Methods to create embeddings

  1. One-hot encoding
  2. Matrix Factorisation
  3. Word2Vec
  4. GloVe

The article explains one way to create a vector embedding for an image: in a ResNet architecture trained to classify images, the layer just before the final classification layer (a latent space with, say, 500–768 hidden units) provides a dense representation packed with information about the features present in the image, and it is computationally feasible to use for tasks like visual similarity search.
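
A minimal sketch of this pattern, assuming a recent torchvision and a hypothetical local image cat.jpg: the final classification layer of a pretrained ResNet-50 is replaced with an identity, so the forward pass returns the penultimate-layer features (2048-dimensional for ResNet-50; the exact width depends on the architecture).

```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# Pretrained ResNet-50; swap the final classification layer for an identity
# so the forward pass returns the penultimate-layer features.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("cat.jpg").convert("RGB")            # hypothetical local image
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0))   # shape: (1, 2048)
```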

An embedding is a way to create features. For example, with one-hot encoding you can create a feature for each token; however, if you have 10K distinct tokens, the dimensionality will be 10K, which leads to the curse of dimensionality. We need a way to represent most of the same information one-hot encoding carries, but in far fewer dimensions.
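
A minimal sketch of the contrast, assuming PyTorch: a one-hot feature for a 10K-token vocabulary versus a dense vector looked up from a trainable embedding table.

```python
import torch

vocab_size = 10_000       # 10K distinct tokens, as in the example above
embedding_dim = 128       # dense representation, roughly 80x smaller

token_id = torch.tensor([42])   # some token's index in the vocabulary

# One-hot: a 10,000-dimensional vector with a single 1
one_hot = torch.nn.functional.one_hot(token_id, num_classes=vocab_size).float()
print(one_hot.shape)      # torch.Size([1, 10000])

# Learned embedding: a dense 128-dimensional vector looked up from a trainable table
embedding_table = torch.nn.Embedding(vocab_size, embedding_dim)
dense = embedding_table(token_id)
print(dense.shape)        # torch.Size([1, 128])
```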

Other embedding models

  1. Principal Component Analysis (PCA)
  2. SVD
  3. BERT

Embedding models

Text embeddings

Word2Vec: a simple, shallow (3-layer) neural network with two modes for learning word representations from large unlabeled data. The two training modes are called Continuous Bag Of Words (CBOW) and Skip-gram. It is good at capturing syntactic relationships and analogies between words (e.g., “king” - “man” + “woman” ≈ “queen”).
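
A minimal sketch assuming gensim and a toy corpus; real analogies such as king - man + woman ≈ queen only emerge when training on a large corpus or when loading pretrained vectors.

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real models are trained on billions of tokens
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "window", "is", "open"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimension
    window=2,         # context window size
    min_count=1,
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)

print(model.wv["king"].shape)                 # (50,)
print(model.wv.similarity("king", "queen"))   # higher than king vs. window given enough data
```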

GloVe: The main differences between GloVe and Word2Vec are that a) unlike Word2Vec, which is a prediction-based model, GloVe is a count-based method, and b) Word2Vec only considers the local properties of the dataset, whereas GloVe considers global properties in addition to local ones. It leverages word co-occurrence information across the entire corpus and computes word vectors based on the probability of a word appearing near another word. GloVe captures both semantic and syntactic relationships by considering global word-to-word co-occurrence patterns.

FastText: built on top of the Skip-gram method, but mitigates the limitation of out-of-vocabulary words (words outside of the trained vocabulary). FastText breaks words down into smaller sequences of characters called n-grams. For example, for n = 3 the 3-grams of the word dog become “<do”, “dog”, “og>”, plus a special sequence “<dog>” denoting the entire word. This method is effective because it learns representations of subwords that are shared among different words; an unseen word is dissected into its constituent n-grams, which very likely have been seen during training. The final word embedding is computed as the sum of its constituent n-gram embeddings.
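
A minimal sketch of the out-of-vocabulary behavior, assuming gensim's FastText implementation and a toy corpus: a word never seen during training still receives an embedding composed from its character n-grams.

```python
from gensim.models import FastText

sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]

model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# "doggo" never appears in the corpus, but it shares character n-grams with "dog",
# so FastText can still compose an embedding for it.
print("doggo" in model.wv.key_to_index)  # False: not in the trained vocabulary
print(model.wv["doggo"].shape)           # (50,): still gets a vector from its n-grams
```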

Embeddings from Language Models (ELMo): incorporates LSTMs in order to capture more contextual information. However, it was not designed for transfer learning and needs to be trained for specific tasks using a separate model.

Bidirectional Encoder Representations from Transformers (BERT): like ELMo, BERT can generate different embeddings for the same word depending on its context, that is, how it is used within a sentence.

A practical implication of this difference is that we can use Word2Vec and GloVe vectors trained on a large corpus directly for downstream tasks. All we need is the vectors for the words; there is no need for the model that was used to train these vectors.

However, in the case of ELMo and BERT, since they are context-dependent, we need the model that was used to train the vectors even after training, because the model generates the vector for a word based on its context. We can still use context-independent vectors for a word if we choose to (just take the raw trained vector from the trained model), but that would defeat the very purpose/advantage of these models.
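
A minimal sketch of this context dependence, assuming Hugging Face transformers and the bert-base-uncased checkpoint (my choice, not the article's): the same word gets noticeably different vectors in two different sentences, which a static Word2Vec or GloVe lookup cannot do.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("i deposited cash at the bank", "bank")
v2 = word_vector("we sat on the bank of the river", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0: context changed the vector
```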

Word embeddings: a way of representing individual words as high-dimensional vectors.

Sentence embeddings: a way of representing entire sentences as vectors.

Applications in NLP tasks (high-level intuition: if the task needs contextual information, as in machine translation, simple word embeddings won’t work well; use BERT or sentence embeddings instead)

Text classification: Word embeddings can be used to represent the words in a text document and then fed into a classification model, such as logistic regression or a support vector machine (SVM). The resulting model can then be used to classify new documents based on their content. Sentence embeddings can also be used in text classification by representing entire sentences as high-dimensional vectors and then feeding them into a classifier.
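
A minimal sketch of embeddings-as-features for classification, assuming sentence-transformers and scikit-learn with a tiny made-up labeled set (the model name and labels are illustrative only):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The battery dies within an hour",
    "Worst purchase I have ever made",
    "Absolutely love this phone",
    "Great value and fast shipping",
]
labels = [0, 0, 1, 1]  # 0 = negative, 1 = positive (toy labels)

features = encoder.encode(texts)                 # shape: (4, 384)
classifier = LogisticRegression().fit(features, labels)

print(classifier.predict(encoder.encode(["I really enjoy using it"])))  # likely [1]
```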

Named entity recognition: Word embeddings can be used to identify named entities in text, such as people, organizations, and locations. This can be done by training a named entity recognition model on a corpus of text that has been annotated with entity labels.

Sentiment analysis: Sentence embeddings can be used to analyze the sentiment of a piece of text, such as whether it is positive or negative. This can be done by training a sentiment analysis model on a corpus of text that has been labeled with sentiment scores.

Word embeddings

The article uses Spark NLP to compare GloVe, ELMo, and BERT on a classification task (classifying whether a tweet is talking about a disaster): you can see that GloVe embeddings lacked context; GloVe was unable to differentiate “Tsunami” the restaurant from the actual disaster.

Sentence embeddings

The article compares TF-IDF embeddings and sentence-transformer embeddings to explain why we need the latter: in a plot, data points with the same category stick closer together under sentence-transformer embeddings (which is the goal of embeddings: put related things closer together and unrelated things farther apart).

A simplified explanation of why Transformer-based models perform much better than RNNs on various NLP tasks (e.g. answering questions, writing articles): for many tasks, the latter parts of these models are the same as those in RNNs, often a couple of feedforward NNs that output model predictions; it is the input to these layers that changed. The dense embeddings created by transformer models are so much richer in information that we get massive performance benefits despite using the same final layers.

Sentence embeddings use cases:

  • Semantic textual similarity (STS) — comparison of sentence pairs. We may want to identify patterns in datasets, but this is most often used for benchmarking.
  • Semantic search — information retrieval (IR) using semantic meaning. Given a set of sentences, we can search using a ‘query’ sentence and identify the most similar records. Enables search to be performed on concepts (rather than specific words).
  • Clustering — we can cluster our sentences, useful for topic modeling.

There are two ways to get sentence embeddings:

  1. The first solution is to build a sentence embedding from word embeddings. Word embeddings have been around for a long time, from Word2Vec and GloVe to BERT, and those models generate one embedding per word. Take BERT as an example: after a sentence is fed to BERT, the most common way to generate a sentence embedding is to average all the word-level embeddings or to take the [CLS] token (see the pooling sketch after this list). However, this is computationally expensive, since it goes through each word one by one, which makes it hard to use in a demanding production environment.
  2. The second solution is to get a sentence embedding directly: train a model that produces an embedding for the whole sentence (for example, by training it on a text classification task) and use that embedding.
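
A minimal sketch of the first approach, assuming Hugging Face transformers and bert-base-uncased: word-level (token) embeddings are mean-pooled into one sentence vector, using the attention mask to ignore padding. SBERT-style models bake this pooling in and are additionally trained so that the pooled vectors compare well.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["I don't like crowded places", "I like one of the world's busiest cities"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state        # (batch, seq_len, 768)

# Mean pooling: average the token vectors, ignoring padding via the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()           # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embeddings.shape)                                # torch.Size([2, 768])
```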

There are several widely-used models listed below. Currently, the most popular one is SBERT.

  1. Doc2Vec
  2. SBERT
  3. InferSent
  4. Universal Sentence Encoder

The article gives an example where word embeddings would fall short: we come across a sentence like ‘I don’t like crowded places’, and a few sentences later, we read ‘However, I like one of the world’s busiest cities, New York’. How can we make the machine draw the inference between ‘crowded places’ and ‘busy cities’?

Doc2Vec: introduced in 2014, adds on to the Word2Vec model by introducing another ‘paragraph vector’.

SentenceBERT: currently the leader of the pack, SentenceBERT was introduced in 2019 and immediately took pole position for sentence embeddings. At the heart of this BERT-based model are 4 key concepts: Attention, Transformers, BERT, and Siamese networks.

InferSent: a supervised sentence embedding technique presented by Facebook AI Research in 2018

Universal Sentence Encoder: One of the most well-performing sentence embedding techniques right now, released by Google

SentenceBERT vs Universal Sentence Encoder

  • USE and SBERT both use transformer networks. For USE, it is sadly not clear how many layers are used (most technical details are not provided). USE was trained from scratch (as far as one can tell from the paper), while SBERT uses the BERT / RoBERTa pre-trained weights and just fine-tunes them to produce sentence embeddings.
  • The main difference is in the pre-training. USE uses a wide variety of datasets (exact details not provided), specifically targeted at generating sentence embeddings. BERT was pre-trained on a book corpus and on Wikipedia to produce a language model (see the BERT paper). SBERT then fine-tunes BERT to produce sensible sentence embeddings.
  • USE is in TensorFlow, and tuning it for your use case is not straightforward (the source code is not available; you only get the compiled model from tensorflow-hub). SBERT is based on PyTorch, and the goal of that repository is to make fine-tuning for your use case as simple as possible.

The example uses popular sentence embedding models: sentence-transformers/all-MiniLM-L6-v2 (384-dimensional) and sentence-transformers/all-mpnet-base-v2 (768-dimensional).
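
A minimal usage sketch of those two models, assuming the sentence-transformers package: it shows the different embedding sizes (384 vs. 768) and a similarity score between two sentences that share no keywords.

```python
from sentence_transformers import SentenceTransformer, util

mini = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

pair = ["I don't like crowded places", "I like one of the world's busiest cities"]

emb_mini = mini.encode(pair)      # shape: (2, 384)
emb_mpnet = mpnet.encode(pair)    # shape: (2, 768)

print(emb_mini.shape, emb_mpnet.shape)
print(util.cos_sim(emb_mini[0], emb_mini[1]))   # semantic similarity despite no shared keywords
```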

Image embeddings

The steps to find similar bean images with embeddings:

  1. Extract the embeddings from the candidate images (candidate_subset) and store them in a matrix (using a Vision Transformer model fine-tuned on the beans dataset).
  2. Take a query image and extract its embeddings.
  3. Iterate over the embedding matrix (computed in step 1) and compute the similarity score between the query embedding and each candidate embedding. We usually maintain a dictionary-like mapping between some identifier of the candidate image and its similarity score (what if there are lots of candidate images?).
  4. Sort the mapping structure with respect to the similarity scores and return the underlying identifiers; we use these identifiers to fetch the candidate samples (see the sketch after this list).
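
A minimal sketch of steps 3 and 4, assuming the candidate and query embeddings have already been extracted in steps 1 and 2 (the shapes and identifiers below are stand-ins):

```python
import numpy as np

# Stand-ins for what steps 1-2 would produce
candidate_embeddings = np.random.rand(5000, 768)   # one row per candidate image
candidate_ids = [f"img_{i}" for i in range(5000)]  # identifiers for each candidate
query_embedding = np.random.rand(768)

# Step 3: cosine similarity between the query and every candidate
norms = np.linalg.norm(candidate_embeddings, axis=1) * np.linalg.norm(query_embedding)
scores = candidate_embeddings @ query_embedding / norms

# Step 4: sort by similarity and return the top identifiers
top_k = 5
best = np.argsort(scores)[::-1][:top_k]
print([(candidate_ids[i], float(scores[i])) for i in best])
```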

Others

There are also embedding models for video (sample frames, then apply OpenAI CLIP) and audio (wav2vec, PANNs), but they are not as popular as text embeddings. There is also a general framework, data2vec, for transformers across different modalities, e.g. speech, vision, and language.

Embedding as a service

The OpenAI embedding service has the following model families:

  • Text similarity: captures semantic similarity between pieces of text. text-similarity-{ada, babbage, curie, davinci}-001. Use cases: clustering, regression, anomaly detection, visualization.
  • Text search: semantic information retrieval over documents. text-search-{ada, babbage, curie, davinci}-{query, doc}-001. Use cases: search, context relevance, information retrieval.
  • Code search: find relevant code with a query in natural language. code-search-{ada, babbage}-{code, text}-001. Use cases: code search and relevance.

In a December 2021 benchmark, OpenAI GPT-3 embedding models performed worse, and cost much more, than sentence-transformers models.

Use cases for OpenAI embeddings (released in December 2022)

  • Search (where results are ranked by relevance to a query string)
  • Clustering (where text strings are grouped by similarity)
  • Recommendations (where items with related text strings are recommended)
  • Anomaly detection (where outliers with little relatedness are identified)
  • Diversity measurement (where similarity distributions are analyzed)
  • Classification (where text strings are classified by their most similar label)

Improvements

  1. One /embeddings endpoint merges the five separate models:
     • text-similarity
     • text-search-query
     • text-search-doc
     • code-search-text
     • code-search-code

  2. Features

Longer context: The context length of the new model is increased by a factor of four, from 2048 to 8192, making it more convenient to work with long documents.

Smaller embedding size. The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making the new embeddings more cost-effective in working with vector databases.

Reduced price. OpenAI reduced the price of the new embedding models by 90% compared to old models of the same size. The new model achieves better or similar performance to the old Davinci models at a 99.8% lower price.

In December 2022, OpenAI updated its embedding model to text-embedding-ada-002 (a minimal usage sketch follows the list below). The new model offers:

  • 90%-99.8% lower price
  • 1/8th the embedding dimension size, which reduces vector database costs
  • Endpoint unification for ease of use
  • State-of-the-Art performance for text search, code search, and sentence similarity
  • Context window increased from 2048 to 8192.
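
A minimal usage sketch, assuming the openai Python client (v1-style) and an OPENAI_API_KEY set in the environment; the model name comes from the text above.

```python
from openai import OpenAI  # openai >= 1.0; assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["I don't like crowded places", "I like one of the world's busiest cities"],
)

vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 dimensions each
```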

The article clusters the “Fine Food Reviews” dataset. The steps are:

  1. Generate embeddings using the OpenAI endpoint.
  2. Use KMeans to cluster the embeddings (see the sketch after this list).
  3. For reviews in the same cluster, retrieve the review text (the combined Summary & Text column) and send it to the OpenAI endpoint to summarize the theme (what is common) of the reviews in that cluster.
  4. The steps above do not use a vector database. The article then stores the embeddings in Pinecone and queries them there.
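
A minimal sketch of step 2, assuming scikit-learn and precomputed review embeddings (random stand-ins below); the number of clusters is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for embeddings returned by the OpenAI endpoint in step 1
review_embeddings = np.random.rand(1000, 1536)   # 1000 reviews, ada-002 dimensionality

n_clusters = 4                                   # hypothetical choice
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(review_embeddings)

# Reviews sharing a label can then be sampled and sent to the LLM for theme summarization (step 3)
for c in range(n_clusters):
    print(f"cluster {c}: {np.sum(cluster_labels == c)} reviews")
```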

Hugging Face hosts Massive Text Embedding Benchmark (MTEB) Leaderboard (MTEB Leaderboard — a Hugging Face Space by mteb) for measuring the performance of text embedding models on diverse embedding tasks.

Common pitfalls when using embedding models:

  1. Using pre-trained models without task-specific fine-tuning: using the direct vector representations of a model that has only been pre-trained will not produce a useful embedding representation for every task. Search ranking is an example of such a task; see details in How not to use BERT for search ranking.
  2. Using fine-tuned single-vector embedding models out-of-domain: when we take a single-vector representation model fine-tuned on MS MARCO labels, it does not beat BM25 in a different domain with slightly different types of documents and questions. A multi-vector representation model for search, like ColBERT, generalizes much better than single-vector representations.
  3. Lack of understanding of vector search tradeoffs: do we need to introduce approximate nearest neighbor search (ANNS) instead of exact nearest neighbor search? As in many aspects of life, this is a question of tradeoffs around query serving: latency Service Level Agreement (SLA), query throughput, and accuracy. Exact nearest-neighbor search brute-force computes the distance between the query and all eligible documents, which generally carries a high cost.

Word2Vec efficiently learns word embeddings by training a shallow neural network to predict the context of a word in a vocabulary, where the context is defined by a sliding window of a given width; the key idea is to preserve the semantics of the words.

Knowledge graph embedding algorithms have become a powerful tool for representing and reasoning about complex structured data. These algorithms learn low-dimensional embeddings of entities and relations in a knowledge graph, allowing for efficient computation of similarity and inference tasks.
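
As a small illustration of the idea behind one popular knowledge graph embedding algorithm (TransE), here is a sketch in plain NumPy: a triple (head, relation, tail) is considered plausible when head + relation lands close to tail. The entities, relation, and vectors below are made up; in practice, libraries such as PyKEEN learn them from a real knowledge graph.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Toy entity and relation embeddings; in practice these are learned from the graph
entities = {name: rng.normal(size=dim) for name in ["paris", "france", "tokyo", "japan"]}
relations = {"capital_of": rng.normal(size=dim)}

def transe_score(head: str, relation: str, tail: str) -> float:
    """TransE plausibility: smaller ||h + r - t|| means a more plausible triple."""
    h, r, t = entities[head], relations[relation], entities[tail]
    return float(np.linalg.norm(h + r - t))

print(transe_score("paris", "capital_of", "france"))  # would be low after training
print(transe_score("paris", "capital_of", "japan"))   # would be higher after training
```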

Appendix

https://learn.deeplearning.ai/google-cloud-vertex-ai
