
Embeddings

Recent successes in machine learning owe much of their power to transfer learning and representation learning. Transfer learning lets us fine-tune a model trained on one task for a new task, using fewer training examples and often improving accuracy. Representation learning lets a system discover, from raw data, the representations required for feature detection or classification. (David et al. 2019) defined representations as more compact, lower-dimensional, dense vectors learned from the data; they are commonly referred to as embeddings.

Embeddings are at the center of recent machine learning breakthroughs including Word2Vec, Pin2Vec, and OpenAI CLIP. They are also the core idea in model pretraining, which has seen successes in natural language processing, computer vision, and audio. We look at some examples below where representation learning led to performance gains and to breakthroughs in language modeling, forecasting, and image representations.

Embeddings have been particularly useful when dealing with categorical variables in structured data. A dataset containing product names, customer types, etc. can unlock more insights by mapping each unique value to a dense, multi-dimensional embedding vector that captures intricate relationships between items in the dataset. (Alexandre et al, 2015) and (David et al. 2019) used this technique to obtain state-of-the-art results in taxi destination prediction and fashion demand forecasting tasks, respectively.


Researchers (Tomas et al, 2013) set out to create vectorized representations of text to enable machine learning on unstructured textual data. Each unique word in the corpus is assigned a row in an embedding matrix, and that row is the word's embedding vector. At the beginning of training the vectors are random numbers, but they are updated via backpropagation until they become efficient representations in Euclidean space.
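To make the embedding matrix concrete, here is a minimal sketch in plain Python of a randomly initialized matrix and a row lookup. The vocabulary, the dimension of 3, and the helper names `init_embedding_matrix` and `lookup` are illustrative assumptions, not part of the Word2Vec code.

```python
import random


def init_embedding_matrix(vocab, dim, seed=0):
    """Build a word-to-row index and a randomly initialized embedding matrix."""
    rng = random.Random(seed)
    word_to_row = {word: i for i, word in enumerate(vocab)}
    # One row per word, each row holding `dim` small random numbers.
    matrix = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    return word_to_row, matrix


def lookup(word, word_to_row, matrix):
    """Return the row of the matrix that belongs to `word`."""
    return matrix[word_to_row[word]]


# Hypothetical toy vocabulary; a real corpus has many thousands of words.
vocab = ["king", "queen", "man", "woman"]
word_to_row, E = init_embedding_matrix(vocab, dim=3)
queen_vector = lookup("queen", word_to_row, E)
print(queen_vector)
```

Before training, these rows carry no meaning; it is the training updates that turn them into useful representations.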

There are two approaches to training. The continuous bag-of-words (CBOW) model predicts a target word from its surrounding context words by combining the distributed representations of those context words. The continuous skip-gram model trains a simple neural network with one hidden layer to predict the probability that a word appears in the context of a given input word. Both approaches rely on self-supervision: the data is not labeled; instead, part of the text is withheld from the model, and predicting the withheld part provides the signal used to update the representations.
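The difference between the two approaches is easiest to see in the training examples each one derives from raw text. The sketch below builds those examples from a tokenized sentence; the window size of 2 and the function names are assumptions chosen for illustration, and the neural network itself is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (input word, context word) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the surrounding context words are the input, the middle word is the target."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples


sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence)[:4])
print(cbow_examples(sentence)[:2])
```

In both cases the labels come from the text itself, which is what makes the training self-supervised.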

The result of training these models is that vectors of similar words are grouped together. The model captures the semantic relationships between words, so similar words end up closer to each other than dissimilar words. The model was also able to capture context and even definitions of words from the data. See below a sample showing these semantic relationships being discovered.

Word2Vec demonstrated that we can learn useful representations of natural language by training neural networks in a self-supervised way. These learned representations can be used to perform text similarity checks, sentiment analysis, and recommendations. 
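A text similarity check on learned vectors can be sketched with cosine similarity, one common choice of measure. The three vectors below are made-up toy values for illustration only, not output from a trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: "cat" and "dog" point in similar directions, "car" does not.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

With these toy values the first score is higher than the second, mirroring how related words score higher than unrelated ones in a trained model.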


(Kevin Ma, 2017) used an approach similar to Word2Vec to create embeddings from pinning activity at Pinterest. The sequence of pins a user engaged with was treated like a sentence in Word2Vec, with each pin playing the role of a word. Embeddings were then learned for these pins, and pin similarity could be obtained through nearest-neighbor search.
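The nearest-neighbor step can be sketched as a brute-force search over pin vectors. The pin identifiers and two-dimensional vectors below are hypothetical, and Euclidean distance is an assumption made for this example; a production system would likely use an approximate index rather than scanning every pin.

```python
import heapq
import math


def nearest_pins(query_id, pin_vectors, k=2):
    """Return the ids of the k pins whose vectors are closest to the query pin."""
    query = pin_vectors[query_id]
    candidates = (
        (math.dist(query, vec), pin_id)
        for pin_id, vec in pin_vectors.items()
        if pin_id != query_id
    )
    # nsmallest sorts by distance first, so the closest pins come out first.
    return [pin_id for _, pin_id in heapq.nsmallest(k, candidates)]


# Hypothetical learned pin embeddings (real ones would have many more dimensions).
pin_vectors = {
    "pin_sofa": [0.9, 0.1],
    "pin_armchair": [0.8, 0.2],
    "pin_lamp": [0.6, 0.4],
    "pin_pasta": [0.1, 0.9],
}

print(nearest_pins("pin_sofa", pin_vectors, k=2))
```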

Demand Forecasting

(David et al. 2019) needed to forecast demand for highly differentiated products for a fashion brand. A key insight in their research was that many new products share similarities with previously launched items, so embeddings would be a natural way to find those similarities in vector space.

First, they trained a feed-forward neural network to predict the quantity sold from customer and product embeddings. This was not the final task, but a pre-task used to obtain useful representations of customers and products. Each value of a categorical variable was mapped to a d-dimensional continuous vector, creating a matrix E of dimensions C x d (where C is the number of classes, i.e. the cardinality of the variable). For each input, the corresponding row was looked up in the matrix to get the embedding, which was fed into the network and updated via backpropagation.
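The lookup-and-update cycle can be sketched with a deliberately simplified model: a single categorical variable, a C x d matrix E, and one linear output unit trained with squared error. This is an illustration under those assumptions, not the network from the paper; the toy targets, learning rate, and function name are all hypothetical.

```python
import random


def train_entity_embeddings(samples, num_classes, dim, lr=0.05, epochs=200, seed=0):
    """Learn a C x d embedding matrix by predicting a number from a category id."""
    rng = random.Random(seed)
    E = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)]
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]  # linear output weights
    b = 0.0                                           # output bias
    initial_E = [row[:] for row in E]                 # copy kept for comparison

    def predict(c):
        # Look up row c of E, then apply the linear output unit.
        return sum(wi * ei for wi, ei in zip(w, E[c])) + b

    def mean_squared_error():
        return sum((predict(c) - t) ** 2 for c, t in samples) / len(samples)

    loss_before = mean_squared_error()
    for _ in range(epochs):
        for c, target in samples:
            row = E[c]
            grad = 2.0 * (predict(c) - target)  # derivative of squared error
            for k in range(dim):
                grad_w = grad * row[k]
                grad_e = grad * w[k]
                w[k] -= lr * grad_w
                row[k] -= lr * grad_e           # only the looked-up row changes
            b -= lr * grad
    return E, initial_E, loss_before, mean_squared_error()


# Hypothetical data: (category id, quantity sold); category 3 never appears.
samples = [(0, 1.0), (1, 2.0), (2, 3.0)]
E, initial_E, loss_before, loss_after = train_entity_embeddings(samples, num_classes=4, dim=2)
print(loss_before, loss_after)
```

Because gradients flow only into the row that was looked up, the row for the unseen category keeps its random initial values, while the rows for observed categories move toward values that help the prediction.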

Once these embeddings had been learned, they could be used to perform nearest-neighbor searches to identify similar products over time and even generalize to unseen products in the future test set. These products were used to create inputs for a second recurrent neural network which was able to achieve higher forecasting accuracy than previous approaches.


Learning useful representations from data is a powerful way to model categorical variables and language. A successful embedding maps input data into Euclidean space where similar items lie closer together, thereby capturing the semantic relationships between data items. These embeddings can then serve as input to other machine learning systems, exposing relationships in the data that were previously hidden.


Davide Mezzogori, Francesco Zammori. An Entity Embeddings Deep Learning Approach for Demand Forecast of Highly Differentiated Products. Procedia Manufacturing, Volume 39, 2019, Pages 1793-1800, ISSN 2351-9789.

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space, 2013.

Alexandre de Brébisson, Étienne Simon, Alex Auvolat, Pascal Vincent, Yoshua Bengio. Artificial Neural Networks Applied to Taxi Destination Prediction, 2015.

Engineering, Pinterest. “Applying Deep Learning to Related Pins.” Medium, The Graph, 1 Mar. 2017.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision, 2021.
