Introduction to Embeddings and Embedding Models

You recently started your AI journey and keep hearing terms like embeddings and encoding. These concepts are more critical than you might think. For instance, did you know that LLMs like ChatGPT, Gemini, and DeepSeek rely on embeddings to understand your prompts? Their ability to understand prompts depends heavily on the quality of those embeddings. Let's explore why.

What are Embeddings?

In simple terms, embeddings are vector representations of data in a vector space. Embeddings are not limited to words; they can represent other inputs such as sentences, images, and graphs. Embeddings map high-dimensional data like text and images into low-dimensional vectors. This makes complex data far easier to process for models that only understand continuous numbers.

Vectors are the key here. In computer science, a vector is represented as an array: [a] is a 1-dimensional vector, [a, b] is 2-dimensional, and so on. From a mathematical view, vectors can be added, and this applies to embeddings too. Just as adding two vectors gives another vector, adding two embeddings gives another embedding.
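To make the idea concrete, here is a minimal sketch of element-wise vector addition in plain Python. The numbers are toy values chosen for illustration, not taken from any trained model:

```python
# Toy 3-dimensional "embeddings" -- illustrative numbers only,
# not produced by any real embedding model.
a = [0.5, 5.5, -0.7]
b = [0.3, 5.1, -0.9]

def add_vectors(u, v):
    # Adding two vectors element-wise yields another vector of the
    # same dimension, just as adding two embeddings yields an embedding.
    return [ui + vi for ui, vi in zip(u, v)]

print(add_vectors(a, b))
```

The result is itself a 3-dimensional vector, which is exactly the closure property the paragraph above describes.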

Types of Embeddings:

Word Embeddings:

Word embeddings are vector representations of words in a vector space where each word is assigned a vector. In that space, related words sit close to each other. Say "man" is represented by [0.5, 5.5, -0.7]. Since "man" and "woman" are semantically similar words, "woman" would have a nearby vector. We can also perform arithmetic on such vectors to obtain meaningful results, e.g., "king" – "man" + "woman" ≈ "queen". Some popular models for word embeddings are Word2Vec, GloVe, and fastText.
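The famous king/queen analogy can be sketched with hand-crafted toy vectors. Note these values are picked by hand so the arithmetic works out cleanly; real Word2Vec vectors have hundreds of dimensions and are learned from data:

```python
import math

# Hand-picked toy embeddings (assumptions for illustration only).
emb = {
    "king":  [1.0, 1.0, 0.0],
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.9, 0.1, 0.2],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Compute king - man + woman element-wise.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# The nearest remaining word (excluding the three inputs) is "queen".
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```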

Sentence Embeddings:

Sentence embeddings are vector representations of whole sentences. Unlike word embeddings, the whole sentence is mapped into the vector space, and semantically similar sentences end up close together. Models like InferSent and Doc2vec (an extension of Word2vec) are used to generate sentence embeddings.
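A very crude way to see the idea, far simpler than what InferSent or Doc2vec actually do, is mean-pooling: averaging the word vectors of a sentence. The word vectors below are made-up toy values:

```python
# Toy word vectors (illustrative values only); real models learn
# these from large corpora.
word_vecs = {
    "the": [0.1, 0.1], "cat": [0.9, 0.2],
    "sat": [0.3, 0.8], "dog": [0.8, 0.3],
    "ran": [0.2, 0.9],
}

def sentence_embedding(sentence):
    """Mean-pool the word vectors: a naive but common baseline for
    turning word embeddings into a single sentence embedding."""
    vecs = [word_vecs[w] for w in sentence.lower().split() if w in word_vecs]
    dim = len(next(iter(word_vecs.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

print(sentence_embedding("the cat sat"))
```

Dedicated sentence-embedding models outperform this averaging trick because they account for word order and context, which mean-pooling throws away.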

Image Embeddings:

Images can also be transformed into vectors; that is exactly what image embeddings are. CNNs (Convolutional Neural Networks) are well suited to generating image embeddings, which are later used for tasks like image classification and image retrieval.

Audio and Speech Embeddings:

Audio and speech embeddings are generated by converting raw audio and speech data into vectors suitable for tasks like speech recognition and emotion detection. VGGish and wav2vec are models dedicated to such embeddings.

Why do we need Embeddings?

The problem with raw, categorical, or high-dimensional data

Before embeddings were ubiquitous, encoding techniques like one-hot encoding were used to represent categorical variables. However, this technique has limitations. Let’s say we have a small vocabulary of 5 words:

"cat", "dog", "fish", "bird", "horse"

One-hot encoding works by generating a binary representation for each class, where the position corresponding to the word is set to 1 and all others to 0. For our 5-word vocabulary, we get:

cat   → [1, 0, 0, 0, 0]
dog   → [0, 1, 0, 0, 0]
fish  → [0, 0, 1, 0, 0]
bird  → [0, 0, 0, 1, 0]
horse → [0, 0, 0, 0, 1]

Such representations are called sparse vectors, and they were the foundation for early word-representation techniques. This works fine for a small number of classes, but what if we had every word in the English language? Each vector would need hundreds of thousands of dimensions, making computation expensive and unwieldy. Worse, every pair of distinct one-hot vectors is orthogonal, so the representation carries no information about semantic relationships between words.
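The scheme above takes only a few lines to implement. Note how the vector length equals the vocabulary size, which is exactly what makes it impractical at scale:

```python
vocab = ["cat", "dog", "fish", "bird", "horse"]

def one_hot(word, vocab):
    """Return a sparse binary vector with a 1 at the word's index."""
    vec = [0] * len(vocab)  # vector length grows with the vocabulary
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("fish", vocab))  # [0, 0, 1, 0, 0]

# The dot product of any two distinct one-hot vectors is 0: the
# encoding says nothing about how similar "cat" and "dog" are.
```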

Embeddings, also called dense vectors, solve this problem. They not only represent words numerically but also capture their semantics: words with similar meanings are separated by a small distance in the vector space. For similarity measurement, cosine similarity, Euclidean distance, Manhattan distance, and several other metrics can be used.
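The three measures just mentioned are straightforward to compute. Here is a sketch using the toy "man"/"woman" vectors from earlier (illustrative values, not from a real model):

```python
import math

def cosine_similarity(u, v):
    # 1.0 means the vectors point in the same direction.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def euclidean(u, v):
    # Straight-line distance between the two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))

# Toy dense embeddings for two semantically similar words.
man, woman = [0.5, 5.5, -0.7], [0.4, 5.2, -0.6]
print(cosine_similarity(man, woman))  # close to 1.0
print(euclidean(man, woman))          # small distance
print(manhattan(man, woman))          # small distance
```

Cosine similarity is the most common choice for text embeddings because it ignores vector magnitude and compares direction only.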

Real World Applications

Embeddings have become foundational across NLP, recommendation systems, and computer vision. Their power lies in transforming raw, high-dimensional data into dense vectors that encode contextual, semantic, or behavioral relationships, enabling machines to reason more effectively about language, users, and visual content. 

Text Search:

Embeddings are key for any retrieval task that involves finding similar documents given a query. Embeddings and embedding models are a crucial part of the RAG (Retrieval-Augmented Generation) architecture, a popular approach for reducing LLM hallucination.

Recommendation system:

In a recommendation system, whether for movies, food, or clothes, embedding models are used to represent the items as vectors. These are stored in a vector space and compared to recommend similar items.
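At its simplest, this is a nearest-neighbor search over item vectors. The item embeddings below are hypothetical made-up values for a tiny movie catalog:

```python
import math

# Hypothetical item embeddings (made-up numbers) for a tiny catalog.
items = {
    "Alien":        [0.9, 0.1, 0.8],
    "Blade Runner": [0.85, 0.15, 0.75],
    "Toy Story":    [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recommend(liked, catalog, k=1):
    """Rank the rest of the catalog by cosine similarity to a liked item."""
    ranked = sorted((t for t in catalog if t != liked),
                    key=lambda t: cosine(catalog[liked], catalog[t]),
                    reverse=True)
    return ranked[:k]

print(recommend("Alien", items))  # ['Blade Runner']
```

Production systems use approximate nearest-neighbor indexes rather than a full sort, but the core comparison is the same.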

Sentiment Analysis:

Sentiment is abstract and hard for models to detect, but embeddings that capture sentiment-related features can ease the process. Positive words or sentences have similar embeddings, which distinguishes them from the embeddings of negative ones.
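One simple way to exploit this clustering is to classify a new embedding by which sentiment centroid it lies closer to. The toy vectors below are assumptions where the first dimension loosely encodes polarity:

```python
import math

# Made-up toy embeddings: first axis loosely encodes sentiment polarity.
pos = {"great": [0.9, 0.3], "wonderful": [0.8, 0.4]}
neg = {"awful": [-0.9, 0.2], "terrible": [-0.8, 0.3]}

def centroid(vectors):
    # Average the vectors coordinate by coordinate.
    vecs = list(vectors)
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def polarity(vec, pos_c, neg_c):
    """Classify by which sentiment centroid is closer (Euclidean)."""
    if math.dist(vec, pos_c) < math.dist(vec, neg_c):
        return "positive"
    return "negative"

pos_c, neg_c = centroid(pos.values()), centroid(neg.values())
print(polarity([0.7, 0.5], pos_c, neg_c))  # positive
```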

Evolution of Embedding Models

From One-Hot to Word2Vec:

One-hot encoding was the most primitive way of representing words as vectors, where only the position of the corresponding word was set to one and all others to zero. It was succeeded by TF-IDF (Term Frequency-Inverse Document Frequency), which attempts to capture the importance of a word based on its frequency within a document and across all documents.

Karen Spärck Jones proposed the idea in her 1972 paper "A Statistical Interpretation of Term Specificity and Its Application in Retrieval." TF-IDF captures more useful information than one-hot encoding, but it still cannot capture semantics.
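A minimal from-scratch sketch of TF-IDF (using the classic unsmoothed idf; library implementations such as scikit-learn's use smoothed variants) shows the key behavior: a word that appears in every document scores low, while a rarer word scores higher:

```python
import math

# A tiny corpus; each "document" is one sentence.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

def tf_idf(term, doc, docs):
    """TF-IDF sketch: term frequency within one document, scaled by
    the (unsmoothed) inverse document frequency across the corpus."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in docs if term in d.split())
    idf = math.log(len(docs) / df)  # classic idf, no smoothing
    return tf * idf

# "the" occurs in two of three documents, so its idf is small;
# "mat" occurs in only one, so it scores higher despite lower tf.
print(tf_idf("the", docs[0], docs))
print(tf_idf("mat", docs[0], docs))
```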

Word2Vec was a revolutionary technique first proposed in the paper "Efficient Estimation of Word Representations in Vector Space," published in 2013 by Tomas Mikolov and colleagues at Google.

It uses a shallow neural network to capture the linguistic context of words from a large corpus of text, producing embeddings that map words to a vector space, typically of a few hundred dimensions. Cosine similarity is used to measure the similarity between the embeddings.

Static vs. Contextual Embedding Models

Models like Word2Vec, GloVe, and fastText are effective at generating dense vector representations of words, known as embeddings, which capture semantic relationships. Word2Vec, in particular, learns these embeddings using one of two architectures: CBOW (Continuous Bag of Words) or Skip-gram.

However, the embeddings produced by these models are static, meaning each word has a single representation regardless of context. As a result, they struggle with polysemy, where a word has multiple meanings. For example, they cannot distinguish between the word "bat" in the sentences:

“He bought a new bat to play cricket.”

“A bat flies at night.”

Contextual nuances are lost because the embedding is based solely on word co-occurrence statistics in a fixed window of text, rather than the full context of a sentence.

In such scenarios, a contextual model like BERT (Bidirectional Encoder Representations from Transformers) excels: its word embeddings are generated based on the context of the surrounding words. BERT's bidirectionality allows it to attend to both the left and right context of a word during training. The embeddings BERT produces are therefore contextual; the same word can have different embeddings depending on its context. This makes BERT very powerful at generating robust embeddings that retain contextualized semantics.

Key Challenges and Limitations

Embedding models are a game-changing concept, but they come with ethical considerations. The models may learn biases present in the training data, which can lead to unfair or discriminatory outcomes in their applications. Recognizing and mitigating such bias is therefore a crucial part of developing safe and ethical AI systems.

Bias in Embeddings: 

Text produced by humans is inherently biased, so when embedding models are trained to learn semantics and context, those biases slip in. A common example is associating "doctor" with men and "nurse" with women, reflecting societal stereotypes. These biases can lead to unfairness and discrimination in real-world applications like recommendation or hiring systems.

To mitigate such biases, techniques like debiasing can be adopted, which remove or neutralize biased dimensions of the embedding space. Regular testing for bias and fairness, along with diverse and representative training data, is a must.
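One classic debiasing idea (the "neutralize" step of hard debiasing, as proposed by Bolukbasi et al.) is to project the bias direction out of a word vector: v − (v·g)g for a unit bias direction g. Here is a sketch with toy 2-dimensional vectors where the first axis is assumed, purely for illustration, to encode gender:

```python
import math

def project_out(v, direction):
    """Remove the component of v along a bias direction:
    v - (v . g) g, with g the unit-normalized direction."""
    norm = math.sqrt(sum(x * x for x in direction))
    g = [x / norm for x in direction]
    dot = sum(a * b for a, b in zip(v, g))
    return [a - dot * b for a, b in zip(v, g)]

# Toy setup: assume the first axis is a gender direction
# (made-up numbers, not from a real embedding model).
gender_dir = [1.0, 0.0]
doctor = [0.6, 0.8]  # biased: leans toward one side of the gender axis

debiased = project_out(doctor, gender_dir)
print(debiased)  # the gender component is now zero
```

After the projection, the debiased vector is orthogonal to the bias direction, so distance comparisons along that axis no longer favor either group.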

Transparency and Accountability:

Transparency and accountability are another aspect to consider when dealing with embedding models. Advanced embedding models represent data in hundreds of dimensions, which are not human-interpretable, making it hard to explain the outcomes of the AI systems built on them. Developers should therefore be transparent about their training data and choice of models.

Conclusion 

Embedding models are a cornerstone of modern AI, allowing powerful models to process high-dimensional data in ways that were previously impossible. The evolution of embeddings from Word2Vec and GloVe to state-of-the-art models like BERT and GPT has opened new possibilities in NLP, computer vision, and recommendation systems.

As current models continue to evolve and shape the world, understanding embedding models becomes essential. Understanding their use cases equips us to build powerful AI systems that transform conventional tasks.
