Understanding Large Language Models: Chapter 1 - Transformers and the Transformer Block

Viraj Kadam
6 min read · Jul 19, 2024


Large Language Models (LLMs) have taken the world by storm, especially after the public release of the GPT-based ChatGPT. The application has found many users across a wide range of industries and use cases.

In this series of blogs, we will attempt to understand how Large Language Models work, how they are trained, what data they are trained on, how to use them in our applications, and how to evaluate them. This will be a hands-on course, alongside the theory behind Large Language Models.

What even is a language model?

A language model is a mathematical model that assigns a probability distribution over a sequence of tokens. In simple words, the model must be able to predict the next word appropriately, given a sequence of previous words.

This may seem straightforward, but it is not. For a language model to possess this ability, it needs linguistic capabilities and also some world knowledge.
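
To make this concrete, here is a minimal sketch of a language model: a bigram model that estimates the probability of the next word from counts over a toy corpus. The corpus and numbers are made up purely for illustration; real LLMs learn these probabilities with neural networks over much longer contexts.

```python
from collections import Counter, defaultdict

# A toy corpus, whitespace-tokenized. Purely illustrative.
corpus = "the dog chased the ball . the dog ate the food .".split()

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_distribution(prev_word):
    """Probability of each possible next word, given the previous word."""
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_distribution("the"))
# {'dog': 0.5, 'ball': 0.25, 'food': 0.25}
```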

But how does a language model convert text into a representation it can understand? Here come the concepts of Tokenization and Embeddings.

Tokenization and Embedding: Representing text numerically

Before we get into the modelling side of things, let us understand how the model understands words. A neural network is a mathematical model, so we need to find a way to represent the text numerically. But before we can represent text numerically, we need to break it down into individual segments. We call each individual segment a token.

Tokenization

The process of breaking down text into individual meaningful segments is called tokenization. It is an essential step in any neural-network NLP model. There are a few different ways to go about tokenization, the most basic one being whitespace tokenization, where you split sentences on whitespace. We will discuss other tokenization techniques in detail in a later blog; a minimal sketch is shown below.
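
A minimal sketch of whitespace tokenization, plus a subword example assuming the tiktoken library is installed (pip install tiktoken); cl100k_base is the encoding used by GPT-4-era models.

```python
# Whitespace tokenization: the simplest possible scheme.
sentence = "Transformers changed natural language processing."
tokens = sentence.split()
print(tokens)
# ['Transformers', 'changed', 'natural', 'language', 'processing.']

# Subword tokenizers (BPE / WordPiece) split rarer words into pieces.
# Sketch using the tiktoken library; the exact token ids depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode(sentence)
print(ids)                              # list of integer token ids
print([enc.decode([i]) for i in ids])   # the corresponding string pieces
```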

WordPiece tokenization (from: https://botpenguin.com/glossary/wordpiece-tokenization)

You can see tokenization in action here: link

Tokenization using the GPT-4 tokenizer

Word Embedding

Now that we have a modular representation of the elements of a text, the next problem is to find a meaningful mathematical representation of those elements. Neural networks are mathematical models and need numerical inputs to work with. So how do we convert text tokens into a meaningful mathematical representation a model can work with? Here comes the concept of word embeddings.

With word embeddings, we aim to transform each token into a vector that represents some properties of that word in a latent (hidden) mathematical space. These representations may not translate into anything meaningful to us, but they have interesting properties, and they are learned along with the model. Let us explore some interesting properties of word embeddings.

  1. Words which are semantically close in meaning have a higher similarity to each other than unrelated words. The similarity between the word vectors of ‘dog’ and ‘pup’ should be high, since they mean similar things.
  2. Algebraic operations on word vectors yield vectors that are similar to the words we would expect from the corresponding semantic operations. Subtracting the word vector of ‘man’ from ‘king’ and adding ‘woman’ yields a vector that is highly similar to the word vector of ‘queen’ (see the sketch below).
  3. We can uncover certain biases in a dataset by performing algebraic operations on its word vectors.
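
A toy sketch of the first two properties, using cosine similarity over tiny hand-made 3-dimensional vectors. Real embeddings such as word2vec or GloVe have hundreds of dimensions and are learned from data; the numbers here are invented for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Tiny hand-made toy vectors, not real learned embeddings.
emb = {
    "dog":   np.array([0.90, 0.80, 0.10]),
    "pup":   np.array([0.85, 0.75, 0.20]),
    "car":   np.array([0.10, 0.20, 0.90]),
    "king":  np.array([0.80, 0.10, 0.60]),
    "man":   np.array([0.70, 0.00, 0.10]),
    "woman": np.array([0.70, 0.90, 0.10]),
    "queen": np.array([0.80, 1.00, 0.60]),
}

print(cosine_similarity(emb["dog"], emb["pup"]))   # high (~0.99)
print(cosine_similarity(emb["dog"], emb["car"]))   # low  (~0.30)

# Analogy: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(cosine_similarity(target, emb["queen"]))     # close to 1
```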

Transformers: The architecture that has led to enormous improvements in Natural Language Processing tasks

If you look at the models making headlines today, almost all of them have one thing in common: a transformer-based architecture.

Google introduced the transformer architecture in the paper Attention Is All You Need. It revolutionised Natural Language Processing, and transformers became the backbone of subsequent advances in NLP.

Transformers are parallel: earlier NLP models were mostly based on Recurrent Neural Networks, where operations happen sequentially. Transformers have a parallel architecture, so all the input tokens are fed into the model at once.

Problems with previous Recurrent Models

  • Linear interaction distance: the sequence is unrolled left to right, so nearby words affect each other's meaning the most. This means recent words have a much higher effect on the model's predictions than words earlier in the sentence.
  • Hard to learn long-distance dependencies: since more importance is given to recent words, the model does not capture long-range dependencies well.
  • The forward and backward passes cannot be parallelised and must take n sequential steps (where n is the number of tokens in the sentence). GPU parallelism is therefore not used effectively, and training on very large datasets is difficult.

Attention block and self attention mechanism

The core component of transformer-style models is the attention block, which is built on the attention mechanism. Let's look at the attention mechanism before we delve into the architecture of the transformer model.

The Attention Mechanism

We can think of attention as a fuzzy lookup in a key-value store.

  • In a lookup table we have keys that map to values; the query matches one key exactly, returning its value.
  • In attention, the query matches all the keys softly, each with a weight between 0 and 1. The keys' values are weighted by these weights and summed (see the sketch below).
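
A minimal numeric sketch of this fuzzy lookup, with made-up 2-dimensional keys and scalar values:

```python
import numpy as np

def softmax(x):
    """Turn raw scores into weights in (0, 1) that sum to 1."""
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Toy keys and values, invented for illustration.
keys   = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
values = np.array([10.0, 20.0, 30.0])

query = np.array([0.9, 0.1])

scores  = keys @ query      # how well the query matches each key
weights = softmax(scores)   # soft match instead of an exact lookup
output  = weights @ values  # weighted sum of the values

print(weights, output)
```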

Self Attention

For a sentence, let X be the matrix containing the embeddings of all n words of the sentence. X then has n rows and d columns, where d is the dimension of the embedding.

The Query, Key and Value matrices are then computed from X using three learned weight matrices W_Q, W_K and W_V:

Q = X·W_Q
K = X·W_K
V = X·W_V

where Q is the Query matrix, K is the Key matrix and V is the Value matrix. The final output is:

Output = softmax(Q·Kᵀ / √d_k)·V

where d_k is the dimension of the queries and keys; dividing by √d_k keeps the scores in a reasonable range before the softmax.
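
Here is a minimal single-head self-attention sketch in NumPy, following the formula above. The sizes and the random weight matrices are stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self attention.

    X: (n, d) token embeddings; W_Q, W_K, W_V: (d, d_k) learned projections.
    """
    Q = X @ W_Q                          # queries, shape (n, d_k)
    K = X @ W_K                          # keys,    shape (n, d_k)
    V = X @ W_V                          # values,  shape (n, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d_k) weighted sums of values

# Toy sizes: 4 tokens, embedding dim 8, head dim 4; random weights
# stand in for learned parameters.
rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 4
X = rng.normal(size=(n, d))
out = self_attention(X,
                     rng.normal(size=(d, d_k)),
                     rng.normal(size=(d, d_k)),
                     rng.normal(size=(d, d_k)))
print(out.shape)  # (4, 4)
```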

Intuition behind self attention

The attention mechanism allows us to aggregate information from many (key, value) pairs. The output embedding of any token in the input text is a weighted sum over all the tokens in that input. The key and query matrices are used to form the attention weights, and the softmax ensures that the weights add up to 1. These weights are then multiplied with the value matrix to form the weighted representation of that token.

Source: Attention Is All You Need (paper)

Problems with using just the self attention

  • No inherent notion of order: since attention carries no inherent order information, we need to encode the order in the token embeddings.
    Solution: add a positional embedding to the token embedding to encode the positional information for that token. The updated embeddings are Xi_new = Xi + Pi, where Xi_new is the updated embedding for token i, Xi is the original embedding and Pi is the positional embedding for token i (a sketch follows the figure below).
  • No non-linearity for deep learning: stacking more self-attention layers just re-averages the value vectors.
    Solution: add an MLP (a dense layer with a non-linear activation) to process each output vector.
  • Ensure we do not peek into the future when predicting a sequence.
    Solution: masked self attention (we will study this in later chapters of this series).
Positional embeddings added to token embeddings (from: https://paperswithcode.com/method/wordpiece)
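
As a sketch, here is one common choice of positional encoding: the fixed sinusoidal encoding from the original transformer paper, added to randomly generated stand-in token embeddings. Learned positional embeddings are another option.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings as in the original transformer paper."""
    positions = np.arange(n_positions)[:, None]    # (n, 1)
    dims = np.arange(d_model)[None, :]             # (1, d)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates               # (n, d)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])          # odd dimensions: cosine
    return pe

# X_new = X + P : add positional information to the token embeddings.
n, d = 6, 16
X = np.random.normal(size=(n, d))   # token embeddings (random stand-ins)
P = sinusoidal_positional_encoding(n, d)
X_new = X + P
print(X_new.shape)  # (6, 16)
```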

The Transformer Model

Source : Attention is all you need (paper)

Components of the transformer model

  • Multi-head self attention: multiple attention heads run in parallel, each attending to the input in its own way.
  • The positional encoding layer: encodes the positional information of each token along with its embedding.
  • The MLP (feed-forward) layer after the self-attention block: introduces non-linearity into the model.
  • Residual connections: skip connections that help the model train better by retaining earlier information and letting gradients propagate more easily.
  • Layer Normalization: helps the model train faster by normalizing the hidden activations to zero mean and unit standard deviation, removing uninformative variation. A minimal sketch combining all of these components is shown below.
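
A minimal sketch of a transformer block that wires these components together, using PyTorch's built-in multi-head attention. The sizes are illustrative and not tied to any particular published model.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal encoder-style block: multi-head self attention, a feed-forward
    MLP, residual connections and layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                    # the non-linearity
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self attention + residual connection + layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Feed-forward MLP + residual connection + layer norm
        x = self.norm2(x + self.dropout(self.mlp(x)))
        return x

# A batch of 2 sequences, 10 tokens each, embedding dimension 512.
x = torch.randn(2, 10, 512)
print(TransformerBlock()(x).shape)  # torch.Size([2, 10, 512])
```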

Resources for further reading
