2.1. Transformer#

LLMs are usually built with the Transformer architecture [VSP+17].

There is almost nothing in a Transformer-based LLM besides a stack of self-attention layers. For example, here is the LLaMA-1 6.7B architecture:

| Layer | Input | Output |
|---|---|---|
| nn.Embedding (token embedding) | (batch=*, sequence=2048, token_index: int) | (batch=*, sequence=2048, emb_dim=4096) |
| Self-Attention 1 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| Self-Attention 2 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| Self-Attention 3 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| … | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| Self-Attention 31 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| Self-Attention 32 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| nn.Linear -> nn.Softmax (lm_head) | (batch, sequence, emb_dim) | (batch, sequence, n_token=32_000) |

There are other ways to arrange self-attention layers, but recent LLMs are all alike; this variant is called a decoder-only Transformer.
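To make the table concrete, here is a minimal PyTorch sketch of that decoder-only skeleton. It is deliberately simplified: a real LLaMA block also has RMSNorm, rotary embeddings, and an MLP, and the class and argument names below (`Block`, `DecoderOnlyLM`, `n_head=32`, etc.) are my own choices for illustration, not LLaMA's actual code.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """One decoder block, reduced to causal self-attention plus a residual.
    (A real LLaMA block also has RMSNorm, rotary embeddings, and an MLP.)"""

    def __init__(self, emb_dim: int, n_head: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(emb_dim, n_head, batch_first=True)

    def forward(self, x):
        t = x.size(1)
        # Causal mask: each position may only attend to itself and earlier positions.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return x + out


class DecoderOnlyLM(nn.Module):
    def __init__(self, n_token=32_000, emb_dim=4096, n_layer=32, n_head=32):
        super().__init__()
        self.tok_emb = nn.Embedding(n_token, emb_dim)           # token embedding
        self.blocks = nn.ModuleList(Block(emb_dim, n_head) for _ in range(n_layer))
        self.lm_head = nn.Linear(emb_dim, n_token, bias=False)  # back to token space

    def forward(self, token_index):              # (batch, sequence) of ints
        x = self.tok_emb(token_index)            # (batch, sequence, emb_dim)
        for block in self.blocks:
            x = block(x)                         # (batch, sequence, emb_dim)
        return self.lm_head(x)                   # (batch, sequence, n_token) logits
```

A smoke test with toy sizes (so it fits in memory): `DecoderOnlyLM(n_token=100, emb_dim=64, n_layer=2, n_head=4)(torch.randint(0, 100, (1, 16)))` returns a `(1, 16, 100)` tensor of logits.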

2.1.1. Token embedding#

It is a simple embedding layer, i.e., a trainable look-up table mapping token_index -> token_embedding.
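In PyTorch, that look-up table is `nn.Embedding`. A sketch with made-up toy sizes (10-token vocabulary, 4-dimensional embeddings) rather than the real 32_000 and 4096:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy sizes for illustration
token_index = torch.tensor([[3, 1, 7]])                 # (batch=1, sequence=3) of ints
token_embedding = emb(token_index)                      # (1, 3, 4): one trainable vector per token
```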

2.1.2. Self-attention#

The strength of the Transformer comes from its unique self-attention mechanism. Since I’m not better than Andrej Karpathy at teaching, I’ll point you to his lecture video on GPT (from 42:15) for more details.

Here’s my very lossy tl;dr:

  • A self-attention layer maps a sequence of vectors to another sequence of vectors.

    • It does so by comparing every pair of time steps (see the sketch below).

      • So it’s memory-expensive, but powerful.
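A bare-bones sketch of that pairwise comparison. This is an assumption-heavy toy: no learned query/key/value projections, no multiple heads, no causal mask, so it is the idea of self-attention rather than a real layer.

```python
import torch
import torch.nn.functional as F


def toy_self_attention(x):                  # x: (batch, sequence, emb_dim)
    # Score every time step against every other one: a (batch, sequence, sequence)
    # matrix -- this is where the quadratic memory cost comes from.
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)     # how much each step attends to each other step
    return weights @ x                      # back to (batch, sequence, emb_dim)
```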

2.1.3. LM Head#

This is where the hidden vectors come back to the token space: a final nn.Linear maps each emb_dim-sized vector to logits over the n_token vocabulary, and a softmax turns them into next-token probabilities.
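A minimal sketch of that projection, using toy sizes instead of the table's emb_dim=4096 and n_token=32_000:

```python
import torch
import torch.nn as nn

emb_dim, n_token = 8, 100                  # toy sizes; the table above uses 4096 and 32_000
lm_head = nn.Linear(emb_dim, n_token, bias=False)

hidden = torch.randn(2, 16, emb_dim)       # (batch, sequence, emb_dim) from the last block
logits = lm_head(hidden)                   # (batch, sequence, n_token)
probs = logits.softmax(dim=-1)             # next-token probabilities at every position
next_token = probs[:, -1].argmax(dim=-1)   # greedy pick for the final position
```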