2.2. Models
All of these decoder-only transformers share a very similar overall architecture; they differ mainly in their dimensions and tokenizers:
LLaMA-1-7B: 4096-dim, 32 heads, 32 layers, rotary positional encoding (RoPE), 32k vocab
Gemma-7B: 3072-dim, 16 heads, 28 layers, RoPE, 256k vocab (256,128 tokens)
Mistral-7B: 4096-dim, 32 heads, 32 layers, RoPE, 32k vocab
There are additional differences (e.g., in attention variants and feed-forward activations), but they are beyond the scope of this tutorial.
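To make the comparison concrete, here is a minimal sketch that records the three configurations side by side. The `ModelConfig` dataclass and its field names are illustrative choices for this tutorial, not any framework's API:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Hypothetical config record; field names are our own, not a library's.
    d_model: int     # hidden (residual stream) dimension
    n_heads: int     # number of attention heads
    n_layers: int    # number of transformer blocks
    vocab_size: int  # tokenizer vocabulary size

# The three 7B-class models from the list above.
CONFIGS = {
    "LLaMA-1-7B": ModelConfig(d_model=4096, n_heads=32, n_layers=32, vocab_size=32_000),
    "Gemma-7B":   ModelConfig(d_model=3072, n_heads=16, n_layers=28, vocab_size=256_128),
    "Mistral-7B": ModelConfig(d_model=4096, n_heads=32, n_layers=32, vocab_size=32_000),
}

for name, cfg in CONFIGS.items():
    print(f"{name}: {cfg}")
```

One caveat when comparing these numbers: the per-head dimension is usually d_model / n_heads (128 for LLaMA-1-7B and Mistral-7B), but Gemma-7B decouples the two and uses 256-dim heads.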