1.1. Definition
First of all, what is a language model?
Language Model
A language model is a model that learns patterns from language data.
Remember that this definition does not imply any virtues we may want from language models, e.g., being factual, responsible, or explainable.
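To make "learns patterns from language data" concrete, here is a toy sketch (my own illustration, not from any paper cited here): a bigram model that learns which word tends to follow which from a tiny corpus.

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model would train on billions of tokens.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word (bigram statistics).
counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_word_probs(word):
    """Relative frequency of each word that followed `word` in the corpus."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(next_word_probs("sat"))  # {'on': 1.0}
```

Note that it has learned a pattern ("sat" is always followed by "on") while having no notion whatsoever of being factual, responsible, or explainable.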
OK, so: what is a large language model? Defining an LLM is as impossible as defining jazz music 😉 (OK, perhaps a bit easier). Let me fail in this way:
Large Language Model
A large language model is a language model that is large enough, usually with more than a billion parameters, to demonstrate zero-shot and few-shot abilities.
See the table below and note the trend.
| Name | Number of parameters | Birth year |
|---|---|---|
| BERT (medium) | 0.047B | 2018 |
| BERT (base) | 0.110B | 2018 |
| BERT (large) | 0.340B | 2018 |
| BERT (xlarge) | 1.270B | 2018 |
| Some Google Translate models (LSTMs) | 0.160 - 0.380B | 2016-2020 (estimate) |
| GPT-1 | 0.117B | 2018 |
| GPT-2 | 1.5B | 2019 |
| GPT-3 | 1.3 - 175B | 2020 |
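Where do parameter counts like these come from? Here is a rough back-of-the-envelope sketch (my own approximation, using the standard per-layer accounting for a transformer and ignoring biases, LayerNorms, and positional embeddings):

```python
def approx_params(layers, d_model, vocab_size):
    """Crude transformer parameter estimate: per layer, the attention block has
    ~4*d^2 weights (Q, K, V, and output projections) and the feed-forward block
    ~8*d^2 (two d x 4d matrices), plus a vocab_size x d token embedding matrix."""
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return layers * per_layer + embeddings

# BERT (base): 12 layers, hidden size 768, ~30K WordPiece vocabulary.
print(f"{approx_params(12, 768, 30522) / 1e9:.3f}B")  # ~0.108B, close to 0.110B above
```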
BERT (xlarge) [DCLT18] has over 1B parameters, but it does not demonstrate any zero-shot or few-shot abilities, as it remains a word and document embedding model.
GPT-1 [RNS+18] is well under 1B and didn't claim any few-shot or zero-shot generalizability (its title is *Improving language understanding by generative pre-training*).
At 1.5B parameters, GPT-2 [RWC+19] was suddenly so powerful that OpenAI famously said, “Due to our concerns about malicious applications of the technology, we are not releasing the trained model”. And what was the title? *Language models are unsupervised multitask learners*.
A year later, GPT-3 [BMR+20] was presented in a paper titled *Language Models are Few-Shot Learners*.
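What do "zero-shot" and "few-shot" look like in practice? A minimal sketch in the style of the GPT-3 paper's translation illustration; `generate` below is a hypothetical stand-in for any text-completion model, not a real API:

```python
# Zero-shot: the model gets only an instruction and must complete the task.
zero_shot = (
    "Translate English to French:\n"
    "cheese =>"
)

# Few-shot: the model also sees a few in-context examples before the query
# (these examples are from the GPT-3 paper's illustration).
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

# generate(zero_shot)  # hypothetical call to a text-completion model
# generate(few_shot)   # large models benefit far more from these examples
```

No gradient update happens in either case; the "learning" is entirely in-context, which is the ability that, per the definition above, separates large language models from merely big ones.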