3.2. Sub-word Tokenizers#
Let’s keep using the same example sentence.
"I'm instantiating a tokenizer for my LLMs"
The GPT-4 tokenizer splits the text as follows:
I / 'm / instant / iating / a / tokenizer / for / my / L / LM / s
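You can reproduce this split yourself. Here is a minimal sketch using OpenAI's tiktoken library (assuming it is installed, e.g. `pip install tiktoken`); it loads the GPT-4 encoding and decodes each token id back into its text piece:

```python
# A minimal sketch: reproduce the GPT-4 (cl100k_base) split of the example sentence.
# Assumes the `tiktoken` package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "I'm instantiating a tokenizer for my LLMs"

token_ids = enc.encode(text)                   # integer token indices
pieces = [enc.decode([i]) for i in token_ids]  # each id decoded back to its text piece

print(" / ".join(pieces))
# Should reproduce the split shown above; note that most pieces carry their leading space.
```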
Compare this to word-based tokenizers:
Sub-words are smaller. It’s better to split `I` + `'m` than to keep `I'm` as a single token, because the pieces are smaller yet still meaningful units of text. Similarly, it’s better to split `instant` + `iating`. The same goes for `L` + `LM` + `s` versus a single `LLMs` token.
Smaller units lead to a smaller vocab size, which is better.
Indeed, GPT-4 uses a tokenizer named cl100k_base, whose vocab size is only about 100k, yet it can represent far more words than even a 1M-vocab word-based tokenizer could.
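A quick way to check the vocabulary size (a sketch, again assuming tiktoken is available):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.n_vocab)  # roughly 100k entries, as the name cl100k_base suggests
```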
3.2.1. Training Sub-word Tokenizers#
This is omitted in this book :p The algorithm is called “Byte-Pair Encoding” (BPE); go study it yourself!
My lazy summary:
As in the example above, all current LLMs use some kind of sub-word tokenizer.
You need to train these sub-word tokenizers on a text corpus. It’s not too computationally heavy to train tokenizers (a minimal training sketch follows below).
By also adopting the “byte fallback” property, out-of-vocabulary text is split into raw byte tokens and passed to the LLM.
This remedy enables LLMs to also process many languages, although it makes the tokenized result considerably longer for non-primary languages.
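As a concrete illustration, here is a minimal training sketch using the Hugging Face `tokenizers` library (my choice of tooling, not something prescribed above; the corpus file name `corpus.txt` is hypothetical). It trains a small byte-level BPE tokenizer; byte-level pre-tokenization gives the same “nothing is ever out-of-vocabulary” guarantee that byte fallback provides in SentencePiece-style tokenizers:

```python
# A minimal sketch of training a sub-word (BPE) tokenizer with the Hugging Face
# `tokenizers` library. The corpus file name is hypothetical, and vocab_size is
# kept tiny so the example trains quickly.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())
# Byte-level pre-tokenization: every input byte is representable, so no text is
# ever out-of-vocabulary (the property that "byte fallback" gives SentencePiece).
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)

trainer = BpeTrainer(
    vocab_size=5000,
    special_tokens=["<|endoftext|>"],
    initial_alphabet=ByteLevel.alphabet(),  # ensure all 256 byte symbols are in the vocab
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # hypothetical corpus file

encoding = tokenizer.encode("I'm instantiating a tokenizer for my LLMs")
print(encoding.tokens)  # sub-word pieces learned from the (small) corpus
print(encoding.ids)     # the integer indices the LLM would actually consume
```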
3.2.2. Pros#
Sub-words are more efficient (smaller vocab) and more effective (no out-of-vocabulary words) than whole words.
3.2.3. Cons#
It still simply maps text through token indices to embeddings, which prevents the language model from understanding anything that happens before the tokenization step.
For example, the model has little visibility into the spelling of the text. This will be discussed later in this tutorial.
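To make the spelling point concrete, here is a tiny sketch (using tiktoken again, my assumption about tooling): in the split above, tokenizer appears as a single piece, so the model receives one opaque integer for it rather than its individual characters.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Per the split shown in 3.2, " tokenizer" (with its leading space) is a single piece,
# so the LLM sees one integer id for it -- not the letters t, o, k, e, ...
ids = enc.encode(" tokenizer")
print(ids)       # the token id(s)
print(len(ids))  # expected to be 1, matching the split above
```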