# 2. Multimodal Input Processing
## 2.1. Recap: Text
A typical text-only LLM is a stack of self-attention layers, with an embedding layer on the input side and the lm_head (token-prediction) layer on the output side.
| Layer | Input | Output |
|---|---|---|
| nn.Embedding (token embedding) | (batch=*, sequence=2048, token_index: int) | (batch=*, sequence=2048, emb_dim=4096) |
| Self-Attention 1 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| Self-Attention 2 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| Self-Attention 3 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| … | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| Self-Attention 31 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| Self-Attention 32 | (batch, sequence, emb_dim) | (batch, sequence, emb_dim) |
| nn.Linear -> nn.Softmax (lm_head) | (batch, sequence, emb_dim) | (batch, sequence, n_token=50_000) |
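A minimal PyTorch sketch of the input and output sides of this table. The toy sizes stand in for the 2048/4096/50,000 used above; the layer names and shapes are illustrative, not those of a specific model:

```python
import torch
import torch.nn as nn

# Toy sizes so the sketch runs anywhere; the table above uses
# sequence=2048, emb_dim=4096, n_token=50_000.
n_token, emb_dim, seq_len, batch = 1_000, 64, 16, 2

token_embedding = nn.Embedding(n_token, emb_dim)  # input side of the table
lm_head = nn.Linear(emb_dim, n_token)             # output side of the table

token_ids = torch.randint(0, n_token, (batch, seq_len))  # (batch, sequence) integer ids
x = token_embedding(token_ids)                           # (batch, sequence, emb_dim)
# ... self-attention layers 1..32 would transform x while keeping this shape ...
logits = lm_head(x)                                      # (batch, sequence, n_token)
probs = torch.softmax(logits, dim=-1)                    # next-token distribution per position
print(x.shape, logits.shape)  # (2, 16, 64) and (2, 16, 1000)
```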
In such an LLM, text is handled first by the tokenizer and then by the embedding layer. Only then is the text represented as a sequence of vectors, i.e., embedded in a vector space that the Transformer (the large language model) can digest. Even before training, this structure (a reasonably effective tokenization plus a vector representation) gives the vector sequence room to become semantically meaningful.
During training, the embedding layer is trained together with the rest of the model, so that it outputs a semantically meaningful representation of the text:
text -> [tokenizer] -> integer seq -> [embedding layer] -> vector seq -> [Transformers]
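A minimal sketch of this pipeline, using a toy whitespace tokenizer as a stand-in for a real subword tokenizer (e.g., BPE); the vocabulary and dimensions here are made up for illustration:

```python
import torch
import torch.nn as nn

# Toy whitespace tokenizer: each word maps to an integer id.
# A real LLM uses a learned subword tokenizer (e.g., BPE), but its role is the same.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenize(text: str) -> torch.Tensor:
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    return torch.tensor(ids).unsqueeze(0)         # (batch=1, sequence)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

token_ids = tokenize("The cat sat on the mat")    # integer sequence
vectors = embedding(token_ids)                    # (1, 6, 8): one vector per token
# `vectors` is the vector sequence that the stack of self-attention layers consumes.
print(token_ids.shape, vectors.shape)
```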
## 2.2. Image: Encoder and Adaptor
Likewise, we need some way to process an image as an input to an LLM. Let's break this process into an image encoder and an image adaptor. At a high level, this is the same as what we did with the text: as a result, we get a vector representation of the image.

- The encoder outputs a vector representation (or a sequence of vectors).
- The adaptor is usually a linear layer (a matrix multiplication).
This process is illustrated nicely in [LLWL24], where visual instruction-following tasks (e.g., visual question answering and image captioning) were performed.
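A minimal sketch of the encoder-plus-adaptor idea. The image encoder is replaced by a random stand-in for a pretrained vision backbone (e.g., a ViT that outputs one feature vector per patch); the class name ImageAdaptor, the patch count, and the dimensions are illustrative choices, not taken from [LLWL24]:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096   # illustrative sizes


class ImageAdaptor(nn.Module):
    """Projects image-encoder features into the LLM's embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)   # often just a matrix multiplication

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, n_patches, vision_dim) from the image encoder
        return self.proj(image_features)             # (batch, n_patches, llm_dim)


# Stand-in for the output of a pretrained image encoder (e.g., 256 patch features).
image_features = torch.randn(1, 256, vision_dim)

adaptor = ImageAdaptor(vision_dim, llm_dim)
visual_tokens = adaptor(image_features)              # (1, 256, llm_dim)

# Text token embeddings from the LLM's own embedding layer (random stand-ins here).
text_tokens = torch.randn(1, 32, llm_dim)

# The visual "tokens" are concatenated with the text tokens along the sequence
# dimension and fed to the Transformer as one sequence.
llm_input = torch.cat([visual_tokens, text_tokens], dim=1)   # (1, 288, llm_dim)
print(llm_input.shape)
```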
At the end, the Transformer architecture (