Multimodal LLMs

1. Multimodal LLMs

Multimodal LLMs are LLMs that can understand, and sometimes generate, modalities other than text, such as images, audio, speech, music, and video.

Multimodal LLMs have already attracted considerable attention. Products such as ChatGPT (with GPT-4) support image understanding and generation, and open-source models such as LLaVA offer comparable image understanding.

As music information retrieval (MIR) researchers, we naturally ask: what about music and audio? This chapter addresses that question, with the scope limited to music audio understanding, not generation.