6. Evaluation

“All evals are so noisy, but some are useful.”

Keunwoo Choi, 2024-05-04.

As the highly respected quote above suggests, evaluating LLMs is extremely tricky, because:

A lot of LLM use-cases are subjective.
The power of LLMs lies in generative tasks, which have no single correct answer.
LLMs are general-purpose, so we’d want to evaluate them across every domain, which is difficult to do.

Yet we still need to evaluate them, especially after spending tons of GPU hours and $$. How?