6. Evaluation

“All evals are so noisy, but some are useful.”

  • Keunwoo Choi, 2024-05-04.

As the highly respected quote above suggests, evaluating LLMs is extremely tricky, because:

  • A lot of LLM use-cases are subjective

  • The power of LLMs lies in generative tasks, which have no single correct answer

  • LLMs are general-purpose, so we’d want to evaluate them across every domain, which is difficult to do.

Yet we still need to evaluate them, especially after spending tons of GPU hours and $$. How?