Vibe Check

6.1. Vibe Check

Believe it or not, one of the most popular ways to evaluate language models is a good old-fashioned vibe check. Yes, we’re talking about having real humans read the model’s outputs and rate them on factors like coherence, relevance, and overall quality.

Vibe Check

A vibe check is an evaluation of an LLM that you do by using the model directly yourself.

Vibe checks are particularly popular for evaluating conversational language models like chatbots and digital assistants. After all, these models are designed to interact with humans in a natural, back-and-forth manner. And what better way to assess that than by, well, having humans interact with them? Every person has their own criteria for what makes a good conversation, so vibe checks allow developers to quickly get a sense of how their model is performing from multiple perspectives.

Here’s how to do it:

1. Take a deep breath.
2. Talk to the LLM.
3. Get the vibe. Done!
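
If you want to keep a record of the vibes instead of just feeling them, the steps above can be wrapped in a tiny script. What follows is only a minimal sketch, not a standard tool: `vibe_check`, `VibeRecord`, and the 1-5 rating scale are names and choices made up for this illustration, and `echo_model` is a stand-in for whatever call actually reaches your model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VibeRecord:
    prompt: str
    response: str
    rating: int  # assumed scale: 1 (bad vibe) to 5 (great vibe)
    note: str = ""


def vibe_check(
    generate: Callable[[str], str],
    prompts: List[str],
    ask: Callable[[str], str] = input,
) -> List[VibeRecord]:
    """Show each model response to a human and record their rating."""
    records = []
    for prompt in prompts:
        response = generate(prompt)
        print(f"\nPROMPT:   {prompt}\nRESPONSE: {response}")
        raw = ask("Rating 1-5: ").strip()
        # Keep asking until the rater gives a valid score.
        while raw not in {"1", "2", "3", "4", "5"}:
            raw = ask("Please enter a whole number from 1 to 5: ").strip()
        note = ask("Optional note: ").strip()
        records.append(VibeRecord(prompt, response, int(raw), note))
    return records


def echo_model(prompt: str) -> str:
    # Hypothetical stand-in; replace with a real model call.
    return f"You said: {prompt}"


if __name__ == "__main__":
    results = vibe_check(echo_model, ["Tell me a joke.", "Explain recursion."])
    average = sum(r.rating for r in results) / len(results)
    print(f"\nAverage vibe: {average:.2f}")
```

Passing the model in as a plain function keeps the sketch independent of any particular API, and the `ask` parameter exists only so the rating prompt can be swapped out, for example when scripting the loop.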

Vibe checks are far from perfect. They’re subjective, prone to annotator bias, and easily swayed by factors like prompt phrasing and output length. They’re also not exactly scalable – you can only collect so many human ratings before it becomes too time-consuming and expensive.
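
One cheap way to see that subjectivity is to have several people rate the same outputs and look at how much they disagree. The snippet below is a rough sketch under that assumption: `summarize_ratings` and the example scores are invented for illustration, and the population standard deviation is just one simple choice of disagreement measure.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize_ratings(ratings: Dict[str, List[int]]) -> Dict[str, Dict[str, float]]:
    """Report the mean score and rater disagreement for each output."""
    summary = {}
    for output_id, scores in ratings.items():
        summary[output_id] = {
            "mean": mean(scores),
            "spread": pstdev(scores),  # 0 means every rater agreed
            "n": len(scores),
        }
    return summary


if __name__ == "__main__":
    # Made-up scores from three hypothetical raters on a 1-5 scale.
    example = {"response_a": [4, 4, 4], "response_b": [1, 5, 3]}
    for output_id, stats in summarize_ratings(example).items():
        print(output_id, stats)
```

A large spread on an output is a hint that the "vibe" depends more on who is rating than on the model.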

Even so, the popularity of the vibe check shows how difficult it is to evaluate LLMs in other ways.