AI isn’t like other software. Historically, when you used a computer, input X would always give you output Y. With AI, you give it input X and get back output I, T, D, E, P, E, N, D, S. The computer responds with a marginally different answer every time, so evaluating and comparing these tools is a devilishly hard task. But because we are devoting ever more of our intellectual effort to LLMs, this is a task we must figure out. Michael Taylor has spent an obscene amount of time building AI evaluations, and in this article, he explains how you can do the same. As go the evals, so goes the world. —Evan Armstrong

To paraphrase Picasso, when AI experts get together, they talk about transformers and GPUs and AI safety. When prompt engineers get together, they talk about how to run cheap evals.

Evals, short for “evaluation metrics,” are how we measure the alignment between AI responses and business goals, as well as the accuracy, reliability, and quality of those responses. In turn, evals are matched against generally accepted benchmarks developed by research organizations or published in scientific papers. Benchmarks often have obscure names, like MMLU, HumanEval, or DROP. Together, evals and benchmarks help discern a model’s quality and its progress over previous models. Below is an example for Anthropic’s new model, Claude 3.
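At its simplest, an eval is just a set of test cases scored automatically: prompts paired with expected answers, with the model's responses graded against them. The following is a minimal illustrative sketch, not any vendor's API; the `call_model` stub stands in for a real LLM call, and exact-match scoring is only one of many possible grading strategies:

```python
# Minimal eval harness sketch. `call_model` is a placeholder for a
# real LLM API call; its canned answers exist only to make the
# example runnable.

def call_model(prompt: str) -> str:
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")

# Each test case pairs a prompt with the answer we expect.
test_cases = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def run_eval(cases):
    # Exact-match scoring: 1 point if the response matches exactly, else 0.
    scores = [int(call_model(p).strip() == expected) for p, expected in cases]
    return sum(scores) / len(scores)

print(f"accuracy: {run_eval(test_cases):.0%}")  # → accuracy: 100%
```

In practice, exact match is often too strict for free-form text, which is why real eval suites also use fuzzy matching, rubric grading, or a second model as judge.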