Evaluation concepts

Evals

Repeatable tests that measure whether a model, prompt, tool, or agent workflow performs acceptably for a defined task.

Evals turn subjective impressions of a model into measurable checks. They can test correctness, format compliance, reasoning quality, tool-use behavior, safety, latency, and cost.
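As a rough illustration of what a measurable check can look like, the sketch below runs two cases against a model: one for correctness and one for format compliance. Everything named here (EvalCase, run_evals, fake_model, the case names) is hypothetical and chosen for the example; fake_model is a stand-in for a real model call.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def is_valid_json(output: str) -> bool:
    """Format-compliance check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case once and record pass/fail by case name."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (assumption for this sketch).
    if "JSON" in prompt:
        return '{"answer": 4}'
    return "4"


CASES = [
    # Correctness: the answer must be exactly "4".
    EvalCase("arithmetic_correct", "What is 2 + 2?", lambda out: out.strip() == "4"),
    # Format compliance: the reply must be parseable JSON.
    EvalCase("json_format", "Reply in JSON: what is 2 + 2?", is_valid_json),
]

results = run_evals(fake_model, CASES)
pass_rate = sum(results.values()) / len(results)
```

The same structure could in principle hold checks for latency or cost by timing or metering the model call inside run_evals; that is left out to keep the sketch short.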

Good evals are tied to a decision. A model upgrade eval should answer whether to switch models. A regression eval should answer whether a prompt or code change broke a known behavior. Without that decision link, evals become dashboards that are easy to ignore.
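One way to make the decision link concrete is to have the eval return the decision itself rather than only a score. The sketch below assumes eval results are already available as pass/fail values; the function names, the case names, and the min_gain threshold of 0.02 are illustrative assumptions, not a recommended standard.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing cases; 0.0 for an empty result list."""
    return sum(results) / len(results) if results else 0.0


def should_switch(baseline: list[bool], candidate: list[bool], min_gain: float = 0.02) -> bool:
    """Model-upgrade decision: switch only if the candidate beats the
    baseline pass rate by at least min_gain (an assumed threshold)."""
    return pass_rate(candidate) - pass_rate(baseline) >= min_gain


def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Regression decision: names of cases that passed before the change
    and fail (or are missing) after it."""
    return sorted(name for name, ok in before.items() if ok and not after.get(name, False))


# Hypothetical results for a model-upgrade eval.
baseline_results = [True, True, False, True]
candidate_results = [True, True, True, True]
switch = should_switch(baseline_results, candidate_results)

# Hypothetical results for a regression eval around a prompt change.
before_change = {"refund_policy": True, "json_format": True}
after_change = {"refund_policy": True, "json_format": False}
broken = regressions(before_change, after_change)
```

A non-empty broken list could block a merge, and switch could gate a rollout, so that each eval run ends in an action rather than a number on a dashboard.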

Related terms
agent-evaluation, benchmark-suite, precision, recall, definition-of-done