Evaluation concepts
Evals
Repeatable tests that measure whether a model, prompt, tool, or agent workflow performs acceptably for a defined task.
Evals turn subjective impressions of a model's behavior into measurable checks. They can test correctness, format compliance, reasoning quality, tool-use behavior, safety, latency, and cost.
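As a rough illustration of what a measurable check can look like, the sketch below runs a few cases through a model and records pass or fail for each. Everything named here (EvalCase, stub_model, run_evals, the prompts and canned answers) is hypothetical and invented for the example; stub_model stands in for a real model call, and only correctness and format compliance are covered.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One repeatable test: a prompt plus a pass/fail check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def is_valid_json(output: str) -> bool:
    """Format-compliance check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not a real API)."""
    canned = {
        "What is 2 + 2?": "4",
        "Return an empty JSON object.": "{}",
        "Name the capital of France.": "I think it is Lyon.",
    }
    return canned.get(prompt, "")


CASES = [
    EvalCase("arithmetic", "What is 2 + 2?", lambda out: out.strip() == "4"),
    EvalCase("json_format", "Return an empty JSON object.", is_valid_json),
    EvalCase("capital", "Name the capital of France.", lambda out: "Paris" in out),
]


def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case against the model and record whether its check passed."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


results = run_evals(stub_model, CASES)
pass_rate = sum(results.values()) / len(results)
```

With the canned answers above, the arithmetic and JSON cases pass and the capital case fails, so the pass rate is two out of three. Dimensions such as latency or cost could be added as further checks in the same shape.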
Good evals are tied to a decision. A model upgrade eval should answer whether to switch models. A regression eval should answer whether a prompt or code change broke a known behavior. Without that decision link, evals become dashboards that are easy to ignore.
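One way to make the decision link explicit is to have the eval return the decision itself, not just a score. The sketch below is a minimal illustration under assumptions: EvalSummary, should_switch_models, find_regressions, and the 0.02 minimum gain are all hypothetical names and values chosen for the example, not a recommended threshold.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Aggregate result of one eval run (hypothetical structure)."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def should_switch_models(baseline: EvalSummary, candidate: EvalSummary,
                         min_gain: float = 0.02) -> bool:
    """Model upgrade eval: switch only if the candidate beats the baseline
    pass rate by at least min_gain (an assumed, illustrative threshold)."""
    return candidate.pass_rate - baseline.pass_rate >= min_gain


def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Regression eval: names of cases that passed before the change
    and no longer pass after it."""
    return sorted(
        name for name, ok in before.items()
        if ok and not after.get(name, False)
    )


# Usage with made-up numbers.
switch = should_switch_models(EvalSummary(80, 100), EvalSummary(85, 100))
broken = find_regressions(
    {"arithmetic": True, "json_format": True, "capital": False},
    {"arithmetic": True, "json_format": False, "capital": False},
)
```

Here switch is True and broken lists only json_format. Because each function answers a yes/no question or names the cases that broke, the result can gate a rollout or a merge directly.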