Calibration
Understand how Autouser calibration works and how to interpret agreement scores.
What is Calibration?
Calibration measures how well an autouser's ratings align with human evaluator ratings. A well-calibrated autouser produces ratings that are consistent with what human evaluators would give, making it a reliable proxy for human judgment.
How Calibration Works
- Collect human ratings on an evaluation.
- Run the autouser on the same evaluation.
- Autouser computes agreement between the human and autouser ratings using statistical measures (see the sketch after this list).
- The calibration score shows how closely the autouser matches human judgment.
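The exact statistical measure Autouser applies is not specified here, but as an illustration of the idea, here is a minimal sketch assuming both raters score the same items on a shared ordinal scale (e.g., 1-5) and using Spearman rank correlation, one common agreement measure for ordinal ratings:

```python
# Illustrative sketch only: Autouser's actual agreement formula is not
# documented here. Spearman rank correlation is one common choice for
# ordinal ratings on a shared scale.
from scipy.stats import spearmanr

def agreement_score(human_ratings, autouser_ratings):
    """Return a correlation-style agreement score in [-1, 1]."""
    if len(human_ratings) != len(autouser_ratings):
        raise ValueError("Both rating lists must cover the same items")
    score, _p_value = spearmanr(human_ratings, autouser_ratings)
    return score

# The same five items, rated by human evaluators and by the autouser.
human = [5, 4, 2, 3, 1]
auto = [5, 3, 2, 4, 1]
print(f"Agreement: {agreement_score(human, auto):.2f}")  # 0.90 for this data
```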
Interpreting Calibration Scores
- High agreement (>0.7): The autouser closely matches human ratings and can be used confidently.
- Moderate agreement (0.4-0.7): The autouser is directionally correct but may differ on specific dimensions.
- Low agreement (<0.4): The autouser needs prompt adjustments or may not suit this evaluation type.
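As a quick reference, a hypothetical helper that encodes these bands might look like the following; the thresholds come from this guide, not from an official Autouser API:

```python
# Hypothetical helper mapping an agreement score to the bands above.
def interpret_agreement(score: float) -> str:
    if score > 0.7:
        return "high: autouser closely matches human ratings"
    if score >= 0.4:
        return "moderate: directionally correct, verify key dimensions"
    return "low: adjust the prompt or reconsider this evaluation type"

print(interpret_agreement(0.82))  # high: autouser closely matches human ratings
```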
Improving Calibration
If an autouser has low calibration, try refining its system prompt to better match your evaluation criteria. Ensure the rating rubrics are clear and specific, as ambiguous rubrics lead to inconsistent ratings from both humans and AI.
Calibration scores are evaluation-specific. An autouser may be well-calibrated for one type of evaluation but not another. Re-check calibration when using an autouser in a new context.
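One lightweight way to act on this is to track calibration per evaluation and flag anything unscored or below your confidence threshold before relying on the autouser. The structure and names below are a hypothetical sketch, not an Autouser feature:

```python
# Hypothetical bookkeeping: calibration is tracked per evaluation because a
# score from one evaluation type does not transfer to another.
calibration_by_evaluation = {
    "support-chat-quality": 0.78,   # high agreement: safe to rely on
    "code-review-accuracy": 0.35,   # low agreement: refine the prompt first
}

def needs_recalibration(evaluation_id: str, threshold: float = 0.7) -> bool:
    """Flag evaluations where the autouser has no score or a low one."""
    return calibration_by_evaluation.get(evaluation_id, 0.0) < threshold

print(needs_recalibration("new-onboarding-flow"))  # True: never calibrated
```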