Calibration
Understand how Autouser calibration works and how to interpret agreement scores.
What is Calibration?
Calibration measures how well an autouser's ratings align with human evaluator ratings. A well-calibrated autouser produces ratings that are consistent with what human evaluators would give, making it a reliable proxy for human judgment.
How Calibration Works
- Collect human ratings on an evaluation.
- Run the autouser on the same evaluation.
- Autouser computes agreement between the human and autouser ratings using statistical measures (see the sketch after this list).
- The calibration score shows how closely the autouser matches human judgment.
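The exact statistical measure Autouser applies is not specified here, but as an illustration of the idea, here is a minimal sketch assuming both raters score the same items on a shared ordinal scale (e.g., 1-5) and using Spearman rank correlation, one common agreement measure for ordinal ratings:

```python
# Illustrative sketch only: Autouser's actual agreement formula is not
# documented here. Spearman rank correlation is one common choice for
# ordinal ratings on a shared scale.
from scipy.stats import spearmanr

def agreement_score(human_ratings, autouser_ratings):
    """Return a correlation-style agreement score in [-1, 1]."""
    if len(human_ratings) != len(autouser_ratings):
        raise ValueError("Both rating lists must cover the same items")
    score, _p_value = spearmanr(human_ratings, autouser_ratings)
    return score

# The same five items, rated by human evaluators and by the autouser.
human = [5, 4, 2, 3, 1]
auto = [5, 3, 2, 4, 1]
print(f"Agreement: {agreement_score(human, auto):.2f}")  # 0.90 for this data
```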
Interpreting Calibration Scores
- High agreement (>0.7): The autouser closely matches human ratings and can be used confidently.
- Moderate agreement (0.4-0.7): The autouser is directionally correct but may differ on specific dimensions.
- Low agreement (<0.4): The autouser needs prompt adjustments or may not suit this evaluation type.
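As a quick reference, a hypothetical helper that encodes these bands might look like the following; the thresholds come from this guide, not from an official Autouser API:

```python
# Hypothetical helper mapping an agreement score to the bands above.
def interpret_agreement(score: float) -> str:
    if score > 0.7:
        return "high: autouser closely matches human ratings"
    if score >= 0.4:
        return "moderate: directionally correct, verify key dimensions"
    return "low: adjust the prompt or reconsider this evaluation type"

print(interpret_agreement(0.82))  # high: autouser closely matches human ratings
```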
Improving Calibration
If an autouser has low calibration, try refining its system prompt to better match your evaluation criteria. Ensure the rating rubrics are clear and specific, as ambiguous rubrics lead to inconsistent ratings from both humans and AI.
Calibration scores are evaluation-specific. An autouser may be well-calibrated for one type of evaluation but not another. Re-check calibration when using an autouser in a new context.
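One lightweight way to act on this is to track calibration per evaluation and flag anything unscored or below your confidence threshold before relying on the autouser. The structure and names below are a hypothetical sketch, not an Autouser feature:

```python
# Hypothetical bookkeeping: calibration is tracked per evaluation because a
# score from one evaluation type does not transfer to another.
calibration_by_evaluation = {
    "support-chat-quality": 0.78,   # high agreement: safe to rely on
    "code-review-accuracy": 0.35,   # low agreement: refine the prompt first
}

def needs_recalibration(evaluation_id: str, threshold: float = 0.7) -> bool:
    """Flag evaluations where the autouser has no score or a low one."""
    return calibration_by_evaluation.get(evaluation_id, 0.0) < threshold

print(needs_recalibration("new-onboarding-flow"))  # True: never calibrated
```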