Agreement Metrics

Understand inter-rater agreement, Cohen's Kappa, and what they tell you about evaluation reliability.

Why Agreement Matters

Inter-rater agreement tells you how consistently evaluators rated the same experience. High agreement means your rating dimensions and rubrics are well-defined and evaluators interpret them similarly. Low agreement may indicate ambiguous criteria.

Cohen's Kappa

Cohen's Kappa is a statistical measure of agreement between two raters that accounts for the agreement you would expect by chance. For more than two raters, Autousers extends this with Fleiss' Kappa to measure agreement across the whole group.
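
The exact implementation Autousers uses is not shown here, but the following sketch illustrates how both statistics are typically computed (the function names and data layout are illustrative):

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Cohen's kappa for two raters labelling the same items."""
        n = len(rater_a)
        # Observed agreement: fraction of items where both raters chose the same label.
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Chance agreement: product of each rater's marginal label frequencies.
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
        return (p_o - p_e) / (1 - p_e)

    def fleiss_kappa(counts):
        """Fleiss' kappa. counts[i][j] = number of raters who put item i in category j."""
        n_items = len(counts)
        n_raters = sum(counts[0])
        # Per-item agreement: proportion of rater pairs that agree on each item.
        p_items = [
            (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
            for row in counts
        ]
        p_bar = sum(p_items) / n_items
        # Chance agreement from the overall category proportions.
        n_categories = len(counts[0])
        p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_categories)]
        p_e = sum(p * p for p in p_j)
        return (p_bar - p_e) / (1 - p_e)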

Interpreting Kappa Scores

  • 0.81-1.00: Almost perfect agreement
  • 0.61-0.80: Substantial agreement
  • 0.41-0.60: Moderate agreement
  • 0.21-0.40: Fair agreement
  • 0.00-0.20: Slight agreement
  • Below 0.00: Less than chance agreement
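
As a quick reference, here is a small helper that maps a kappa value to the bands above. The thresholds mirror the list; the function name is illustrative.

    def interpret_kappa(kappa):
        """Map a kappa score to the agreement band listed above."""
        if kappa < 0.0:
            return "Less than chance agreement"
        if kappa <= 0.20:
            return "Slight agreement"
        if kappa <= 0.40:
            return "Fair agreement"
        if kappa <= 0.60:
            return "Moderate agreement"
        if kappa <= 0.80:
            return "Substantial agreement"
        return "Almost perfect agreement"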

Improving Agreement

If agreement is low, review your rubrics for ambiguity. Provide clearer criteria for each rating level and consider running a calibration session where evaluators discuss their interpretations before rating independently.
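
One practical way to find where a rubric is ambiguous is to compute agreement per rating dimension and flag the dimensions that fall below a chosen threshold. The dimension names, ratings, and threshold below are hypothetical, and cohens_kappa is the sketch defined earlier.

    # Hypothetical ratings from two evaluators, grouped by rating dimension.
    ratings_by_dimension = {
        "helpfulness": (["3", "4", "4", "2"], ["3", "4", "3", "2"]),
        "tone":        (["2", "2", "4", "3"], ["4", "2", "1", "3"]),
    }

    for dimension, (rater_a, rater_b) in ratings_by_dimension.items():
        kappa = cohens_kappa(rater_a, rater_b)  # sketch from the Cohen's Kappa section
        if kappa < 0.61:  # below the "substantial agreement" band
            print(f"{dimension}: kappa = {kappa:.2f} -- review this rubric for ambiguity")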

Note: Agreement between human raters and Autousers provides a calibration measure for your AI evaluation setup. High human-AI agreement suggests the Autouser is a reliable proxy.
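
If you already have human ratings alongside Autouser ratings for the same experiences, this calibration check can be as simple as computing Cohen's Kappa between the two sets of labels. The labels below are hypothetical, and scikit-learn is used here only as one convenient option.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical pass/fail judgements on the same five experiences.
    human_ratings    = ["pass", "pass", "fail", "pass", "fail"]
    autouser_ratings = ["pass", "fail", "fail", "pass", "fail"]

    kappa = cohen_kappa_score(human_ratings, autouser_ratings)
    print(f"Human vs. Autouser agreement (kappa): {kappa:.2f}")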