Agreement Metrics

Understand inter-rater agreement, Cohen's Kappa, and what they tell you about evaluation reliability.

Why Agreement Matters

Inter-rater agreement tells you how consistently evaluators rated the same experience. High agreement means your rating dimensions and rubrics are well-defined and evaluators interpret them similarly. Low agreement may indicate ambiguous criteria.

Cohen's Kappa

Cohen's Kappa is a statistical measure of agreement between two raters that accounts for the agreement you would expect by chance. For more than two raters, Autousers extends this with Fleiss' Kappa to measure agreement across the whole group.
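
The exact implementation Autousers uses is not shown here, but the following sketch illustrates how both statistics are typically computed (the function names and data layout are illustrative):

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        """Cohen's kappa for two raters labelling the same items."""
        n = len(rater_a)
        # Observed agreement: fraction of items where both raters chose the same label.
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        # Chance agreement: product of each rater's marginal label frequencies.
        freq_a, freq_b = Counter(rater_a), Counter(rater_b)
        p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
        return (p_o - p_e) / (1 - p_e)

    def fleiss_kappa(counts):
        """Fleiss' kappa. counts[i][j] = number of raters who put item i in category j."""
        n_items = len(counts)
        n_raters = sum(counts[0])
        # Per-item agreement: proportion of rater pairs that agree on each item.
        p_items = [
            (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
            for row in counts
        ]
        p_bar = sum(p_items) / n_items
        # Chance agreement from the overall category proportions.
        n_categories = len(counts[0])
        p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_categories)]
        p_e = sum(p * p for p in p_j)
        return (p_bar - p_e) / (1 - p_e)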

Interpreting Kappa Scores

  • 0.81-1.00: Almost perfect agreement
  • 0.61-0.80: Substantial agreement
  • 0.41-0.60: Moderate agreement
  • 0.21-0.40: Fair agreement
  • 0.00-0.20: Slight agreement
  • Below 0.00: Less than chance agreement
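
As a quick reference, here is a small helper that maps a kappa value to the bands above. The thresholds mirror the list; the function name is illustrative.

    def interpret_kappa(kappa):
        """Map a kappa score to the agreement band listed above."""
        if kappa < 0.0:
            return "Less than chance agreement"
        if kappa <= 0.20:
            return "Slight agreement"
        if kappa <= 0.40:
            return "Fair agreement"
        if kappa <= 0.60:
            return "Moderate agreement"
        if kappa <= 0.80:
            return "Substantial agreement"
        return "Almost perfect agreement"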

Improving Agreement

If agreement is low, review your rubrics for ambiguity. Provide clearer criteria for each rating level and consider running a calibration session where evaluators discuss their interpretations before rating independently.
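
One practical way to find where a rubric is ambiguous is to compute agreement per rating dimension and flag the dimensions that fall below a chosen threshold. The dimension names, ratings, and threshold below are hypothetical, and cohens_kappa is the sketch defined earlier.

    # Hypothetical ratings from two evaluators, grouped by rating dimension.
    ratings_by_dimension = {
        "helpfulness": (["3", "4", "4", "2"], ["3", "4", "3", "2"]),
        "tone":        (["2", "2", "4", "3"], ["4", "2", "1", "3"]),
    }

    for dimension, (rater_a, rater_b) in ratings_by_dimension.items():
        kappa = cohens_kappa(rater_a, rater_b)  # sketch from the Cohen's Kappa section
        if kappa < 0.61:  # below the "substantial agreement" band
            print(f"{dimension}: kappa = {kappa:.2f} -- review this rubric for ambiguity")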

Note: Agreement between human raters and Autousers provides a calibration measure for your AI evaluation setup. High human-AI agreement suggests the Autouser is a reliable proxy.
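
If you already have human ratings alongside Autouser ratings for the same experiences, this calibration check can be as simple as computing Cohen's Kappa between the two sets of labels. The labels below are hypothetical, and scikit-learn is used here only as one convenient option.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical pass/fail judgements on the same five experiences.
    human_ratings    = ["pass", "pass", "fail", "pass", "fail"]
    autouser_ratings = ["pass", "fail", "fail", "pass", "fail"]

    kappa = cohen_kappa_score(human_ratings, autouser_ratings)
    print(f"Human vs. Autouser agreement (kappa): {kappa:.2f}")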