Agreement Metrics
Understand inter-rater agreement, Cohen's Kappa, and what they tell you about evaluation reliability.
Why Agreement Matters
Inter-rater agreement tells you how consistently evaluators rated the same experience. High agreement means your rating dimensions and rubrics are well-defined and evaluators interpret them similarly. Low agreement may indicate ambiguous criteria.
Cohen's Kappa
Cohen's Kappa is a statistical measure of agreement between two raters that corrects for agreement expected by chance: it compares the observed agreement p_o with the chance agreement p_e implied by each rater's label frequencies, so kappa = (p_o - p_e) / (1 - p_e). Autousers extends this to more than two raters using Fleiss' Kappa for group agreement measurement.
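As a rough sketch of how these statistics are computed (a minimal pure-Python illustration, not Autousers' internal implementation), the two functions below compute Cohen's Kappa from two raters' label sequences and Fleiss' Kappa from an items-by-categories count table:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two raters."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    return (p_o - p_e) / (1 - p_e)

def fleiss_kappa(table):
    """Fleiss' kappa from a list of rows, one per item, where
    table[i][j] counts the raters who put item i in category j."""
    n_items = len(table)
    n_raters = sum(table[0])  # every item must have the same rater count
    # Mean per-item agreement: fraction of rater pairs that agree.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```

For example, two raters who agree on four of five verdicts but have different label frequencies land well below a raw 0.8 agreement score:

```python
a = ["pass", "pass", "fail", "pass", "fail"]
b = ["pass", "fail", "fail", "pass", "fail"]
print(cohens_kappa(a, b))  # ~0.62: substantial, not "almost perfect"
```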
Interpreting Kappa Scores
- 0.81-1.00: Almost perfect agreement
- 0.61-0.80: Substantial agreement
- 0.41-0.60: Moderate agreement
- 0.21-0.40: Fair agreement
- 0.00-0.20: Slight agreement
- Below 0.00: Less than chance agreement
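These bands translate directly into a small lookup; a minimal sketch (the function name is illustrative, not part of Autousers' API):

```python
def interpret_kappa(kappa):
    """Map a kappa score to the agreement band listed above."""
    bands = [
        (0.81, "almost perfect"),
        (0.61, "substantial"),
        (0.41, "moderate"),
        (0.21, "fair"),
        (0.00, "slight"),
    ]
    for lower, label in bands:
        if kappa >= lower:
            return label
    return "less than chance"

print(interpret_kappa(0.62))  # substantial
```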
Improving Agreement
If agreement is low, review your rubrics for ambiguity. Provide clearer criteria for each rating level and consider running a calibration session where evaluators discuss their interpretations before rating independently.
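One way to act on a low score is to locate where the disagreement comes from. The sketch below (names and data structure are hypothetical; it reuses the cohens_kappa function from the earlier example) computes agreement per rating dimension and flags the dimensions whose rubrics most need attention:

```python
def flag_ambiguous_dimensions(ratings_by_dimension, threshold=0.6):
    """Return dimensions whose inter-rater kappa falls below a threshold.

    ratings_by_dimension maps a dimension name to a pair of label
    sequences, one per rater (structure and names are illustrative).
    """
    flagged = {}
    for dimension, (rater_a, rater_b) in ratings_by_dimension.items():
        kappa = cohens_kappa(rater_a, rater_b)  # from the sketch above
        if kappa < threshold:
            flagged[dimension] = kappa
    return flagged
```

Dimensions that consistently fall below the threshold are the best candidates for rewritten criteria or a calibration session.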