Autousers

Understanding Statistics

Learn about statistical metrics, confidence intervals, and how to interpret evaluation data.

Key Metrics

The Statistics tab provides several metrics to help you understand your evaluation results beyond simple averages.

Mean and Standard Deviation

The mean is the average rating for a dimension. The standard deviation shows how spread out the ratings are. A low standard deviation means evaluators largely agreed; a high one indicates diverse opinions.
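To make the difference concrete, here is a minimal sketch using Python's standard library. The two rating lists are hypothetical data on an assumed 1-5 scale, not output from the Statistics tab, and the sample standard deviation is used as an assumption about how spread is calculated.

```python
from statistics import mean, stdev

# Hypothetical ratings on an assumed 1-5 scale (illustrative data only).
consistent = [4, 4, 5, 4, 4]   # evaluators largely agree
divided = [1, 5, 2, 5, 3]      # evaluators disagree

def summarize(ratings):
    """Return (mean, sample standard deviation) for a list of ratings."""
    return mean(ratings), stdev(ratings)

consistent_mean, consistent_sd = summarize(consistent)
divided_mean, divided_sd = summarize(divided)

print(f"consistent: mean={consistent_mean:.2f}, sd={consistent_sd:.2f}")
print(f"divided:    mean={divided_mean:.2f}, sd={divided_sd:.2f}")
```

The second list has the larger standard deviation, which is the signal that its mean hides disagreement among evaluators.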

Confidence Intervals

Confidence intervals show the range within which the true average rating likely falls. A narrow interval means the result is precise; a wide interval suggests you may need more ratings for a reliable conclusion.
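As a rough illustration of why more ratings narrow the interval, the sketch below computes an approximate 95% confidence interval for a mean. It assumes a normal approximation and a 95% level; neither is confirmed as the method used in the Statistics tab, and with very few ratings a t-based interval would be somewhat wider. The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(ratings, level=0.95):
    """Approximate confidence interval for the mean rating.

    Assumption: normal approximation (z critical value), used here only
    to keep the example dependency-free.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings")
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for 95%
    half_width = z * stdev(ratings) / sqrt(n)    # critical value * standard error
    center = mean(ratings)
    return center - half_width, center + half_width

few = [4, 3, 5, 4, 3]   # hypothetical ratings
many = few * 4          # the same ratings repeated four times

low_few, high_few = confidence_interval(few)
low_many, high_many = confidence_interval(many)

print(f"5 ratings:  {low_few:.2f} to {high_few:.2f}")
print(f"20 ratings: {low_many:.2f} to {high_many:.2f}")
```

Both intervals are centered on the same mean, but the one built from 20 ratings is narrower because the standard error shrinks with the square root of the number of ratings.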

Statistical Significance

For side-by-side (SxS) evaluations, statistical significance tests indicate whether the observed difference between Experience A and Experience B is likely to be real or could plausibly be due to chance. Results are marked as statistically significant when the p-value is below 0.05.
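The article does not say which test is applied, so the following is only a sketch of one possible approach: an exact sign-flip permutation test on paired ratings. It assumes each evaluator rated both experiences, and the function name and data are hypothetical. The p-value is the share of sign arrangements that produce a difference at least as large as the one observed.

```python
from itertools import product

def sign_flip_p_value(diffs):
    """Two-sided exact sign-flip permutation test on paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for
    small samples (roughly n <= 20).
    """
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total

# Hypothetical paired ratings: each position is one evaluator.
ratings_a = [3, 4, 3, 2, 4, 3, 3, 4]
ratings_b = [4, 5, 4, 4, 5, 4, 3, 5]
diffs = [b - a for a, b in zip(ratings_a, ratings_b)]

p_value = sign_flip_p_value(diffs)
is_significant = p_value < 0.05   # threshold stated in this article

print(f"p-value: {p_value:.4f}, significant: {is_significant}")
```

In this made-up data almost every evaluator rated Experience B higher, so very few sign arrangements match the observed difference and the p-value falls below 0.05.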

How Many Ratings?
As a general rule, 5 to 7 ratings per evaluation provide reasonable confidence for most decisions. For high-stakes decisions, aim for 10 or more ratings to narrow the confidence intervals.
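The effect of adding ratings can be sketched numerically. The example below assumes a standard deviation of 1.0 rating point and a 95% normal-approximation interval; both are illustrative assumptions, not values taken from the product.

```python
from math import sqrt
from statistics import NormalDist

def half_width(sd, n, level=0.95):
    """Half-width of an approximate confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sd / sqrt(n)

assumed_sd = 1.0   # hypothetical spread of ratings
widths = {n: half_width(assumed_sd, n) for n in (5, 7, 10, 20)}

for n, width in widths.items():
    print(f"{n:>2} ratings: mean ± {width:.2f}")
```

Under these assumptions the interval tightens from roughly ±0.88 at 5 ratings to roughly ±0.62 at 10. Because the width shrinks with the square root of the count, each additional rating helps a little less than the one before.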

Effect Size

Effect size measures the magnitude of the difference between experiences. Even a statistically significant difference may be small in practice. Look at effect size alongside significance to make informed decisions.
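The article does not name the effect size measure it reports. One common choice is Cohen's d, the difference in means divided by the pooled standard deviation, sketched below with hypothetical ratings. Treat it as an illustration of the idea rather than the exact formula behind the Statistics tab.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d for two groups, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("ratings have no variation; effect size is undefined")
    return (mean(b) - mean(a)) / sqrt(pooled_var)

# Hypothetical ratings for each experience.
ratings_a = [3, 4, 3, 2, 4, 3, 3, 4]
ratings_b = [4, 5, 4, 4, 5, 4, 3, 5]

d = cohens_d(ratings_a, ratings_b)
print(f"Cohen's d: {d:.2f}")
```

By Cohen's conventional rules of thumb, values near 0.2 are small, 0.5 medium, and 0.8 or above large, so a difference can pass a significance test and still sit at the small end of this scale.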
