Understanding Cohen's Kappa in Machine Learning

  • I often see subtle misuses of interrater reliability metrics.

    For example, imagine you're running a Search Relevance task, where search raters label query/result pairs on a 5-point scale: Very Relevant (+2), Slightly Relevant (+1), Okay (0), Slightly Irrelevant (-1), Very Irrelevant (-2).

    Marking "Very Relevant" vs. "Slightly Relevant" isn't a big difference, but "Very Relevant" vs. "Very Irrelevant" is. However, most IRR calculations don't take this kind of ordering into account, so it gets ignored!

    Cohen's kappa is a rather simplistic and flawed metric, but it's a good starting point for understanding interrater reliability metrics; the sketch below makes both points concrete.
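
    To make this concrete, here is a minimal Python sketch. The raters and label data are hypothetical, invented for illustration. It computes plain Cohen's kappa directly from its definition, kappa = (p_o - p_e) / (1 - p_e), cross-checks the result against scikit-learn's `cohen_kappa_score`, and then contrasts it with the quadratic-weighted variant, one standard way to account for the ordering of an ordinal scale.

    ```python
    import numpy as np
    from sklearn.metrics import cohen_kappa_score


    def cohens_kappa(a, b, labels):
        """Plain (unweighted) Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
        a, b = np.asarray(a), np.asarray(b)
        # p_o: observed agreement = fraction of items where the raters match exactly.
        p_o = np.mean(a == b)
        # p_e: agreement expected by chance, from each rater's marginal label frequencies.
        p_a = np.array([np.mean(a == k) for k in labels])
        p_b = np.array([np.mean(b == k) for k in labels])
        p_e = np.sum(p_a * p_b)
        return (p_o - p_e) / (1 - p_e)


    labels = [-2, -1, 0, 1, 2]  # Very Irrelevant ... Very Relevant

    # Hypothetical ratings: rater B disagrees with rater A on the same three items
    # in both scenarios, but the "mild" rater is off by one scale point while the
    # "severe" rater flips to the far end of the scale.
    rater_a        = [ 2, 2,  1, 0, -1, -2, 2,  1, 0, -2]
    rater_b_mild   = [ 1, 2,  2, 0, -1, -2, 2,  0, 0, -2]
    rater_b_severe = [-2, 2, -2, 0, -1, -2, 2, -2, 0, -2]

    # Unweighted kappa only sees "agree vs. disagree", so the two scenarios come
    # out nearly identical even though the severe disagreements are far worse.
    print(cohens_kappa(rater_a, rater_b_mild, labels))    # ~0.615
    print(cohens_kappa(rater_a, rater_b_severe, labels))  # ~0.620
    print(cohen_kappa_score(rater_a, rater_b_mild, labels=labels))    # matches the above
    print(cohen_kappa_score(rater_a, rater_b_severe, labels=labels))

    # Quadratic-weighted kappa penalizes each disagreement by the squared distance
    # between the labels, so it separates the two scenarios sharply (~0.93 vs. ~0.40).
    print(cohen_kappa_score(rater_a, rater_b_mild, labels=labels, weights="quadratic"))
    print(cohen_kappa_score(rater_a, rater_b_severe, labels=labels, weights="quadratic"))
    ```

    Weighted kappa (linear or quadratic) is the usual fix when the label scale is ordinal, as in the search-relevance example above: near-misses are penalized lightly, opposite-end disagreements heavily.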