Understanding Cohen's Kappa in Machine Learning

  • I often see subtle misuses of interrater reliability metrics.

    For example, imagine you're running a Search Relevance task, where search raters label query/result pairs on a 5-point scale: Very Relevant (+2), Slightly Relevant (+1), Okay (0), Slightly Irrelevant (-1), Very Irrelevant (-2).

    Marking "Very Relevant" vs. "Slightly Relevant" isn't a big difference, but "Very Relevant" vs. "Very Irrelevant" is. However, most IRR calculations don't take this kind of ordering into account, so it gets ignored!

    Cohen's kappa is a rather simplistic and flawed metric, but it's a good starting point for understanding interrater reliability metrics; the sketch below makes both points concrete.
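
    To make this concrete, here is a minimal Python sketch. The raters and label data are hypothetical, invented for illustration. It computes plain Cohen's kappa directly from its definition, kappa = (p_o - p_e) / (1 - p_e), cross-checks the result against scikit-learn's `cohen_kappa_score`, and then contrasts it with the quadratic-weighted variant, one standard way to account for the ordering of an ordinal scale.

    ```python
    import numpy as np
    from sklearn.metrics import cohen_kappa_score


    def cohens_kappa(a, b, labels):
        """Plain (unweighted) Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
        a, b = np.asarray(a), np.asarray(b)
        # p_o: observed agreement = fraction of items where the raters match exactly.
        p_o = np.mean(a == b)
        # p_e: agreement expected by chance, from each rater's marginal label frequencies.
        p_a = np.array([np.mean(a == k) for k in labels])
        p_b = np.array([np.mean(b == k) for k in labels])
        p_e = np.sum(p_a * p_b)
        return (p_o - p_e) / (1 - p_e)


    labels = [-2, -1, 0, 1, 2]  # Very Irrelevant ... Very Relevant

    # Hypothetical ratings: rater B disagrees with rater A on the same three items
    # in both scenarios, but the "mild" rater is off by one scale point while the
    # "severe" rater flips to the far end of the scale.
    rater_a        = [ 2, 2,  1, 0, -1, -2, 2,  1, 0, -2]
    rater_b_mild   = [ 1, 2,  2, 0, -1, -2, 2,  0, 0, -2]
    rater_b_severe = [-2, 2, -2, 0, -1, -2, 2, -2, 0, -2]

    # Unweighted kappa only sees "agree vs. disagree", so the two scenarios come
    # out nearly identical even though the severe disagreements are far worse.
    print(cohens_kappa(rater_a, rater_b_mild, labels))    # ~0.615
    print(cohens_kappa(rater_a, rater_b_severe, labels))  # ~0.620
    print(cohen_kappa_score(rater_a, rater_b_mild, labels=labels))    # matches the above
    print(cohen_kappa_score(rater_a, rater_b_severe, labels=labels))

    # Quadratic-weighted kappa penalizes each disagreement by the squared distance
    # between the labels, so it separates the two scenarios sharply (~0.93 vs. ~0.40).
    print(cohen_kappa_score(rater_a, rater_b_mild, labels=labels, weights="quadratic"))
    print(cohen_kappa_score(rater_a, rater_b_severe, labels=labels, weights="quadratic"))
    ```

    Weighted kappa (linear or quadratic) is the usual fix when the label scale is ordinal, as in the search-relevance example above: near-misses are penalized lightly, opposite-end disagreements heavily.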