He's suggesting deterministic scores instead of LLM evaluations because the latter vary from run to run. Isn't this just trading accuracy for precision, i.e. a bias-variance tradeoff? The LLM evaluations can be "better" in the sense that they are, in principle, measuring the thing you actually care about, but they are more variable because of, for lack of a better word, subjectivity. They have a higher spread, but their average is closer to ground truth. The deterministic measures capture some proxy: they are repeatable but generally rough approximations. Which one to prefer depends on what you want.
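To make the accuracy-vs-precision point concrete, here is a toy simulation. It's only a sketch: `TRUE_QUALITY`, `PROXY_OFFSET`, and `JUDGE_NOISE` are made-up numbers, and it assumes the LLM judge's noise is zero-mean and independent between calls, which a real judge may not satisfy.

```python
import random
import statistics

# All three constants are hypothetical, chosen only for illustration.
TRUE_QUALITY = 0.7   # ground-truth score of one output
PROXY_OFFSET = -0.2  # assumed systematic error of the deterministic proxy
JUDGE_NOISE = 0.3    # assumed standard deviation of a single LLM-judge call


def deterministic_score() -> float:
    """Repeatable proxy: same value on every call, but systematically off."""
    return TRUE_QUALITY + PROXY_OFFSET


def llm_judge_score(rng: random.Random) -> float:
    """Noisy judge: centred on the true quality, but varies per call."""
    return rng.gauss(TRUE_QUALITY, JUDGE_NOISE)


def summarize(scores):
    """Return (bias, spread) of the scores relative to TRUE_QUALITY."""
    bias = statistics.fmean(scores) - TRUE_QUALITY
    spread = statistics.pstdev(scores)
    return bias, spread


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 2000
proxy_bias, proxy_spread = summarize([deterministic_score() for _ in range(n)])
judge_bias, judge_spread = summarize([llm_judge_score(rng) for _ in range(n)])

print(f"proxy: bias={proxy_bias:+.3f} spread={proxy_spread:.3f}")
print(f"judge: bias={judge_bias:+.3f} spread={judge_spread:.3f}")
```

Under these assumptions the proxy shows zero spread but a fixed bias that no amount of re-running removes, while the judge shows a large spread but a bias near zero. If the noise really is independent, averaging k judge calls shrinks the spread by roughly a factor of sqrt(k), so the practical question becomes whether you can afford the repeated calls.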