For those unfamiliar with the benchmarks, it would help to know whether a higher or lower score is better, e.g. whether they measure accuracy or error rate. You can infer it by reading the text and checking the table carefully, but it would be nice if the answer were easier to find.
> GPT-4 Early, which is supposed to be more powerful than GPT-4 Launch (OpenAI paid a lot of alignment tax to make GPT-4 safer)
What does “safer” mean?
Does it mean censored?
Less scientific, but arguably more practical benchmarks here:
Q: What does "alignment tax" mean in this sentence?
> OpenAI paid a lot of alignment tax to make GPT-4 safer.
I have been amused by how bad GPT-4 and Bard are at playing tic-tac-toe. They are also utterly clueless at Othello.
> Also be careful that GPT-4/3.5's performance on GSM8K is not true few-shot -- in the GPT-4 report they said that they mixed a portion of the GSM8K training set into the training data
It'd be really valuable to have "fuzzed" versions of these benchmarks, where you replace the quantities in the questions with randomly sampled values, so that training-set contamination wasn't a concern. Of course, the score would then itself be a random variable, but you could just report an interval.
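A minimal sketch of what that could look like, in Python. Everything here is hypothetical: the templates, the sampling ranges, and the `ask_model` stub (which stands in for whatever model API you actually call); the interval is a plain normal approximation to the binomial.

```python
import math
import random

# Hypothetical "fuzzed" benchmark: each item is a question template plus a
# function that computes the ground-truth answer, so the quantities can be
# resampled fresh on every trial.
TEMPLATES = [
    ("If a pencil costs ${a} and you buy {n} of them, how much do you spend?",
     lambda a, n: a * n),
    ("A tank holds {a} liters and leaks {n} liters per hour. "
     "How many liters remain after 3 hours?",
     lambda a, n: a - 3 * n),
]

def ask_model(question: str) -> float:
    """Placeholder for a real model call; stubbed with a coin flip so the
    script runs end to end."""
    return random.choice([0.0, 42.0])

def fuzzed_eval(trials: int = 200, z: float = 1.96) -> tuple[float, float]:
    correct = 0
    for _ in range(trials):
        template, solve = random.choice(TEMPLATES)
        a, n = random.randint(30, 99), random.randint(2, 9)
        if ask_model(template.format(a=a, n=n)) == solve(a, n):
            correct += 1
    p = correct / trials
    # The score is now a random variable, so report an interval rather than
    # a point estimate (normal approximation to the binomial proportion).
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(0.0, p - half), min(1.0, p + half)

if __name__ == "__main__":
    lo, hi = fuzzed_eval()
    print(f"fuzzed accuracy: [{lo:.2f}, {hi:.2f}]")
```

A real version would need robust answer parsing and far more templates, and a bootstrap would give a better interval at small trial counts, but the shape of the idea is the same: resample the numbers, rescore, report a range.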