GPT-4o scores 45% on new history benchmark