Gemini Ultra is out. Does it beat GPT4? (~10k words of tests/observations)

  • (They patched Gemini an hour after I finished writing this. My complaints about excessive refusals and model deception may no longer apply.)

    Witness:

    - A chess match between Ultra and GPT4 (the first one ever, as far as I'm aware)
    - A Gemini vs GPT4 rap battle
    - Tests of general knowledge, recall, abstract reasoning, and code generation
    - Head-to-head contests of poetry and prose, plus style imitations of famous authors/bloggers

    I also investigate VERY IMPORTANT things such as:

    - which model can create a more realistic ASCII cat?
    - which model is better at stacking eggs?
    - which model plays Wordle better?
    - which model SIMULATES Wordle better (with me playing)?

    Obviously, a lot of my tests are a bit silly. We already know Ultra's benchmark scores; I'm trying to probe the gaps BETWEEN benchmarks and figure out what the models are like "on the ground".

    Conventional wisdom holds that Ultra is another GPT4: this was not my experience. Switching from GPT4 to Ultra feels like switching character classes in an RPG; they are quite different, with distinct strengths and weaknesses.