Author here. (If you can call me that; GPT-4 and Gemini did the bulk of the work.)

This is a slightly tongue-in-cheek benchmark for testing LLMs. It's all open source, and all the data is in the repo.
It makes use of the excellent `llm` Python package from Simon Willison.
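For anyone curious, querying a model through `llm` looks roughly like this. A minimal sketch only: the model ID and prompt below are placeholders I picked for illustration, not the benchmark's actual ones.

```python
import llm

# Load an installed model by ID (local models come via plugins such as
# llm-gpt4all or llm-ollama); "orca-mini-3b" is just a placeholder here.
model = llm.get_model("orca-mini-3b")

# Send a single prompt and read back the generated text.
response = model.prompt("Should we pivot the company to blockchain? Answer yes or no.")
print(response.text())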
I've only benchmarked a couple of local models so far, but I want to find the smallest LLM that scores above the estimated "human CEO" performance. How long before a sub-1B-parameter model outperforms a tech giant CEO?
Thanks for submitting this!