Interesting that they built their own benchmark and that it primarily features images from vehicles. Tesla overlap?
> ... we are introducing a new benchmark, RealWorldQA. This benchmark is designed to evaluate basic real-world spatial understanding capabilities of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models.
> The initial release of the RealWorldQA consists of over 700 images, with a question and easily verifiable answer for each image. The dataset consists of anonymized images taken from vehicles, in addition to other real-world images ... RealWorldQA is released under CC BY-ND 4.0.
It will be interesting to see the feedback once someone has a chance to look into the dataset (https://data.x.ai/realworldqa.zip).
Side note — I'm very impressed with their "Explaining a meme" example.
If those comparisons to other models are remotely accurate, that is a pretty impressive feat to catch up that quickly.
I think a lot of people are sleeping on xAI for two reasons: Twitter and Tesla FSD data.
I've seen numerous talks recently by AI leaders discussing how the next level of LLMs will have to "understand the physical world", an area where Tesla FSD is a leader (see Elon's response to Sora - https://www.news18.com/tech/elon-musk-claims-tesla-has-been-...). On top of that, Twitter being the "town square" of the internet gives them an immense advantage. I wouldn't be surprised if the Twitter API were cut off completely one day.