Look at how AlphaGo started with human data, and then they found a way (AlphaGo Zero) to train it without that. I've been wondering if it might be possible to do a similar thing with LLMs by grounding them on real-world video, having them predict what happens next in the video. I suppose you'd still need some minimal language ability to bootstrap from, but imagine it learning the laws of physics and mathematics from the ground up.
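The objective hinted at here — predict the next frame, score the error, update the model — is just self-supervised learning on video. A toy sketch of that loop, with a hypothetical per-pixel linear predictor and made-up "frames" standing in for a real video model:

```python
# Toy sketch of next-frame prediction as a self-supervised objective.
# Frames here are tiny lists of grayscale pixel values; the "model" is a
# per-pixel linear predictor trained by SGD on mean-squared error.
# All names and data are illustrative, not from any real system.

def train(frames, steps=500, lr=0.05):
    """Fit frame[t+1] ~ w * frame[t] + b, per pixel, by gradient descent."""
    w, b = 1.0, 0.0
    for _ in range(steps):
        for cur, nxt in zip(frames, frames[1:]):
            for c, n in zip(cur, nxt):
                err = (w * c + b) - n      # prediction error on one pixel
                w -= lr * err * c          # gradient of squared error wrt w
                b -= lr * err              # gradient wrt b
    return w, b

# A "video" whose pixels brighten by 0.1 each frame: frame[t+1] = frame[t] + 0.1
video = [[0.1 * t, 0.1 * t + 0.2] for t in range(5)]
w, b = train(video)
next_frame = [w * p + b for p in video[-1]]  # predict the unseen next frame
```

The trained predictor recovers the "physics" of this toy world (each pixel gains 0.1 per frame) purely from watching it happen, with no labels — the same idea, scaled up enormously, is what grounding an LLM on video prediction would amount to.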
This would have been bigger news except for Gemini 1.5, Sora, and the Magic investment all landing at the same time. Gemini 1.5 can do needle-in-a-haystack retrieval reliably across the 3 hours of video they tested against.