V-JEPA: Video Joint Embedding Predictive Architecture

  • This would have been bigger news if Gemini 1.5, Sora, and the Magic investment hadn't all landed at the same time. Gemini 1.5 can do needle-in-a-haystack retrieval reliably across the 3 hours of video they tested it on.

  • Look at how AlphaGo started with human data, and then they found a way to train it without that (AlphaGo Zero). I've been wondering whether something similar might be possible with LLMs: ground them in real-world video by having them predict what happens in the video. You'd presumably still need some minimal language ability to bootstrap from, but imagine a model learning the laws of physics and mathematics from the ground up.
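
Worth noting what "predict what happens in the video" means mechanically in V-JEPA: rather than predicting pixels, it predicts the latent features of masked spatiotemporal patches, with targets coming from an exponential-moving-average copy of the encoder and an L1 loss. Below is a minimal sketch of that objective, assuming the setup described in the paper; the module names, dimensions, and the toy MLP encoders are mine (the real encoder and predictor are vision transformers).

```python
# Rough sketch of a JEPA-style video objective: a context encoder sees a
# masked clip, a predictor guesses the latent features of the hidden patches,
# and targets come from an EMA copy of the encoder. Names are illustrative.
import copy
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Stand-in for the ViT encoder: maps patch tokens to latent features."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):  # tokens: (batch, n_patches, dim)
        return self.net(tokens)

def jepa_loss(encoder, target_encoder, predictor, tokens, mask):
    """L1 loss between predicted and EMA-target latents on masked patches."""
    with torch.no_grad():  # targets: full clip through the frozen EMA encoder
        targets = target_encoder(tokens)
    context = encoder(tokens * (~mask).unsqueeze(-1))  # zero out masked patches
    preds = predictor(context)                         # predict latents everywhere
    return (preds - targets)[mask].abs().mean()        # score only masked ones

@torch.no_grad()
def ema_update(encoder, target_encoder, momentum=0.999):
    """Target encoder trails the trained encoder, as in V-JEPA."""
    for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
        tp.mul_(momentum).add_(p, alpha=1 - momentum)

dim = 256
encoder = PatchEncoder(dim)
target_encoder = copy.deepcopy(encoder)
predictor = PatchEncoder(dim)  # stand-in for the paper's narrow predictor ViT
opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

tokens = torch.randn(2, 64, dim)  # fake batch: 64 patch tokens per clip
mask = torch.rand(2, 64) < 0.5    # spatiotemporal mask (random here)
loss = jepa_loss(encoder, target_encoder, predictor, tokens, mask)
loss.backward()
opt.step()
ema_update(encoder, target_encoder)
```

The appeal of predicting in latent space rather than pixel space is that the model isn't forced to spend capacity modeling unpredictable pixel-level detail, which is part of why this looks like a plausible grounding signal.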