V-JEPA: Video Joint Embedding Predictive Architecture

  • This would have been bigger news if Gemini 1.5, Sora, and the Magic investment hadn't all landed at the same time. Gemini 1.5 can do needle-in-a-haystack retrieval reliably across the 3 hours of video they tested it on.

  • Look at how AlphaGo started with human data, and then they found a way to train it without that (AlphaGo Zero). I've been wondering whether something similar might be possible with LLMs: ground them in real-world video by having them predict what happens in the video. You'd presumably still need some minimal language ability to bootstrap from, but imagine a model learning the laws of physics and mathematics from the ground up.
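
Worth noting what "predict what happens in the video" means mechanically in V-JEPA: rather than predicting pixels, it predicts the latent features of masked spatiotemporal patches, with targets coming from an exponential-moving-average copy of the encoder and an L1 loss. Below is a minimal sketch of that objective, assuming the setup described in the paper; the module names, dimensions, and the toy MLP encoders are mine (the real encoder and predictor are vision transformers).

```python
# Rough sketch of a JEPA-style video objective: a context encoder sees a
# masked clip, a predictor guesses the latent features of the hidden patches,
# and targets come from an EMA copy of the encoder. Names are illustrative.
import copy
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Stand-in for the ViT encoder: maps patch tokens to latent features."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):  # tokens: (batch, n_patches, dim)
        return self.net(tokens)

def jepa_loss(encoder, target_encoder, predictor, tokens, mask):
    """L1 loss between predicted and EMA-target latents on masked patches."""
    with torch.no_grad():  # targets: full clip through the frozen EMA encoder
        targets = target_encoder(tokens)
    context = encoder(tokens * (~mask).unsqueeze(-1))  # zero out masked patches
    preds = predictor(context)                         # predict latents everywhere
    return (preds - targets)[mask].abs().mean()        # score only masked ones

@torch.no_grad()
def ema_update(encoder, target_encoder, momentum=0.999):
    """Target encoder trails the trained encoder, as in V-JEPA."""
    for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
        tp.mul_(momentum).add_(p, alpha=1 - momentum)

dim = 256
encoder = PatchEncoder(dim)
target_encoder = copy.deepcopy(encoder)
predictor = PatchEncoder(dim)  # stand-in for the paper's narrow predictor ViT
opt = torch.optim.AdamW(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

tokens = torch.randn(2, 64, dim)  # fake batch: 64 patch tokens per clip
mask = torch.rand(2, 64) < 0.5    # spatiotemporal mask (random here)
loss = jepa_loss(encoder, target_encoder, predictor, tokens, mask)
loss.backward()
opt.step()
ema_update(encoder, target_encoder)
```

The appeal of predicting in latent space rather than pixel space is that the model isn't forced to spend capacity modeling unpredictable pixel-level detail, which is part of why this looks like a plausible grounding signal.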