Text-to-4D Dynamic Scene Generation

  • These videos look too much like the things, and the way they move, that I see in dreams. They're blurry-ish and seem to make sense but actually don't. E.g. the running rabbit: its legs are moving, but it isn't. This is almost exactly how I remember dreams. When I see people moving, I can rarely notice their limbs moving accordingly. When I look at my own hands, they might have more than five fingers and very vague, blurry palm lines. When I try to run, walk, or fly, it's just as weird as these videos.

    This reminds me of how the first generation of these kinds of image generators was said to be 'dreaming'. It also makes me wonder: do our brains really work like these algorithms (or are these algos mimicking brains very accurately)?

  • > trained only on Text-Image pairs and unlabeled videos

    This is fascinating. It's able to pick up enough of the fundamentals of 3D motion from 2D videos, while only needing static images with descriptions to infer semantics.

  • Link to the paper: https://arxiv.org/abs/2301.11280 (the dynamic visualisations seem to only work in Chrome?)

  • Getting something that generates multiple angles of the same subject in different typical poses would go a long way. I can get Midjourney to kind of do this by asking for "multiple angles", but it's hit or miss.

  • I've been expecting NeRF + diffusion models for a while, but it looks like there's still a lot of work needed before this gets practical.
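
    For anyone wondering what "NeRF + diffusion" looks like in practice, here's a minimal, hypothetical sketch of a DreamFusion-style score-distillation loop: a frozen text-conditioned diffusion model scores random renders of the NeRF, and its gradient is pushed back into the NeRF's parameters. The toy models and names below are placeholders, not the paper's actual code.

    ```python
    import torch
    import torch.nn as nn

    class TinyNeRF(nn.Module):
        """Toy stand-in for a NeRF: maps a camera-pose vector to a rendered image."""
        def __init__(self, image_size=32):
            super().__init__()
            self.image_size = image_size
            self.net = nn.Sequential(
                nn.Linear(4, 128), nn.ReLU(),
                nn.Linear(128, 3 * image_size * image_size),
            )

        def render(self, pose):
            # A real NeRF would ray-march a volumetric field; this just decodes
            # the pose into pixels so the example stays tiny and runnable.
            return self.net(pose).view(-1, 3, self.image_size, self.image_size)

    def fake_denoiser(noisy_image, t, text_emb):
        # Placeholder for a *frozen* text-to-image diffusion model's noise
        # prediction; a real one would be a pretrained U-Net.
        return torch.randn_like(noisy_image)

    nerf = TinyNeRF()
    opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)
    text_emb = torch.randn(1, 512)  # pretend text embedding of the prompt

    for step in range(100):
        pose = torch.randn(1, 4)              # random camera
        image = nerf.render(pose)             # differentiable render
        t = torch.randint(0, 1000, (1,))      # random diffusion timestep
        noise = torch.randn_like(image)
        noisy = image + noise                 # noise schedule omitted for brevity
        pred = fake_denoiser(noisy, t, text_emb)
        # Score distillation: treat (pred - noise) as the gradient on the render,
        # so the diffusion prior's signal flows only into the NeRF parameters.
        grad = (pred - noise).detach()
        loss = (grad * image).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    ```

    If I'm reading the paper right, it applies the same recipe one dimension up: a dynamic (time-conditioned) NeRF is rendered into short videos and scored by a text-to-video diffusion model instead of a text-to-image one.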

  • I really wish these datasets were more openly accessible. I always want to try replicating these models, but it seems the data is the blocker. Renting the compute needed to create an inferior model doesn't seem to be an issue; it's always the data.

  • Here we go again. The samples look uncannily similar to the early text-to-image stuff we had.

  • Can someone explain what’s 4D about this? Is it 4D because the 3D models are animated (moving)?

  • Another paper, with no code released?

    What's the point then?