Hacker News

Ask HN: Any truly multi-modal transformer architectures?

by prats226 on 3/4/2025, 12:25:20 AM with 1 comment

by kadushka on 3/4/2025, 2:47:32 AM
What do you mean? We want images and text to live in the same latent space, and be represented by similar vectors if the two correlate. How else would you want to do it?
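
To make the shared-latent-space idea concrete, here is a rough sketch, not taken from any particular model. It assumes each modality has its own encoder output (here just toy feature lists) and a linear projection into a common 2-d space. The weight matrices `W_IMAGE` and `W_TEXT` and all feature values are hypothetical and hand-picked for illustration; a real model would learn the projections during training.

```python
import math

def project(features, weights):
    """Apply a linear map; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical projection weights (assumed, not learned here).
W_IMAGE = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0]]  # 4-d image features -> 2-d shared space
W_TEXT = [[0.0, 1.0, 0.0],
          [1.0, 0.0, 0.0]]        # 3-d text features -> 2-d shared space

# Toy encoder outputs for one image and two candidate captions.
image_of_dog = [0.9, 0.1, 0.3, 0.2]
caption_dog = [0.1, 0.9, 0.5]
caption_car = [0.9, 0.1, 0.5]

# Both modalities end up as vectors in the same 2-d space.
img = project(image_of_dog, W_IMAGE)
txt_dog = project(caption_dog, W_TEXT)
txt_car = project(caption_car, W_TEXT)

# The matching caption should score higher than the mismatched one.
sim_match = cosine(img, txt_dog)
sim_mismatch = cosine(img, txt_car)
print(sim_match > sim_mismatch)
```

The point is only that once both modalities are projected into one space, "the two correlate" reduces to a vector similarity check, whatever encoders sit upstream.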