Ask HN: Any truly multi-modal transformer architectures?

  • What do you mean? We want images and text to live in the same latent space and be represented by similar vectors when the two correlate. How else would you want to do it?
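
To make the "same latent space, similar vectors" idea concrete, here is a rough sketch of the kind of contrastive objective that is often used for this. It is not taken from any particular model: the toy embeddings, the function names `cosine` and `contrastive_loss`, and the temperature of 0.07 are all illustrative assumptions, and in a real system the vectors would come from learned image and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric cross-entropy over pairwise similarities.

    Assumes image_vecs[i] and text_vecs[i] describe the same thing,
    and every other combination is a mismatch.
    """
    n = len(image_vecs)
    # sims[i][j]: scaled similarity between image i and text j
    sims = [[cosine(img, txt) / temperature for txt in text_vecs]
            for img in image_vecs]

    def cross_entropy(rows):
        # The "correct class" for row k is column k (the matching pair).
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[k]
        return total / n

    cols = [list(c) for c in zip(*sims)]  # text-to-image direction
    return 0.5 * (cross_entropy(sims) + cross_entropy(cols))

# Made-up 3-d embeddings standing in for encoder outputs.
image_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
text_vecs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3]]

aligned = contrastive_loss(image_vecs, text_vecs)         # matching pairs line up
shuffled = contrastive_loss(image_vecs, text_vecs[::-1])  # pairs deliberately mismatched
```

With these toy numbers the loss is lower when each image is paired with its own caption than when the captions are swapped, which is the behaviour the objective is meant to reward: correlated image and text end up close together, uncorrelated ones get pushed apart.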