What do you mean? We want images and text to live in the same latent space, and be represented by similar vectors if the two correlate. How else would you want to do it?
What do you mean? We want images and text to live in the same latent space, and be represented by similar vectors if the two correlate. How else would you want to do it?