Very cool. If I ask it to deduce the gender of my voice, can it do that? Training a projection layer makes sense, but ultimately you'd want to output audio conditioned on the input rather than text. Is there a way to train a reverse projection with some kind of skip connections to take the audio input into account? Or an end-to-end audio model?
Very cool! How is this differentiated from ChatGPT voice?
Very cool!!! I've had this idea for a while. Is the conversational part of the dataset open?
I'm building various prototypes for VR training simulations using Inworld, but they also use the cascaded approach. I am also building a customer service agent product that we would love to add voice to, but Whisper and ElevenLabs (and others) are just too slow. Is tincan available via API?