This is great but we really need an audio-to-audio model like they demoed in the open source world. Does anyone know of anything like that?
Edit: someone found one: https://news.ycombinator.com/item?id=40346992
Siri came out in October 2011. Amazon Alexa made its debut in November 2014. Google Assistant's voice-activated speakers were released in May 2016.
From what I can tell, Siri is still a dumpster fire that nobody is willing to use. And I have no personal experience with Alexa, so I can't speak to it. But I do have a few Google Home speakers and an Android phone, and I have seen no major improvements in years. In fact, it has gotten worse - for example, you can no longer add items directly to AnyList[0], only Google Keep.
Or, as an incredibly simple example of something I thought we'd get a long time ago, it's still unable to interpret two-part requests, e.g. "please repeat that but louder," or "please turn off the kitchen and dining room lights."
I find voice assistants very useful - especially when driving, lying in bed, cooking, or when I'm otherwise preoccupied. Yet they have stagnated almost since their debut. I can only imagine nobody has found a viable way to monetize them.
What will it take to get a better voice assistant for consumers? Willow[1] doesn't seem to have taken off.
[0] https://help.anylist.com/articles/google-assistant-overview/
edit: I realize I hijacked your thread to dump something that's been on my mind lately. Pipecat looks really cool, and I hope it takes off! I hope to get some time to experiment this weekend.
Just made https://feycher.com, which is similar but has realtime lip syncing as well. Let me know if you're interested and we can chat.
We're also building Bolna, an open-source voice orchestration framework: https://github.com/bolna-ai/bolna
LiveKit Agents, which OpenAI uses in voice mode, is also open source:
The whole VAD thing is very interesting; keen to learn more about how it works, especially with multiple speakers!
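At its simplest, VAD is just frame-level speech/no-speech classification over short PCM frames. Here's a minimal sketch using the webrtcvad package, purely to illustrate the general idea (not how Pipecat implements it internally):

    # Frame-level voice activity detection with webrtcvad (illustrative only).
    import webrtcvad

    vad = webrtcvad.Vad(2)          # aggressiveness: 0 (lenient) to 3 (strict)
    SAMPLE_RATE = 16000             # 16 kHz, 16-bit mono PCM
    FRAME_MS = 30                   # webrtcvad accepts 10/20/30 ms frames
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # bytes per frame

    def speech_frames(pcm: bytes):
        """Yield (byte_offset, is_speech) for each 30 ms frame of raw PCM."""
        for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
            yield off, vad.is_speech(pcm[off:off + FRAME_BYTES], SAMPLE_RATE)

Note that plain VAD only tells you "someone is speaking", not who; handling multiple speakers needs diarization (or separate channels per speaker) layered on top of this.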
Very cool, great work! I can definitely see myself using this when I start building in that direction.
How would I go about using this to live translate phone calls?
I wonder how the just-announced GPT-4o with real-time voice impacts projects like this?
The demo of real-time multi-language translation in conversation blew me away!
Nice to see an open source implementation. I have been seeing many startups get into this space, like https://www.retellai.com/, https://fixie.ai/, etc. They always end up wanting speech-to-speech models (the current approach seems to be speech -> text -> text -> speech, with multiple agents handling one listening + one speaking). Excited to see how this plays with the recently announced GPT-4o.
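For reference, a minimal sketch of that cascaded speech -> text -> text -> speech flow using OpenAI's hosted models (model names are just examples, and this is a batch version, so it skips the streaming/VAD/turn-taking parts that make real-time agents hard):

    # Cascaded pipeline: STT (Whisper) -> LLM -> TTS, via the OpenAI Python SDK.
    # Illustrative only; real-time agents stream each stage instead of batching.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def handle_turn(in_wav: str, out_mp3: str) -> None:
        # 1. Speech -> text
        with open(in_wav, "rb") as f:
            transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
        # 2. Text -> text (the "agent" step)
        chat = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": transcript.text}],
        )
        reply = chat.choices[0].message.content
        # 3. Text -> speech
        speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
        speech.stream_to_file(out_mp3)

A true speech-to-speech model would collapse all three stages into one, which is what makes the GPT-4o voice demos interesting: no transcription step in the middle to lose tone, timing, and prosody.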