This is really helpful, thanks!
OpenAI hired LiveKit's ex-fractional CTO, who created Pion, a popular WebRTC library.
I'd expect OpenAI to migrate off of LiveKit within six months. LiveKit is too expensive. Also, WebRTC is hard, and OpenAI, now a less open company, will want to keep improvements to itself.
Not affiliated with any competitors, but I did work at a PaaS company similar to LiveKit that used WebSockets instead.
Super cool! Didn't realize OpenAI is just using LiveKit.
Does the pricing break down to be the same as having an OpenAI Advanced Voice socket open the whole time? It's like $9/hr!
In theory it would be cheaper not to keep the Advanced Voice socket open the whole time: use the GPT-4o streaming API [1] only when inference is needed (pay per token) and use LiveKit's other components for the rest (TTS, VAD, etc.); see the sketch after the link below.
What's the trade-off here?
[1]: https://platform.openai.com/docs/api-reference/streaming
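A minimal sketch of what that pay-per-token path could look like, assuming hypothetical VAD/STT/TTS glue around it (e.g. LiveKit's agent plugins); the function and its wiring are illustrative, not a confirmed setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(transcript: str) -> str:
    """Stream a text reply for one user turn, billed per token only."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": transcript}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        # Each chunk carries an incremental text delta; the final
        # chunk's delta is empty.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

# Hypothetical glue: VAD + STT would turn mic audio into `transcript`,
# and a separate TTS step would speak the returned text.
```

I'd guess the trade-off is latency and expressiveness: you add an STT hop in and a TTS hop out on every turn, and you lose the speech-native prosody the Advanced Voice model gets from generating audio directly.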
That's some crazy marketing for an "our library happened to support this relatively simple use case" situation. Impressive!
By the way, the Cerebras voice demo also uses LiveKit for this: https://cerebras.vercel.app/
Olivier, Michelle, and Romain gave you guys a shoutout like 3 times in our DevDay recap podcast if you need more testimonial quotes :) https://www.latent.space/p/devday-2024
Is there anyone besides OpenAI working on a speech-to-speech model? I find it incredibly useful, and it's the sole reason I pay for their service, but I do find it very limited. I'd be interested to know if any other groups are doing research on voice models.
I wonder when Azure OpenAI will get this.
This suggests that the AI "brain" receives the user input as a text prompt (the agent relays the speech prompt to GPT-4o) and generates audio as output (GPT-4o streams speech packets back to the agent).
But when I asked Advanced Voice mode, it said the exact opposite: that it receives input as audio and generates text as output.
Nice that they have many partners on this. I see Azure as well.
There's a consensus that the new Realtime API is not actually using the same Advanced Voice model/engine (however it works under the hood), since at least the TTS part doesn't seem to be as capable as the one shipped with the official OpenAI app.
Any idea on this?
Source: https://github.com/openai/openai-realtime-api-beta/issues/2
So WebRTC helps with the unreliable network between mobile clients and the server side. If the application is backend-only, would it make sense to use WebRTC, or should I go directly to the Realtime API?
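For the backend-only case, here's a minimal sketch of connecting straight to the Realtime API over a plain WebSocket (no WebRTC), using the `websocket-client` package; the model name and event types follow the beta docs and may change:

```python
import json
import os
import websocket  # pip install websocket-client

# Direct server-to-server connection; WebRTC's jitter and packet-loss
# handling matters far less between two datacenters.
ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta: realtime=v1",
    ],
)

# Ask for a response; audio arrives as base64 chunks in
# `response.audio.delta` events until `response.done`.
ws.send(json.dumps({"type": "response.create"}))
while True:
    event = json.loads(ws.recv())
    print(event["type"])
    if event["type"] == "response.done":
        break
ws.close()
```

If both ends sit in a datacenter, a plain WebSocket is usually enough; WebRTC earns its complexity on lossy last-mile client networks.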
That was cool, but it got up to $1 of usage real quick.
Imagine being able to tell an app to call the IRS during the day, endure the on-hold wait times, then ask your question to the IRS rep and log the answer, delivering it when you get home.
Or have the app call a pharmacy every month to refill prescriptions. For some drugs, the pharmacy requires a manual phone call to refill, which gets very annoying.
So many use cases for this.