Show HN: Open source framework OpenAI uses for Advanced Voice

  • Imagine being able to tell an app to call the IRS during the day, endure the on-hold wait times, ask the IRS rep your question, and log the answer, then deliver it when you get home.

    Or, have the app call a pharmacy every month to refill prescriptions. For some drugs, the pharmacy requires a manual phone call to refill, which gets very annoying.

    So many use cases for this.

  • This is really helpful, thanks!

    OpenAI hired the former fractional CTO of LiveKit, who created Pion, a popular WebRTC library.

    I'd expect OpenAI to migrate off of LiveKit within 6 months. LiveKit is too expensive. Also, WebRTC is hard, and OpenAI, being a less open company now, will want to keep improvements to itself.

    Not affiliated with any competitors, but I did work at a PaaS company similar to LiveKit that used WebSockets instead.

  • Super cool! Didn't realize OpenAI is just using LiveKit.

    Does the pricing break down to be the same as having an OpenAI Advanced Voice socket open the whole time? It's like $9/hr!

    In theory it would be cheaper to not keep the Advanced Voice socket open the whole time and instead hit the GPT-4o streaming API [1] only when inference is needed (pay per token), using LiveKit's other components to do the rest (TTS, VAD, etc.); a rough sketch of that pipeline follows below.

    What's the trade-off here?

    [1]: https://platform.openai.com/docs/api-reference/streaming
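
    A minimal sketch of that pay-per-token pipeline (STT, then a streaming completion, then TTS) using OpenAI's own endpoints. The handle_turn helper is purely illustrative, model names are whatever is current at the time of writing, and VAD/turn-taking plus the LiveKit transport are left out:

      # Sketch: transcribe a finished utterance, stream tokens from GPT-4o,
      # then synthesize speech; pay per token/minute instead of per socket-hour.
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      def handle_turn(audio_path: str) -> bytes:
          # 1) Speech-to-text on the completed utterance
          with open(audio_path, "rb") as f:
              transcript = client.audio.transcriptions.create(
                  model="whisper-1", file=f
              )

          # 2) Streaming completion, billed per token rather than per open socket
          stream = client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user", "content": transcript.text}],
              stream=True,
          )
          reply = "".join(chunk.choices[0].delta.content or "" for chunk in stream)

          # 3) Text-to-speech on the full reply
          speech = client.audio.speech.create(
              model="tts-1", voice="alloy", input=reply
          )
          return speech.content  # raw audio bytes to send back over LiveKit

    The trade-off you lose is latency: the Realtime socket overlaps listening, thinking, and speaking, while this pipeline runs the three stages serially per turn.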

  • That's some crazy marketing for an "our library happened to support this relatively simple use case" situation. Impressive!

    By the way: the Cerebras voice demo also uses LiveKit for this: https://cerebras.vercel.app/

  • Olivier, Michelle, and Romain gave you guys a shoutout like 3 times in our DevDay recap podcast if you need more testimonial quotes :) https://www.latent.space/p/devday-2024

  • Is there anyone besides OpenAI working on a speech-to-speech model? I find it incredibly useful, and it's the sole reason I pay for their service, but I do find it very limited. I'd be interested to know if any other groups are doing research on voice models.

  • I wonder when Azure OpenAI will get this.

  • This suggests that the AI "brain" receives the user input as a text prompt (the agent relays the speech prompt to GPT-4o) and generates audio as output (GPT-4o streams speech packets back to the agent).

    But when I asked Advanced Voice mode, it said the exact opposite: that it receives input as audio and generates text as output.
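
    For what it's worth, the Realtime API's session config (per the beta docs at the time, so field names may change) lets you request audio on both sides, which suggests the model handles speech natively in and out; the text transcript comes from a separate Whisper pass:

      # Per the beta Realtime docs: both input and output can be raw audio;
      # "modalities" picks what the model emits, and the optional input
      # transcript is produced by a separate whisper-1 transcription.
      session_update = {
          "type": "session.update",
          "session": {
              "modalities": ["audio", "text"],  # speech out plus a text transcript
              "input_audio_format": "pcm16",    # 16-bit PCM in
              "output_audio_format": "pcm16",   # 16-bit PCM out
              "input_audio_transcription": {"model": "whisper-1"},
          },
      }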

  • Nice that they have many partners on this. I see Azure as well.

    There seems to be a consensus that the new Realtime API is not actually using the same Advanced Voice model/engine (or however it works), since at least the TTS part doesn't seem to be as capable as the one shipped with the official OpenAI app.

    Any idea on this?

    Source: https://github.com/openai/openai-realtime-api-beta/issues/2

  • So WebRTC helps with the unreliable network between mobile clients and the server side. If the application is backend-only, would it make sense to use WebRTC, or should I go directly to the Realtime API?
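
    For a backend-only app, going direct over a plain WebSocket is reasonable, since WebRTC's loss handling mostly pays off on the flaky last mile to mobile clients. A minimal sketch; the endpoint, headers, and event names follow the beta docs at launch, so treat them as assumptions:

      # Backend-only: connect straight to the Realtime API over a WebSocket.
      import json, os, websocket  # pip install websocket-client

      ws = websocket.create_connection(
          "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01",
          header=[
              f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
              "OpenAI-Beta: realtime=v1",
          ],
      )
      # Ask the model to respond; audio frames would arrive as further events.
      ws.send(json.dumps({
          "type": "response.create",
          "response": {"modalities": ["text"], "instructions": "Say hello."},
      }))
      print(json.loads(ws.recv())["type"])  # first server event, e.g. "session.created"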

  • That was cool, but it got up to $1 of usage real quick.