Skip to content

Realtime API

Note

Before proceeding, you should be familiar with the OpenAI Realtime API and the relevant OpenAI API reference

Warning

Real-time performance can only be achieved when using CUDA for TTS and STT inference and an LLM provider with a high TPS (tokens per second) rate and low TTFT (time to first token).

Demo

(Excuse the breathing lol. Didn't have enough time to record a better demo)

Prerequisites

Follow the prerequisites in the voice chat guide.

Architecture

TODO

Limitations

  • You'll want to be using a dedicated microphone to ensure speech produced by the TTS model is not picked up. Otherwise, the VAD and STT model will pick up the TTS audio and transcribe it, resulting in a feedback loop.
  • "response.cancel" and "conversation.item.truncate" client events are not supported. Interruption handling needs to be flushed out.
  • "conversation.item.create" with content field containing input_audio message is not supported

Next Steps

  • Address the aforementioned limitations
  • Image support
  • Speech-to-speech model support
  • Performance tuning / optimizations
  • Realtime console improvements