Realtime API
Note
Before proceeding, you should be familiar with the OpenAI Realtime API and the relevant OpenAI API reference.
Warning
Real-time performance can only be achieved when using CUDA for TTS and STT inference and an LLM provider with a high TPS (tokens per second) and low TTFT (time to first token).
Demo
(Excuse the breathing; I didn't have time to record a better demo.)
Prerequisites
Follow the prerequisites in the voice chat guide.
Architecture
TODO
Limitations
- You'll want to use a dedicated microphone so that speech produced by the TTS model is not picked up. Otherwise, the VAD and STT models will pick up the TTS audio and transcribe it, creating a feedback loop.
- "response.cancel" and "conversation.item.truncate" client events are not supported. Interruption handling needs to be flushed out.
- "conversation.item.create" with
content
field containinginput_audio
message is not supported
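
Until those gaps are closed, a session can still be driven over the supported text path. The following is a minimal sketch, assuming the server exposes the realtime WebSocket at `ws://localhost:8000/v1/realtime` (adjust the URL to your deployment) and that the `websockets` Python package is installed; it sends a "conversation.item.create" event with "input_text" content followed by "response.create", avoiding the unsupported "input_audio" content and the "response.cancel" / "conversation.item.truncate" events.

```python
import asyncio
import json

import websockets

# Assumed local endpoint; change to match your deployment.
REALTIME_URL = "ws://localhost:8000/v1/realtime"


async def main() -> None:
    async with websockets.connect(REALTIME_URL) as ws:
        # Create a user message with text content. "input_audio" content
        # is not supported (see the limitations above), so text is used.
        await ws.send(
            json.dumps(
                {
                    "type": "conversation.item.create",
                    "item": {
                        "type": "message",
                        "role": "user",
                        "content": [{"type": "input_text", "text": "Hello!"}],
                    },
                }
            )
        )
        # Ask for a response, then print server events until it completes.
        await ws.send(json.dumps({"type": "response.create"}))
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])
            if event["type"] in ("response.done", "error"):
                break


asyncio.run(main())
```

Using raw WebSocket frames keeps the example dependency-light; any client that speaks the OpenAI Realtime event protocol should behave the same way against this endpoint.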
Next Steps
- Address the aforementioned limitations
- Image support
- Speech-to-speech model support
- Performance tuning / optimizations
- Realtime console improvements