Voice Chat

Note

Before proceeding, you should be familiar with the OpenAI Audio Generation Guide. The guide explains how the API works and provides usage examples. Unless stated otherwise in the limitations, any feature supported by OpenAI should be supported by this project as well.

Prerequisites

Follow the prerequisites in the Text-to-Speech guide, then set the following environment variables:

  • CHAT_COMPLETION_BASE_URL to the base URL of an OpenAI API compatible endpoint | Config
  • CHAT_COMPLETION_MODEL to the name of the model you'd like to use. | Config
  • CHAT_COMPLETION_API_KEY if the API you are using requires authentication | Config

Ollama example:

export CHAT_COMPLETION_BASE_URL=http://localhost:11434/v1
export CHAT_COMPLETION_MODEL=llama3.2:1b

OpenAI example:

export CHAT_COMPLETION_BASE_URL=https://api.openai.com/v1
export CHAT_COMPLETION_MODEL=gpt-4o-mini
export CHAT_COMPLETION_API_KEY=sk-xxx
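
With the variables set and speaches running, clients talk to speaches itself through the standard OpenAI SDK. A minimal setup sketch, assuming speaches listens on localhost:8000 (the port and the placeholder API key are assumptions; adjust them for your deployment):

from openai import OpenAI

# Point the SDK at the speaches server, not at the upstream chat API.
# Assumption: speaches is running locally on port 8000.
openai_client = OpenAI(base_url="http://localhost:8000/v1", api_key="cant-be-empty")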

How does it work?

speaches acts as a proxy between the client and an OpenAI compatible API. It receives the user's input audio message, transcribes it, and replaces the input audio with a transcription that a regular LLM can understand. The request is then proxied to the configured endpoint, the response text is synthesized into audio, and the result is returned to the client. Below is the flow for non-streaming text + audio in → text + audio out (a client-side sketch follows the steps):

  • Receive the messages containing audio inputs
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this recording?"
      },
      {
        "type": "input_audio",
        "input_audio": {
          "data": "<bytes omitted>",
          "format": "wav"
        }
      }
    ]
  }
]
  • Transcribe the audio (POST /v1/audio/transcriptions) and replace the input audio with the transcription
[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What is in this recording?"
      },
      {
        "type": "text",
        "text": "Hello World!"
      }
    ]
  }
]
  • Proxy the request to the OpenAI compatible API specified in the environment variables and receive the response
[
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "The recording says Hello World!."
      }
    ]
  }
]
  • Synthesize the response text into audio (POST /v1/audio/speech) and replace the text with the audio. The audio is then returned to the client
[
  {
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "refusal": null,
      "audio": {
        "id": "audio_abc123",
        "expires_at": 1729018505,
        "data": "<bytes omitted>",
        "transcript": "The recording says Hello World!."
      }
    },
    "finish_reason": "stop"
  }
]
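
From the client's perspective, the whole flow is a regular OpenAI audio chat completion. A sketch, assuming the openai_client defined in the Prerequisites section (the prompt and file name are illustrative):

import base64

# Send a question and save the spoken reply.
# With modalities=["text", "audio"], the reply audio arrives
# base64-encoded in message.audio.data alongside its transcript.
completion = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    stream=False,
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
audio = completion.choices[0].message.audio
print(audio.transcript)  # the text that was synthesized
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(audio.data))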

Customization

The chat completion endpoint exposes the following additional parameters for customization:

  • transcription_model: The model to use for transcribing the input audio.
  • speech_model: The model to use for generating the output audio.

When using OpenAI's Python SDK, these parameters can be set via the extra_body argument. For example:

openai_client.chat.completions.create(
    model="gpt-4o-mini",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    stream=False,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this recording?"},
                {"type": "input_audio", "input_audio": {"data": "<bytes omitted>", "format": "wav"}},
            ],
        },
    ],
    extra_body={"transcription_model": "Systran/faster-whisper-tiny.en", "speech_model": "hexgrad/Kokoro-82M"},
)
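
The <bytes omitted> placeholder above stands for base64-encoded audio. A small sketch for producing an input_audio content part from a local file (the file name is illustrative):

import base64

# Base64-encode a local WAV file for use as an input_audio content part.
with open("question.wav", "rb") as f:
    encoded_audio = base64.b64encode(f.read()).decode("utf-8")

audio_part = {"type": "input_audio", "input_audio": {"data": encoded_audio, "format": "wav"}}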

Limitations

  • The user's input audio messages are not cached, so they are transcribed every time they are sent. This can be a performance issue in long multi-turn conversations.
  • Multiple choices (n > 1) are not supported.

This feature utilizes the Text-to-Speech and Speech-to-Text features. Therefore, the limitations of those features apply here as well.