
Audio (Speech & Transcription)

TIE provides OpenAI-compatible audio endpoints for speech-to-text (transcription) and text-to-speech (TTS). These are drop-in replacements for OpenAI's /v1/audio/transcriptions and /v1/audio/speech — any client that works with OpenAI's audio API works with TIE by changing the base URL.

All requests require a Bearer token from TIE Auth.

Transcription

Convert audio to text. TIE proxies the request to OpenAI Whisper and returns the transcript.

POST /v1/audio/transcriptions

Send as multipart/form-data:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | required | Audio file (mp3, mp4, mpeg, mpga, m4a, wav, webm) |
| model | string | whisper-1 | Transcription model |
| language | string | null | ISO 639-1 language code (e.g. en). Improves accuracy if known. |
| prompt | string | null | Guide the model's style or continue a previous segment |
| response_format | string | json | Output format: json, text, srt, verbose_json, vtt |
| temperature | float | 0.0 | Sampling temperature (0–1). Lower is more deterministic. |
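Before uploading, a client can check a filename against the supported formats listed above and fail fast instead of waiting for the server to reject the file. A minimal sketch (the helper name and approach are illustrative, not part of TIE):

```typescript
// Formats accepted by the transcription endpoint, per the table above.
const SUPPORTED_AUDIO_EXTENSIONS = new Set([
  "mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm",
]);

// Check the file extension (case-insensitively) before uploading.
function isSupportedAudioFile(filename: string): boolean {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  return SUPPORTED_AUDIO_EXTENSIONS.has(ext);
}
```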
curl -X POST https://your-tie-host/v1/audio/transcriptions \
  -H "Authorization: Bearer $TOKEN" \
  -F file=@recording.webm \
  -F model=whisper-1

Default (json format):

{
  "text": "Hello, I'd like to log my breakfast. I had oatmeal with blueberries."
}

With response_format=verbose_json, the response includes timestamps and segment-level detail.
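If you want json output plus your own rendering, the segment timestamps in verbose_json can be turned into SRT cues client-side (the API can also return srt directly via response_format=srt). A sketch, assuming the segment fields (start and end in seconds, plus text) documented for OpenAI's Whisper API:

```typescript
// Shape of one segment in a verbose_json transcription response
// (fields per OpenAI's Whisper API, trimmed to what we use here).
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor(ms / 60_000) % 60;
  const s = Math.floor(ms / 1000) % 60;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Build numbered SRT cues from verbose_json segments.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```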

The Vercel AI SDK supports transcription via experimental_transcribe. Point the OpenAI provider at your TIE instance:

import { createOpenAI } from "@ai-sdk/openai";
import { experimental_transcribe as transcribe } from "ai";

const tie = createOpenAI({
  baseURL: "https://your-tie-host/v1",
  apiKey: "your-bearer-token",
});

const result = await transcribe({
  model: tie.transcription("whisper-1"),
  audio: audioBuffer, // Uint8Array, Buffer, base64 string, or URL
});

console.log(result.text);
// "Hello, I'd like to log my breakfast."

Text-to-Speech

Convert text to spoken audio. TIE proxies to OpenAI's TTS models and streams the audio bytes back.

POST /v1/audio/speech

Send as JSON:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | gpt-4o-mini-tts | TTS model (gpt-4o-mini-tts, tts-1, tts-1-hd) |
| input | string | required | Text to convert to speech (max 4096 characters) |
| voice | string | alloy | Voice: alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse |
| response_format | string | mp3 | Audio format: mp3, opus, aac, flac, wav, pcm |
| speed | float | 1.0 | Speed multiplier (0.25–4.0) |
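The 4096-character cap on input means longer text has to be split across multiple requests. A minimal client-side sketch that prefers sentence boundaries (the function is hypothetical, not part of TIE):

```typescript
// Split text into chunks that respect the 4096-character `input` limit,
// breaking at sentence boundaries where possible. Each chunk becomes one
// /v1/audio/speech request.
function chunkForSpeech(text: string, maxLen = 4096): string[] {
  // Naive sentence split: runs of non-terminators, then terminators, then space.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxLen && current) {
      chunks.push(current.trim());
      current = "";
    }
    // A single sentence longer than maxLen is split hard at maxLen.
    let rest = sentence;
    while (rest.length > maxLen) {
      chunks.push(rest.slice(0, maxLen));
      rest = rest.slice(maxLen);
    }
    current += rest;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Play the resulting audio segments in order to reconstruct the full speech.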
curl -X POST https://your-tie-host/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Great job logging your meals today! You hit your protein target.",
    "voice": "nova"
  }' \
  --output response.mp3

Raw audio bytes in the requested format. The Content-Type header reflects the format (e.g. audio/mpeg for mp3).
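A client that handles several formats may want a format-to-MIME lookup. Only audio/mpeg for mp3 is stated above; the other entries in this sketch are assumptions based on common MIME registrations, so verify them against your TIE instance:

```typescript
// Assumed response_format -> Content-Type mapping. mp3 -> audio/mpeg is
// documented; the rest follow common MIME registrations and should be
// checked against the actual headers your instance returns.
const AUDIO_MIME_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  opus: "audio/opus",
  aac: "audio/aac",
  flac: "audio/flac",
  wav: "audio/wav",
  pcm: "audio/pcm",
};
```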

import { createOpenAI } from "@ai-sdk/openai";
import { experimental_generateSpeech as generateSpeech } from "ai";

const tie = createOpenAI({
  baseURL: "https://your-tie-host/v1",
  apiKey: "your-bearer-token",
});

const result = await generateSpeech({
  model: tie.speech("gpt-4o-mini-tts"),
  text: "Great job logging your meals today!",
  voice: "nova",
});

// result.audio is a Uint8Array of audio bytes
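To play the returned bytes in a browser, one option is wrapping them in a data URL for an audio element. A sketch (the helper name is illustrative):

```typescript
// Wrap raw audio bytes in a data URL so they can be handed to
// `new Audio(url).play()` or an <audio> element's src attribute.
function toAudioDataUrl(bytes: Uint8Array, mimeType = "audio/mpeg"): string {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return `data:${mimeType};base64,${btoa(binary)}`;
}
```

In the browser: `new Audio(toAudioDataUrl(audioBytes)).play();`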

A typical voice-enabled chat flow combines both endpoints:

  1. User speaks → record audio on the client
  2. Transcribe: POST /v1/audio/transcriptions → get the text
  3. Chat: POST /v1/chat/completions with the transcribed text → get the AI response
  4. Speak: POST /v1/audio/speech with the AI response text → play the audio
The same flow as a sequence diagram:

sequenceDiagram
    participant User
    participant Client
    participant TIE

    User->>Client: Speaks
    Client->>TIE: POST /v1/audio/transcriptions (audio file)
    TIE-->>Client: { "text": "..." }
    Client->>TIE: POST /v1/chat/completions (transcribed text)
    TIE-->>Client: AI response text
    Client->>TIE: POST /v1/audio/speech (response text)
    TIE-->>Client: Audio bytes
    Client->>User: Plays audio

Steps 3 and 4 can overlap — start TTS as soon as the chat response text is available, rather than waiting for the full response.
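One way to implement that overlap is to buffer the streamed chat text and hand each complete sentence to the speech endpoint as soon as it appears. A sketch of the splitting logic (the emitter design is illustrative, and the sentence regex is deliberately naive):

```typescript
// Accumulate streamed chat chunks and emit complete sentences as soon as
// they appear, so TTS requests can start before the response finishes.
function createSentenceEmitter(onSentence: (s: string) => void) {
  let buffer = "";
  return {
    // Call for each streamed text chunk.
    push(chunk: string) {
      buffer += chunk;
      let match: RegExpMatchArray | null;
      // Emit everything up to each terminator that is followed by whitespace.
      while ((match = buffer.match(/^[\s\S]*?[.!?]\s+/)) !== null) {
        onSentence(match[0].trim());
        buffer = buffer.slice(match[0].length);
      }
    },
    // Call when the stream ends to flush any trailing text.
    flush() {
      if (buffer.trim()) onSentence(buffer.trim());
      buffer = "";
    },
  };
}
```

Each emitted sentence would become one speech request, with playback queued in order.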

Transcription models:

| Model | Description |
| --- | --- |
| whisper-1 | OpenAI Whisper: general-purpose, supports 50+ languages |

TTS models:

| Model | Description |
| --- | --- |
| gpt-4o-mini-tts | Latest, most natural-sounding. Default. |
| tts-1 | Optimized for low latency |
| tts-1-hd | Optimized for quality |

Available voices:

| Voice | Tone |
| --- | --- |
| alloy | Neutral, balanced |
| ash | Warm, conversational |
| ballad | Soft, expressive |
| coral | Clear, friendly |
| echo | Steady, authoritative |
| fable | Expressive, storytelling |
| nova | Warm, engaging |
| onyx | Deep, rich |
| sage | Calm, measured |
| shimmer | Bright, upbeat |
| verse | Versatile, adaptive |