
Audio (Speech & Transcription)

TIE provides OpenAI-compatible audio endpoints for speech-to-text (transcription) and text-to-speech (TTS). These are drop-in replacements for OpenAI's /v1/audio/transcriptions and /v1/audio/speech — any client that works with OpenAI's audio API works with TIE by changing the base URL.

All requests require a Bearer token from TIE Auth.

Transcription

Convert audio to text. TIE proxies the request to OpenAI Whisper and returns the transcript.

POST /v1/audio/transcriptions

Send as multipart/form-data:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | required | Audio file (mp3, mp4, mpeg, mpga, m4a, wav, webm) |
| model | string | whisper-1 | Transcription model |
| language | string | null | ISO 639-1 language code (e.g. en). Improves accuracy if known. |
| prompt | string | null | Guide the model's style or continue a previous segment |
| response_format | string | json | Output format: json, text, srt, verbose_json, vtt |
| temperature | float | 0.0 | Sampling temperature (0–1). Lower is more deterministic. |
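Before uploading, a client can check a filename against the supported formats listed above and fail fast instead of waiting for the server to reject the file. A minimal sketch (the helper name and approach are illustrative, not part of TIE):

```typescript
// Formats accepted by the transcription endpoint, per the table above.
const SUPPORTED_AUDIO_EXTENSIONS = new Set([
  "mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm",
]);

// Check the file extension (case-insensitively) before uploading.
function isSupportedAudioFile(filename: string): boolean {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  return SUPPORTED_AUDIO_EXTENSIONS.has(ext);
}
```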
curl -X POST https://your-tie-host/v1/audio/transcriptions \
  -H "Authorization: Bearer $TOKEN" \
  -F file=@recording.webm \
  -F model=whisper-1

Default (json format):

{
  "text": "Hello, I'd like to log my breakfast. I had oatmeal with blueberries."
}

With response_format=verbose_json, the response includes timestamps and segment-level detail.
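If you want json output plus your own rendering, the segment timestamps in verbose_json can be turned into SRT cues client-side (the API can also return srt directly via response_format=srt). A sketch, assuming the segment fields (start and end in seconds, plus text) documented for OpenAI's Whisper API:

```typescript
// Shape of one segment in a verbose_json transcription response
// (fields per OpenAI's Whisper API, trimmed to what we use here).
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor(ms / 60_000) % 60;
  const s = Math.floor(ms / 1000) % 60;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Build numbered SRT cues from verbose_json segments.
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```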

The Vercel AI SDK supports transcription via experimental_transcribe. Point the OpenAI provider at your TIE instance:

import { createOpenAI } from "@ai-sdk/openai";
import { experimental_transcribe as transcribe } from "ai";

const tie = createOpenAI({
  baseURL: "https://your-tie-host/v1",
  apiKey: "your-bearer-token",
});

const result = await transcribe({
  model: tie.transcription("whisper-1"),
  audio: audioBuffer, // Uint8Array, Buffer, base64 string, or URL
});

console.log(result.text);
// "Hello, I'd like to log my breakfast."

Text-to-Speech

Convert text to spoken audio. TIE proxies to OpenAI's TTS models and streams the audio bytes back.

POST /v1/audio/speech

Send as JSON:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | gpt-4o-mini-tts | TTS model (gpt-4o-mini-tts, tts-1, tts-1-hd) |
| input | string | required | Text to convert to speech (max 4096 characters) |
| voice | string | alloy | Voice: alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse |
| response_format | string | mp3 | Audio format: mp3, opus, aac, flac, wav, pcm |
| speed | float | 1.0 | Speed multiplier (0.25–4.0) |
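The 4096-character cap on input means longer text has to be split across multiple requests. A minimal client-side sketch that prefers sentence boundaries (the function is hypothetical, not part of TIE):

```typescript
// Split text into chunks that respect the 4096-character `input` limit,
// breaking at sentence boundaries where possible. Each chunk becomes one
// /v1/audio/speech request.
function chunkForSpeech(text: string, maxLen = 4096): string[] {
  // Naive sentence split: runs of non-terminators, then terminators, then space.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current.length + sentence.length > maxLen && current) {
      chunks.push(current.trim());
      current = "";
    }
    // A single sentence longer than maxLen is split hard at maxLen.
    let rest = sentence;
    while (rest.length > maxLen) {
      chunks.push(rest.slice(0, maxLen));
      rest = rest.slice(maxLen);
    }
    current += rest;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Play the resulting audio segments in order to reconstruct the full speech.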
curl -X POST https://your-tie-host/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Great job logging your meals today! You hit your protein target.",
    "voice": "nova"
  }' \
  --output response.mp3

Raw audio bytes in the requested format. The Content-Type header reflects the format (e.g. audio/mpeg for mp3).
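A client that handles several formats may want a format-to-MIME lookup. Only audio/mpeg for mp3 is stated above; the other entries in this sketch are assumptions based on common MIME registrations, so verify them against your TIE instance:

```typescript
// Assumed response_format -> Content-Type mapping. mp3 -> audio/mpeg is
// documented; the rest follow common MIME registrations and should be
// checked against the actual headers your instance returns.
const AUDIO_MIME_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  opus: "audio/opus",
  aac: "audio/aac",
  flac: "audio/flac",
  wav: "audio/wav",
  pcm: "audio/pcm",
};
```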

import { createOpenAI } from "@ai-sdk/openai";
import { experimental_generateSpeech as generateSpeech } from "ai";

const tie = createOpenAI({
  baseURL: "https://your-tie-host/v1",
  apiKey: "your-bearer-token",
});

const result = await generateSpeech({
  model: tie.speech("gpt-4o-mini-tts"),
  text: "Great job logging your meals today!",
  voice: "nova",
});

// result.audio is a Uint8Array of audio bytes
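To play the returned bytes in a browser, one option is wrapping them in a data URL for an audio element. A sketch (the helper name is illustrative):

```typescript
// Wrap raw audio bytes in a data URL so they can be handed to
// `new Audio(url).play()` or an <audio> element's src attribute.
function toAudioDataUrl(bytes: Uint8Array, mimeType = "audio/mpeg"): string {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return `data:${mimeType};base64,${btoa(binary)}`;
}
```

In the browser: `new Audio(toAudioDataUrl(audioBytes)).play();`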

A typical voice-enabled chat flow combines both endpoints:

  1. User speaks → record audio on the client
  2. Transcribe: POST /v1/audio/transcriptions → get the text
  3. Chat: POST /v1/chat/completions with the transcribed text → get the AI response
  4. Speak: POST /v1/audio/speech with the AI response text → play the audio
The same flow as a sequence diagram:

sequenceDiagram
    participant User
    participant Client
    participant TIE

    User->>Client: Speaks
    Client->>TIE: POST /v1/audio/transcriptions (audio file)
    TIE-->>Client: { "text": "..." }
    Client->>TIE: POST /v1/chat/completions (transcribed text)
    TIE-->>Client: AI response text
    Client->>TIE: POST /v1/audio/speech (response text)
    TIE-->>Client: Audio bytes
    Client->>User: Plays audio

Steps 3 and 4 can overlap — start TTS as soon as the chat response text is available, rather than waiting for the full response.
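One way to implement that overlap is to buffer the streamed chat text and hand each complete sentence to the speech endpoint as soon as it appears. A sketch of the splitting logic (the emitter design is illustrative, and the sentence regex is deliberately naive):

```typescript
// Accumulate streamed chat chunks and emit complete sentences as soon as
// they appear, so TTS requests can start before the response finishes.
function createSentenceEmitter(onSentence: (s: string) => void) {
  let buffer = "";
  return {
    // Call for each streamed text chunk.
    push(chunk: string) {
      buffer += chunk;
      let match: RegExpMatchArray | null;
      // Emit everything up to each terminator that is followed by whitespace.
      while ((match = buffer.match(/^[\s\S]*?[.!?]\s+/)) !== null) {
        onSentence(match[0].trim());
        buffer = buffer.slice(match[0].length);
      }
    },
    // Call when the stream ends to flush any trailing text.
    flush() {
      if (buffer.trim()) onSentence(buffer.trim());
      buffer = "";
    },
  };
}
```

Each emitted sentence would become one speech request, with playback queued in order.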

Transcription models:

| Model | Description |
| --- | --- |
| whisper-1 | OpenAI Whisper: general-purpose, supports 50+ languages |

TTS models:

| Model | Description |
| --- | --- |
| gpt-4o-mini-tts | Latest, most natural-sounding. Default. |
| tts-1 | Optimized for low latency |
| tts-1-hd | Optimized for quality |

Available voices:

| Voice | Tone |
| --- | --- |
| alloy | Neutral, balanced |
| ash | Warm, conversational |
| ballad | Soft, expressive |
| coral | Clear, friendly |
| echo | Steady, authoritative |
| fable | Expressive, storytelling |
| nova | Warm, engaging |
| onyx | Deep, rich |
| sage | Calm, measured |
| shimmer | Bright, upbeat |
| verse | Versatile, adaptive |