# Audio (Speech & Transcription)
TIE provides OpenAI-compatible audio endpoints for speech-to-text (transcription) and text-to-speech (TTS). These are drop-in replacements for OpenAI's `/v1/audio/transcriptions` and `/v1/audio/speech` — any client that works with OpenAI's audio API works with TIE by changing the base URL.

All requests require a Bearer token from TIE Auth.
## Speech-to-Text (Transcription)

Convert audio to text. TIE proxies to OpenAI Whisper and returns the transcript.
### Endpoint

`POST /v1/audio/transcriptions`

### Parameters

Send as `multipart/form-data`:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `file` | file | required | Audio file (mp3, mp4, mpeg, mpga, m4a, wav, webm) |
| `model` | string | `whisper-1` | Transcription model |
| `language` | string | null | ISO 639-1 language code (e.g. `en`). Improves accuracy if known. |
| `prompt` | string | null | Guide the model's style or continue a previous segment |
| `response_format` | string | `json` | Output format: `json`, `text`, `srt`, `verbose_json`, `vtt` |
| `temperature` | float | 0.0 | Sampling temperature (0–1). Lower is more deterministic. |
### Example

```shell
curl -X POST https://your-tie-host/v1/audio/transcriptions \
  -H "Authorization: Bearer $TOKEN" \
  -F file=@recording.webm \
  -F model=whisper-1
```
### Response

Default (`json` format):

```json
{ "text": "Hello, I'd like to log my breakfast. I had oatmeal with blueberries." }
```

With `response_format=verbose_json`, the response includes timestamps and segment-level detail.
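The `verbose_json` fields follow OpenAI's Whisper transcription schema. An illustrative, trimmed shape (not captured from a live response; some segment fields omitted):

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 4.2,
  "text": "Hello, I'd like to log my breakfast.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 4.2,
      "text": "Hello, I'd like to log my breakfast.",
      "avg_logprob": -0.25,
      "no_speech_prob": 0.01
    }
  ]
}
```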
### AI SDK Integration

The Vercel AI SDK supports transcription via `experimental_transcribe`. Point the OpenAI provider at your TIE instance:

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { experimental_transcribe as transcribe } from "ai";

const tie = createOpenAI({
  baseURL: "https://your-tie-host/v1",
  apiKey: "your-bearer-token",
});

const result = await transcribe({
  model: tie.transcription("whisper-1"),
  audio: audioBuffer, // Uint8Array, Buffer, base64 string, or URL
});

console.log(result.text);
// "Hello, I'd like to log my breakfast."
```

## Text-to-Speech

Convert text to spoken audio. TIE proxies to OpenAI's TTS models and streams audio bytes back.
### Endpoint

`POST /v1/audio/speech`

### Parameters

Send as JSON:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | `gpt-4o-mini-tts` | TTS model (`gpt-4o-mini-tts`, `tts-1`, `tts-1-hd`) |
| `input` | string | required | Text to convert to speech (max 4096 characters) |
| `voice` | string | `alloy` | Voice: `alloy`, `ash`, `ballad`, `coral`, `echo`, `fable`, `nova`, `onyx`, `sage`, `shimmer`, `verse` |
| `response_format` | string | `mp3` | Audio format: `mp3`, `opus`, `aac`, `flac`, `wav`, `pcm` |
| `speed` | float | 1.0 | Speed multiplier (0.25–4.0) |
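Because `input` is capped at 4096 characters, longer text must be split into multiple requests. A minimal sketch of a chunking helper (hypothetical, not part of TIE) that breaks on sentence boundaries; note that a single sentence longer than the limit is passed through as an oversize chunk in this sketch:

```typescript
// Split text into chunks that each fit the TTS input limit,
// preferring to break between sentences.
function chunkForTts(text: string, maxLen = 4096): string[] {
  // Greedy split: a "sentence" is a run of text ending in ., !, or ?
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // Flush the current chunk if adding this sentence would overflow it.
    if (current && current.length + sentence.length > maxLen) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk can then be sent as a separate `POST /v1/audio/speech` request and the resulting audio segments played back in order.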
### Example

```shell
curl -X POST https://your-tie-host/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Great job logging your meals today! You hit your protein target.",
    "voice": "nova"
  }' \
  --output response.mp3
```

### Response

Raw audio bytes in the requested format. The `Content-Type` header reflects the format (e.g. `audio/mpeg` for mp3).
### AI SDK Integration

The Vercel AI SDK supports speech generation via `experimental_generateSpeech`:

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { experimental_generateSpeech as generateSpeech } from "ai";

const tie = createOpenAI({
  baseURL: "https://your-tie-host/v1",
  apiKey: "your-bearer-token",
});

const result = await generateSpeech({
  model: tie.speech("gpt-4o-mini-tts"),
  text: "Great job logging your meals today!",
  voice: "nova",
});

// result.audio is a Uint8Array of audio bytes
```

## Common Pattern: Voice Chat
A typical voice-enabled chat flow combines both endpoints:

1. **User speaks** → record audio on the client
2. **Transcribe** → `POST /v1/audio/transcriptions` → get text
3. **Chat** → `POST /v1/chat/completions` with the transcribed text → get AI response
4. **Speak** → `POST /v1/audio/speech` with the AI response text → play audio
```mermaid
sequenceDiagram
    participant User
    participant Client
    participant TIE
    User->>Client: Speaks
    Client->>TIE: POST /v1/audio/transcriptions (audio file)
    TIE-->>Client: { "text": "..." }
    Client->>TIE: POST /v1/chat/completions (transcribed text)
    TIE-->>Client: AI response text
    Client->>TIE: POST /v1/audio/speech (response text)
    TIE-->>Client: Audio bytes
    Client->>User: Plays audio
```
Steps 3 and 4 can overlap: with a streaming chat response, start TTS as soon as the first complete sentences arrive rather than waiting for the full response.
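The flow above can be sketched as one function. The three request helpers are injected here (hypothetical signatures, not TIE APIs) so the orchestration itself is clear and testable; in production each caller would issue the corresponding TIE request:

```typescript
// One voice-chat turn: transcribe -> chat -> speak.
// Each injected caller maps to one endpoint in the sequence above.
type Transcribe = (audio: Uint8Array) => Promise<string>;  // POST /v1/audio/transcriptions
type Chat = (userText: string) => Promise<string>;         // POST /v1/chat/completions
type Speak = (replyText: string) => Promise<Uint8Array>;   // POST /v1/audio/speech

async function voiceChatTurn(
  audio: Uint8Array,
  transcribe: Transcribe,
  chat: Chat,
  speak: Speak,
): Promise<{ userText: string; replyText: string; replyAudio: Uint8Array }> {
  const userText = await transcribe(audio);   // step 2
  const replyText = await chat(userText);     // step 3
  const replyAudio = await speak(replyText);  // step 4
  return { userText, replyText, replyAudio };
}
```

This sequential version waits for the full chat response before speaking; a streaming variant would feed completed sentences from step 3 into step 4 as they arrive.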
## Models

### Transcription
Section titled “Transcription”| Model | Description |
|---|---|
whisper-1 | OpenAI Whisper — general-purpose, supports 50+ languages |
### Text-to-Speech

| Model | Description |
|---|---|
| `gpt-4o-mini-tts` | Latest, most natural-sounding. Default. |
| `tts-1` | Optimized for low latency |
| `tts-1-hd` | Optimized for quality |
### Voices

| Voice | Tone |
|---|---|
| `alloy` | Neutral, balanced |
| `ash` | Warm, conversational |
| `ballad` | Soft, expressive |
| `coral` | Clear, friendly |
| `echo` | Steady, authoritative |
| `fable` | Expressive, storytelling |
| `nova` | Warm, engaging |
| `onyx` | Deep, rich |
| `sage` | Calm, measured |
| `shimmer` | Bright, upbeat |
| `verse` | Versatile, adaptive |