
Real-Time Voice (Gemini Live)

TIE provides real-time voice-to-voice conversation powered by Gemini Live. The client opens a WebSocket to TIE, sends raw PCM audio, and receives audio back — TIE handles auth, persona injection, memory context, tool routing, and identity guardrails transparently.

Per-user voice and language preferences are stored server-side and applied automatically to each session. A separate HTTP endpoint provides standalone text-to-speech.


Open a persistent bidirectional audio stream between the client and Gemini Live.

WebSocket /v1/voice/agent

Auth via the WebSocket subprotocol header (browsers can't send custom headers on the handshake, so the bearer token is smuggled through Sec-WebSocket-Protocol):

new WebSocket("wss://your-tie-host/v1/voice/agent", ["tie.bearer", "<bearer-token>"])

The server reads the second subprotocol value as the token and accepts tie.bearer to complete the handshake. This keeps the token out of URL paths and proxy access logs.

sequenceDiagram
    participant Client
    participant TIE
    participant Gemini

    Client->>TIE: WebSocket connect (Sec-WebSocket-Protocol: tie.bearer, token)
    Client->>TIE: JSON session config (first message)
    TIE->>TIE: Load persona + memory + apply guardrails
    TIE->>Gemini: Open Live session
    TIE-->>Client: {"type": "session_ready", "session_id": "...", "thread_id": "..."}

    loop Bidirectional relay
        Client->>TIE: Binary frame (PCM audio chunk)
        TIE->>Gemini: Audio
        Gemini-->>TIE: Audio + transcripts + tool calls
        TIE-->>Client: Binary frame (audio) + JSON events
    end

    Client->>TIE: {"type": "control", "action": "disconnect"}
    TIE->>TIE: Flush transcripts → thread + memory pipeline

Send this as the first JSON message after connecting:

{
  "voice": null,
  "language": null,
  "agent_id": "my-app",
  "persona_id": "your-persona-id-or-null",
  "thread_id": "existing-thread-id-or-null",
  "skill_instructions": "You are currently in the MyApp experience.\nCurrent time: Sunday, April 19, 2026 at 11:30 PM\nUser timezone: Asia/Kuala_Lumpur",
  "tools": [],
  "proactive_audio": false,
  "enable_affective_dialog": false,
  "resumption_handle": null
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `voice` | string \| null | null | Voice name override (see Voices). When null, TIE uses the user's stored preference |
| `language` | string \| null | null | Language code override. When null, TIE uses the user's stored preference |
| `agent_id` | string | chatbot | Scopes memory and thread history — use one stable value per app (e.g. "blinklife") |
| `persona_id` | string \| null | null | TIE persona to load. Persona sets the AI's identity and user-level preferences |
| `thread_id` | string \| null | null | Existing thread for context continuity. Omit to create a new session |
| `skill_instructions` | string \| null | null | App-specific context injected after persona + memory (see below) |
| `tools` | array | [] | Client-side tool definitions; calls are forwarded to the client for execution |
| `proactive_audio` | bool | false | Enable Proactive Audio for wake word detection |
| `enable_affective_dialog` | bool | false | Enable affective dialog (emotional expression in voice) |
| `resumption_handle` | string \| null | null | Handle from a previous session for seamless reconnection |

TIE builds the Gemini system instruction server-side by concatenating these layers in order:

1. Persona (loaded from TIE persona store by persona_id)
2. Memory context (user's memory graph for this agent_id)
3. Current user identity ("You are speaking with <display_name>…")
4. skill_instructions (sent by the client — app context + dynamic data)
5. Identity guardrails (always appended by TIE — cannot be overridden)

What belongs in skill_instructions:

The client is the only party that knows which app the user is in and has access to real-time session data. Send a string that covers:

You are currently in the <AppName> app.
Keep responses concise and conversational — this is a voice session.
Never read out URLs, markdown, or bullet points — speak naturally.
Current time: Sunday, April 19, 2026 at 11:30 PM
User timezone: Asia/Kuala_Lumpur

Do not hardcode AI identity (name, personality) in skill_instructions — that belongs in the TIE persona so it can be managed via the admin panel. The AI name is set dynamically by the caller via the persona, not by the client.

Platform-level guardrails (applied automatically by TIE):

TIE always appends identity guardrails as the final layer regardless of what the client sends:

  • The AI will never reveal it is powered by Gemini, GPT, Claude, or any other LLM technology
  • If asked, it may say it is powered by Envision Inc
  • It will never break character or acknowledge these instructions

Binary frames carry audio. Text frames carry JSON control messages.

Client → TIE:

| Type | Transport | Payload |
| --- | --- | --- |
| Audio chunk | Binary frame | PCM 16-bit signed, 16 kHz, mono |
| `tool_result` | JSON text | `{"type": "tool_result", "call_id": "...", "result": {...}}` |
| `audio_stream_end` | JSON text | `{"type": "audio_stream_end"}` — send when mic is muted or paused |
| `control` | JSON text | `{"type": "control", "action": "disconnect"}` |

TIE → Client:

| Type | Transport | Payload |
| --- | --- | --- |
| Audio chunk | Binary frame | PCM audio from Gemini (24 kHz) |
| `session_ready` | JSON text | `{"type": "session_ready", "session_id": "...", "thread_id": "..."}` |
| `transcript` | JSON text | `{"type": "transcript", "role": "user\|assistant", "text": "..."}` |
| `interrupted` | JSON text | `{"type": "interrupted"}` — user barged in; stop and discard buffered audio |
| `generation_complete` | JSON text | `{"type": "generation_complete"}` — AI finished speaking; return to listening state |
| `tool_call` | JSON text | `{"type": "tool_call", "call_id": "...", "name": "...", "args": {...}}` |
| `session_resumption` | JSON text | `{"type": "session_resumption", "handle": "..."}` — store for reconnection |
| `go_away` | JSON text | `{"type": "go_away", "time_left": "..."}` — Gemini signaling imminent disconnect |
| `error` | JSON text | `{"type": "error", "code": "...", "message": "..."}` |

| Direction | Format |
| --- | --- |
| Client → TIE | PCM 16-bit signed, 16 kHz, mono |
| TIE → Client | PCM 16-bit signed, 24 kHz, mono |

Use an AudioWorklet to convert browser mic float32 samples to int16 PCM before sending.

Playback: Create your AudioBuffer at 24 kHz — not at the AudioContext's native sample rate. Using the wrong rate causes chipmunk-pitched or slow playback.

Interrupt handling: When you receive interrupted, stop all scheduled AudioBufferSourceNodes by calling .stop() on each tracked node and reset your playback clock. Do not close and recreate the AudioContext — creating a new context outside a user gesture may leave it in a suspended state, silently breaking all subsequent audio playback.

TIE routes tool calls into two categories:

TIE-native tools — executed server-side, invisible to the client:

  • memory_search — searches the user's memory graph
  • memory_write — stores a new memory

Client-provided tools — defined in tools at session start, forwarded to the client for execution. When Gemini calls one, TIE sends a tool_call event. The client must reply within 30 seconds with a tool_result message or TIE returns a timeout error to Gemini.

// Client receives:
{"type": "tool_call", "call_id": "abc123", "name": "get_calendar", "args": {"date": "today"}}
// Client replies:
{"type": "tool_result", "call_id": "abc123", "result": {"events": [...]}}
| Limit | Value |
| --- | --- |
| Max session duration | `VOICE_SESSION_TIMEOUT` (default 900s) |
| Gemini native limit | 15 min (TIE enables context window compression to extend) |
| Tool call timeout | 30s per call |
| Audio send timeout | 5s (slow clients are disconnected) |

Gemini periodically resets WebSocket connections. TIE enables seamless reconnection:

  1. During a session, TIE forwards session_resumption events with a handle
  2. Store the latest handle in memory or sessionStorage
  3. On reconnect, pass resumption_handle in the session config
  4. Gemini restores conversation context from the handle

At session end, TIE writes buffered transcripts to the thread and runs the memory pipeline. Pass the thread_id from session_ready to subsequent chat requests to continue in the same context.
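Putting the reconnect steps and thread continuity together, a client might track the latest handle and thread and rebuild the session config on reconnect. The helper and interface names below are illustrative, not part of TIE; only the field names come from the session config documented above:

```typescript
// Illustrative shape for the subset of the session config used on reconnect.
interface ReconnectConfig {
  agent_id: string;
  thread_id: string | null;          // from the original session_ready event
  resumption_handle: string | null;  // from the last session_resumption event
  skill_instructions: string | null;
}

// Build the first JSON message for a reconnecting session.
function buildReconnectConfig(
  agentId: string,
  threadId: string | null,
  storedHandle: string | null,
  skillInstructions: string,
): ReconnectConfig {
  return {
    agent_id: agentId,
    // Reuse the thread so transcripts land in the same conversation
    thread_id: threadId,
    // Gemini restores in-session context from the handle when present
    resumption_handle: storedHandle,
    skill_instructions: skillInstructions,
  };
}

// Usage sketch:
// ws.onopen = () => ws.send(JSON.stringify(
//   buildReconnectConfig('my-app', savedThreadId, savedHandle, skillInstructions)
// ));
```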


A production-ready hook that handles the full WebSocket lifecycle, audio pipeline, session resumption, and tool call dispatch.

import { useTieVoiceSession } from '@/hooks/use-tie-voice-session';
const {
  voiceMode, // 'idle' | 'listening' | 'processing' | 'speaking'
  interimTranscript,
  error,
  startVoice,
  stopVoice,
  interrupt,
} = useTieVoiceSession({
  agentId: 'my-app',
  personaId: companion?.tiePersonaId,
  voiceId: companion?.voiceId, // Gemini voice from companion config — overrides stored preference
  skillInstructions: buildSkillInstructions(companion?.name, user?.displayName),
  getAccessToken,
  onToolCall: async (name, args) => {
    // Handle client-side tool calls from Gemini
    if (name === 'get_calendar') return { events: await fetchCalendar(args.date) };
    return { error: 'Unknown tool' };
  },
});

Building skillInstructions:

function buildSkillInstructions(companionName?: string, userName?: string): string {
  const tz = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const now = new Date().toLocaleString('en-US', {
    timeZone: tz,
    dateStyle: 'full',
    timeStyle: 'short',
  });
  return [
    // App context — replace with your app name
    'You are currently in the MyApp experience.\nKeep responses concise and conversational — this is a voice session.\nNever read out URLs, markdown, or bullet points — speak naturally.',
    // Dynamic companion/user context
    companionName ? `Your name is ${companionName}.` : '',
    userName ? `You are speaking with ${userName}. Address them by name when natural.` : '',
    // Always include timezone so the AI gives correct times
    `Current time: ${now}\nUser timezone: ${tz}`,
  ].filter(Boolean).join('\n\n');
}

Voice mode state machine:

idle ──────── startVoice() ─────────────────────▶ listening
listening ─── user speaks ──────────────────────▶ processing   (transcript received, waiting for AI)
processing ── AI responds ──────────────────────▶ speaking     (binary audio frames arriving)
speaking ──── generation_complete / interrupted ▶ listening
any state ─── stopVoice() ──────────────────────▶ idle
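The transitions above can be expressed as a pure transition function, which makes the UI state easy to unit-test. The event names paraphrase the JSON events documented earlier; the reducer itself is an illustrative sketch, not part of the hook's API:

```typescript
type VoiceMode = 'idle' | 'listening' | 'processing' | 'speaking';

type VoiceEvent =
  | 'start'               // startVoice() called
  | 'user_transcript'     // transcript event with role "user"
  | 'audio_frame'         // binary audio frame arrived
  | 'generation_complete' // AI finished speaking
  | 'interrupted'         // user barged in
  | 'stop';               // stopVoice() called

// Pure transition function mirroring the state machine diagram.
function nextVoiceMode(mode: VoiceMode, event: VoiceEvent): VoiceMode {
  if (event === 'stop') return 'idle'; // stopVoice() exits from any state
  switch (mode) {
    case 'idle':
      return event === 'start' ? 'listening' : mode;
    case 'listening':
      return event === 'user_transcript' ? 'processing' : mode;
    case 'processing':
      return event === 'audio_frame' ? 'speaking' : mode;
    case 'speaking':
      // Both normal completion and barge-in return to listening
      return event === 'generation_complete' || event === 'interrupted'
        ? 'listening'
        : mode;
    default:
      return mode;
  }
}
```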

UI state indicator:

function VoiceStateIndicator({ voiceMode }: { voiceMode: string }) {
  if (voiceMode === 'idle') return null;
  return (
    <div className={`voice-indicator ${voiceMode}`}>
      {voiceMode === 'listening' && 'Listening…'}
      {voiceMode === 'processing' && 'Thinking…'}
      {voiceMode === 'speaking' && 'Speaking…'}
    </div>
  );
}

For apps not using the hook, here is the minimal connection pattern:

async function startVoiceSession(token: string) {
  const tz = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const now = new Date().toLocaleString('en-US', { timeZone: tz, dateStyle: 'full', timeStyle: 'short' });

  // 1. Set up capture at 16 kHz
  const captureCtx = new AudioContext({ sampleRate: 16000 });
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true },
  });
  const micSource = captureCtx.createMediaStreamSource(stream);
  await captureCtx.audioWorklet.addModule('/worklets/pcm-processor.js');
  const workletNode = new AudioWorkletNode(captureCtx, 'pcm-processor');
  micSource.connect(workletNode);

  // 2. Set up playback
  const playbackCtx = new AudioContext();
  let nextPlayTime = 0;
  const activeSources: AudioBufferSourceNode[] = [];

  function playPcm(data: ArrayBuffer) {
    const int16 = new Int16Array(data);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;
    // Always 24 kHz — the rate TIE sends, not the context's native rate
    const buffer = playbackCtx.createBuffer(1, float32.length, 24000);
    buffer.copyToChannel(float32, 0);
    const source = playbackCtx.createBufferSource();
    source.buffer = buffer;
    source.connect(playbackCtx.destination);
    const startAt = Math.max(nextPlayTime, playbackCtx.currentTime);
    source.start(startAt);
    nextPlayTime = startAt + buffer.duration;
    activeSources.push(source);
    source.onended = () => activeSources.splice(activeSources.indexOf(source), 1);
  }

  function flushAudio() {
    activeSources.forEach(s => { try { s.stop(); } catch {} });
    activeSources.length = 0;
    nextPlayTime = 0;
  }

  // 3. Open WebSocket
  const ws = new WebSocket('wss://your-tie-host/v1/voice/agent', ['tie.bearer', token]);
  ws.binaryType = 'arraybuffer';

  ws.onopen = () => {
    ws.send(JSON.stringify({
      agent_id: 'my-app',
      // voice and language omitted — TIE uses the user's stored preference.
      // Pass voice explicitly to override: voice: 'Aoede'
      skill_instructions: `You are currently in the MyApp experience.\nKeep responses concise and conversational.\nCurrent time: ${now}\nUser timezone: ${tz}`,
    }));
  };

  ws.onmessage = (event) => {
    if (event.data instanceof ArrayBuffer) {
      playPcm(event.data);
      return;
    }
    const msg = JSON.parse(event.data);
    if (msg.type === 'session_ready') {
      // Start streaming mic audio
      workletNode.port.onmessage = (e: MessageEvent<ArrayBuffer>) => {
        if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
      };
    }
    if (msg.type === 'interrupted') flushAudio();
    if (msg.type === 'session_resumption') sessionStorage.setItem('tie-voice-handle', msg.handle);
    if (msg.type === 'tool_call') {
      // Execute the client-side tool, then reply with its result
      executeClientTool(msg.name, msg.args).then(result => {
        ws.send(JSON.stringify({ type: 'tool_result', call_id: msg.call_id, result }));
      });
    }
    if (msg.type === 'error') console.error('TIE error:', msg.code, msg.message);
  };

  // 4. Mute: disable track + signal end of audio stream
  function mute() {
    stream.getAudioTracks()[0].enabled = false;
    ws.send(JSON.stringify({ type: 'audio_stream_end' }));
  }

  // 5. Disconnect cleanly
  function disconnect() {
    ws.send(JSON.stringify({ type: 'control', action: 'disconnect' }));
    ws.close();
    stream.getTracks().forEach(t => t.stop());
    captureCtx.close();
    playbackCtx.close();
  }

  return { mute, disconnect };
}

Place this file at /public/worklets/pcm-processor.js:

class PcmProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0]?.[0];
    if (!input?.length) return true;
    const int16 = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      int16[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
    }
    this.port.postMessage(int16.buffer, [int16.buffer]);
    return true;
  }
}
registerProcessor('pcm-processor', PcmProcessor);

Store a user's preferred voice and language server-side so every session picks them up automatically without the client having to pass them each time.

GET /v1/voice/preferences
Authorization: Bearer <token>
{
  "provider": "gemini",
  "voice": "Kore",
  "language": "en"
}

Returns stored preferences, or defaults (provider: "gemini", voice: "Kore", language: "en") if none have been saved yet.

PUT /v1/voice/preferences
Authorization: Bearer <token>
Content-Type: application/json
{
  "provider": "gemini",
  "voice": "Aoede",
  "language": "en"
}
| Field | Type | Description |
| --- | --- | --- |
| `provider` | string | TTS provider. Currently only "gemini" is supported |
| `voice` | string | Voice name (see Voices) |
| `language` | string | BCP-47 language code (e.g. "en", "es", "ms") |

Responds with the saved preferences.

Preferences apply to WebSocket sessions when voice and language are omitted (or null) in the session config. Pass them explicitly in the session config to override for a single session.
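As a sketch, reading and updating preferences from a browser or Node 18+ client looks like this. The function names are illustrative; only the endpoints, methods, and payload fields come from this page:

```typescript
interface VoicePreferences {
  provider: string; // currently always "gemini"
  voice: string;
  language: string;
}

// Assemble a valid PUT body; provider is pinned to the only supported value.
function buildPreferences(voice: string, language: string): VoicePreferences {
  return { provider: 'gemini', voice, language };
}

// GET /v1/voice/preferences — returns stored preferences or defaults.
async function getVoicePreferences(host: string, token: string): Promise<VoicePreferences> {
  const res = await fetch(`${host}/v1/voice/preferences`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`preferences fetch failed: ${res.status}`);
  return res.json();
}

// PUT /v1/voice/preferences — saves and echoes back the preferences.
async function setVoicePreferences(
  host: string,
  token: string,
  prefs: VoicePreferences,
): Promise<VoicePreferences> {
  const res = await fetch(`${host}/v1/voice/preferences`, {
    method: 'PUT',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(prefs),
  });
  if (!res.ok) throw new Error(`preferences update failed: ${res.status}`);
  return res.json();
}

// Usage: await setVoicePreferences(host, token, buildPreferences('Aoede', 'en'));
```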


Generate speech from text without opening a voice session.

POST /v1/voice/tts
{
  "text": "Great job today!",
  "voice": "Kore",
  "language": "en",
  "format": "pcm"
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | string | (required) | Text to synthesize (max 4000 characters) |
| `voice` | string | Kore | Voice name (see Voices) |
| `language` | string | en | BCP-47 language code |
| `format` | string | pcm | Output container: pcm (raw bytes) or wav (with 44-byte header) |
# Raw PCM
curl -X POST https://your-tie-host/v1/voice/tts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"text": "Great job today!", "voice": "Kore"}' \
  --output response.pcm

# WAV — playable by any audio player without client-side transcoding
curl -X POST https://your-tie-host/v1/voice/tts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"text": "Great job today!", "voice": "Kore", "format": "wav"}' \
  --output response.wav

pcm (default): raw PCM — 24 kHz, signed 16-bit little-endian, mono. Content-Type: audio/pcm;rate=24000.

wav: same PCM wrapped in a standard 44-byte WAV header. Content-Type: audio/wav. Playable directly in browsers and mobile apps without transcoding.
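If you fetched the raw pcm format and later need a playable file, the 44-byte header can be prepended client-side. A sketch assuming the documented 24 kHz mono 16-bit output (`pcmToWav` is an illustrative helper, not part of TIE):

```typescript
// Wrap raw 16-bit mono little-endian PCM in a minimal 44-byte WAV header.
function pcmToWav(pcm: Uint8Array, sampleRate = 24000): Uint8Array {
  const numChannels = 1;
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * (bitsPerSample / 8);
  const blockAlign = numChannels * (bitsPerSample / 8);
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeStr = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) out[offset + i] = s.charCodeAt(i);
  };
  writeStr(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // RIFF chunk size
  writeStr(8, 'WAVE');
  writeStr(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeStr(36, 'data');
  view.setUint32(40, pcm.length, true);     // data chunk size
  out.set(pcm, 44);
  return out;
}

// Usage: const wavBytes = pcmToWav(new Uint8Array(await res.arrayBuffer()));
```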


| Voice | Character |
| --- | --- |
| Kore | Firm, grounded |
| Puck | Playful, upbeat |
| Charon | Calm, measured |
| Fenrir | Confident, direct |
| Aoede | Smooth, warm |
| Leda | Clear, friendly |
| Orus | Steady, authoritative |
| Zephyr | Light, energetic |

Full catalog: Gemini voice options