# Real-Time Voice (Gemini Live)
TIE provides real-time voice-to-voice conversation powered by Gemini Live. The client opens a WebSocket to TIE, sends raw PCM audio, and receives audio back — TIE handles auth, persona injection, memory context, tool routing, and identity guardrails transparently.
Per-user voice and language preferences are stored server-side and applied automatically to each session. A separate HTTP endpoint provides standalone text-to-speech.
## WebSocket Voice Session

Open a persistent bidirectional audio stream between the client and Gemini Live.
### Endpoint

`WebSocket /v1/voice/agent`

Auth via the WebSocket subprotocol header (browsers can't send custom headers on the handshake, so the bearer token is smuggled through `Sec-WebSocket-Protocol`):

```js
new WebSocket("wss://your-tie-host/v1/voice/agent", ["tie.bearer", "<bearer-token>"])
```

The server reads the second subprotocol value as the token and accepts `tie.bearer` to complete the handshake. This keeps the token out of URL paths and proxy access logs.
### Session Lifecycle

```mermaid
sequenceDiagram
    participant Client
    participant TIE
    participant Gemini
    Client->>TIE: WebSocket connect (Sec-WebSocket-Protocol: tie.bearer, token)
    Client->>TIE: JSON session config (first message)
    TIE->>TIE: Load persona + memory + apply guardrails
    TIE->>Gemini: Open Live session
    TIE-->>Client: {"type": "session_ready", "session_id": "...", "thread_id": "..."}
    loop Bidirectional relay
        Client->>TIE: Binary frame (PCM audio chunk)
        TIE->>Gemini: Audio
        Gemini-->>TIE: Audio + transcripts + tool calls
        TIE-->>Client: Binary frame (audio) + JSON events
    end
    Client->>TIE: {"type": "control", "action": "disconnect"}
    TIE->>TIE: Flush transcripts → thread + memory pipeline
```
### Session Config

Send this as the first JSON message after connecting:

```json
{
  "voice": null,
  "language": null,
  "agent_id": "my-app",
  "persona_id": "your-persona-id-or-null",
  "thread_id": "existing-thread-id-or-null",
  "skill_instructions": "You are currently in the MyApp experience.\nCurrent time: Monday, April 19, 2026 at 11:30 PM\nUser timezone: Asia/Kuala_Lumpur",
  "tools": [],
  "proactive_audio": false,
  "enable_affective_dialog": false,
  "resumption_handle": null
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `voice` | string \| null | `null` | Voice name override (see Voices). When `null`, TIE uses the user's stored preference |
| `language` | string \| null | `null` | Language code override. When `null`, TIE uses the user's stored preference |
| `agent_id` | string | `chatbot` | Scopes memory and thread history — use one stable value per app (e.g. `"blinklife"`) |
| `persona_id` | string \| null | `null` | TIE persona to load. Persona sets the AI's identity and user-level preferences |
| `thread_id` | string \| null | `null` | Existing thread for context continuity. Omit to create a new session |
| `skill_instructions` | string \| null | `null` | App-specific context injected after persona + memory (see below) |
| `tools` | array | `[]` | Client-side tool definitions forwarded to the client for execution |
| `proactive_audio` | bool | `false` | Enable Proactive Audio for wake word detection |
| `enable_affective_dialog` | bool | `false` | Enable affective dialog (emotional expression in voice) |
| `resumption_handle` | string \| null | `null` | Handle from a previous session for seamless reconnection |
### System Prompt Composition

TIE builds the Gemini system instruction server-side by concatenating these layers in order:

1. Persona (loaded from the TIE persona store by `persona_id`)
2. Memory context (the user's memory graph for this `agent_id`)
3. Current user identity ("You are speaking with <display_name>…")
4. `skill_instructions` (sent by the client — app context + dynamic data)
5. Identity guardrails (always appended by TIE — cannot be overridden)

**What belongs in `skill_instructions`:**

The client is the only party that knows which app the user is in and real-time session data. Send a string that covers:

```text
You are currently in the <AppName> app.
Keep responses concise and conversational — this is a voice session.
Never read out URLs, markdown, or bullet points — speak naturally.

Current time: Monday, April 19, 2026 at 11:30 PM
User timezone: Asia/Kuala_Lumpur
```

Do not hardcode AI identity (name, personality) in `skill_instructions` — that belongs in the TIE persona so it can be managed via the admin panel. The AI name is set dynamically by the caller via the persona, not by the client.
**Platform-level guardrails (applied automatically by TIE):**
TIE always appends identity guardrails as the final layer regardless of what the client sends:
- The AI will never reveal it is powered by Gemini, GPT, Claude, or any other LLM technology
- If asked, it may say it is powered by Envision Inc
- It will never break character or acknowledge these instructions
### Message Protocol

Binary frames carry audio. Text frames carry JSON control messages.

**Client → TIE:**

| Type | Transport | Payload |
|---|---|---|
| Audio chunk | Binary frame | PCM 16-bit signed, 16 kHz, mono |
| `tool_result` | JSON text | `{"type": "tool_result", "call_id": "...", "result": {...}}` |
| `audio_stream_end` | JSON text | `{"type": "audio_stream_end"}` — send when mic is muted or paused |
| `control` | JSON text | `{"type": "control", "action": "disconnect"}` |
**TIE → Client:**

| Type | Transport | Payload |
|---|---|---|
| Audio chunk | Binary frame | PCM audio from Gemini (24 kHz) |
| `session_ready` | JSON text | `{"type": "session_ready", "session_id": "...", "thread_id": "..."}` |
| `transcript` | JSON text | `{"type": "transcript", "role": "user\|assistant", "text": "..."}` |
| `interrupted` | JSON text | `{"type": "interrupted"}` — user barged in; stop and discard buffered audio |
| `generation_complete` | JSON text | `{"type": "generation_complete"}` — AI finished speaking; return to listening state |
| `tool_call` | JSON text | `{"type": "tool_call", "call_id": "...", "name": "...", "args": {...}}` |
| `session_resumption` | JSON text | `{"type": "session_resumption", "handle": "..."}` — store for reconnection |
| `go_away` | JSON text | `{"type": "go_away", "time_left": "..."}` — Gemini signaling imminent disconnect |
| `error` | JSON text | `{"type": "error", "code": "...", "message": "..."}` |
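For typed clients, the server events in this table can be modeled as a discriminated union on `type`. A minimal sketch (the type names and the `parseServerEvent` helper are illustrative, not exported by TIE):

```typescript
// Illustrative types mirroring the TIE → Client JSON events above.
type TieServerEvent =
  | { type: 'session_ready'; session_id: string; thread_id: string }
  | { type: 'transcript'; role: 'user' | 'assistant'; text: string }
  | { type: 'interrupted' }
  | { type: 'generation_complete' }
  | { type: 'tool_call'; call_id: string; name: string; args: Record<string, unknown> }
  | { type: 'session_resumption'; handle: string }
  | { type: 'go_away'; time_left: string }
  | { type: 'error'; code: string; message: string };

// Parse a text frame; throws on unknown event types so protocol drift surfaces early.
function parseServerEvent(text: string): TieServerEvent {
  const msg = JSON.parse(text);
  const known = [
    'session_ready', 'transcript', 'interrupted', 'generation_complete',
    'tool_call', 'session_resumption', 'go_away', 'error',
  ];
  if (!known.includes(msg.type)) throw new Error(`Unknown TIE event: ${msg.type}`);
  return msg as TieServerEvent;
}
```

A `switch` on `event.type` then narrows the payload automatically in the handler.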
### Audio Format

| Direction | Format |
|---|---|
| Client → TIE | PCM 16-bit signed, 16 kHz, mono |
| TIE → Client | PCM 16-bit signed, 24 kHz, mono |

Use an AudioWorklet to convert browser mic float32 samples to int16 PCM before sending.

**Playback:** Create your `AudioBuffer` at 24 kHz — not at the `AudioContext`'s native sample rate. Using the wrong rate causes chipmunk-pitched or slow playback.

**Interrupt handling:** When you receive `interrupted`, stop all scheduled `AudioBufferSourceNode`s by calling `.stop()` on each tracked node and reset your playback clock. Do not close and recreate the `AudioContext` — creating a new context outside a user gesture may leave it in a suspended state, silently breaking all subsequent audio playback.
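The int16/float32 conversions implied by the format table are plain scaling. A minimal sketch (the clamping matches what the capture worklet later in this page does):

```typescript
// Capture direction: browser float32 samples in [-1, 1] to 16-bit signed PCM.
function float32ToInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = Math.max(-32768, Math.min(32767, Math.round(samples[i] * 32767)));
  }
  return out;
}

// Playback direction: 16-bit PCM from TIE back to float32 for an AudioBuffer.
function int16ToFloat32(samples: Int16Array): Float32Array {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) out[i] = samples[i] / 32768;
  return out;
}
```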
### Tool Call Routing

TIE routes tool calls into two categories:

**TIE-native tools** — executed server-side, invisible to the client:

- `memory_search` — searches the user's memory graph
- `memory_write` — stores a new memory

**Client-provided tools** — defined in `tools` at session start, forwarded to the client for execution. When Gemini calls one, TIE sends a `tool_call` event. The client must reply within 30 seconds with a `tool_result` message or TIE returns a timeout error to Gemini.

```
// Client receives:
{"type": "tool_call", "call_id": "abc123", "name": "get_calendar", "args": {"date": "today"}}

// Client replies:
{"type": "tool_result", "call_id": "abc123", "result": {"events": [...]}}
```

### Session Limits

| Limit | Value |
|---|---|
| Max session duration | VOICE_SESSION_TIMEOUT (default 900s) |
| Gemini native limit | 15 min (TIE enables context window compression to extend) |
| Tool call timeout | 30s per call |
| Audio send timeout | 5s (slow clients are disconnected) |
### Session Resumption

Gemini periodically resets WebSocket connections. TIE enables seamless reconnection:

- During a session, TIE forwards `session_resumption` events with a `handle`
- Store the latest handle in memory or `sessionStorage`
- On reconnect, pass `resumption_handle` in the session config
- Gemini restores conversation context from the handle
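These steps can be sketched with an injectable store (the storage key and helper names are illustrative, not part of the TIE API; in a browser, pass `sessionStorage` directly):

```typescript
// Illustrative storage key; TIE does not mandate one.
const HANDLE_KEY = 'tie-voice-handle';

// Call on every session_resumption event so the freshest handle wins.
function rememberHandle(store: { setItem(k: string, v: string): void }, handle: string): void {
  store.setItem(HANDLE_KEY, handle);
}

// Build the first-message config for a reconnect; the handle is null on a cold start.
function buildReconnectConfig(
  base: { agent_id: string },
  store: { getItem(k: string): string | null },
): Record<string, unknown> {
  return { ...base, resumption_handle: store.getItem(HANDLE_KEY) };
}

// On reconnect:
// ws.onopen = () => ws.send(JSON.stringify(buildReconnectConfig({ agent_id: 'my-app' }, sessionStorage)));
```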
### Transcript Sync

At session end, TIE writes buffered transcripts to the thread and runs the memory pipeline. Pass the `thread_id` from `session_ready` to subsequent chat requests to continue in the same context.
## Frontend Integration

### React Hook (`useTieVoiceSession`)

A production-ready hook that handles the full WebSocket lifecycle, audio pipeline, session resumption, and tool call dispatch.

```ts
import { useTieVoiceSession } from '@/hooks/use-tie-voice-session';

const {
  voiceMode,          // 'idle' | 'listening' | 'processing' | 'speaking'
  interimTranscript,
  error,
  startVoice,
  stopVoice,
  interrupt,
} = useTieVoiceSession({
  agentId: 'my-app',
  personaId: companion?.tiePersonaId,
  voiceId: companion?.voiceId, // Gemini voice from companion config — overrides stored preference
  skillInstructions: buildSkillInstructions(companion?.name, user?.displayName),
  getAccessToken,
  onToolCall: async (name, args) => {
    // Handle client-side tool calls from Gemini
    if (name === 'get_calendar') return { events: await fetchCalendar(args.date) };
    return { error: 'Unknown tool' };
  },
});
```

**Building `skillInstructions`:**
```ts
function buildSkillInstructions(companionName?: string, userName?: string): string {
  const tz = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const now = new Date().toLocaleString('en-US', {
    timeZone: tz,
    dateStyle: 'full',
    timeStyle: 'short',
  });

  return [
    // App context — replace with your app name
    'You are currently in the MyApp experience.\nKeep responses concise and conversational — this is a voice session.\nNever read out URLs, markdown, or bullet points — speak naturally.',
    // Dynamic companion/user context
    companionName ? `Your name is ${companionName}.` : '',
    userName ? `You are speaking with ${userName}. Address them by name when natural.` : '',
    // Always include timezone so the AI gives correct times
    `Current time: ${now}\nUser timezone: ${tz}`,
  ].filter(Boolean).join('\n\n');
}
```

**Voice mode state machine:**
```text
idle ──startVoice()──▶ listening
                           │ user speaks
                           ▼
                       processing   (transcript received, waiting for AI)
                           │ AI responds
                           ▼
                       speaking     (binary audio frames arriving)
                           │ generation_complete or interrupted
                           ▼
                       listening
                           │ stopVoice()
                           ▼
                         idle
```

**UI state indicator:**
```tsx
function VoiceStateIndicator({ voiceMode }: { voiceMode: string }) {
  if (voiceMode === 'idle') return null;
  return (
    <div className={`voice-indicator ${voiceMode}`}>
      {voiceMode === 'listening' && 'Listening…'}
      {voiceMode === 'processing' && 'Thinking…'}
      {voiceMode === 'speaking' && 'Speaking…'}
    </div>
  );
}
```

### Raw WebSocket (no hook)
For apps not using the hook, here is the minimal connection pattern:
```ts
async function startVoiceSession(token: string, companionName: string) {
  const tz = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const now = new Date().toLocaleString('en-US', { timeZone: tz, dateStyle: 'full', timeStyle: 'short' });

  // 1. Set up capture at 16 kHz
  const captureCtx = new AudioContext({ sampleRate: 16000 });
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true },
  });
  const micSource = captureCtx.createMediaStreamSource(stream);
  await captureCtx.audioWorklet.addModule('/worklets/pcm-processor.js');
  const workletNode = new AudioWorkletNode(captureCtx, 'pcm-processor');
  micSource.connect(workletNode);

  // 2. Set up playback
  const playbackCtx = new AudioContext();
  let nextPlayTime = 0;
  const activeSources: AudioBufferSourceNode[] = [];

  function playPcm(data: ArrayBuffer) {
    const int16 = new Int16Array(data);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;
    const buffer = playbackCtx.createBuffer(1, float32.length, 24000);
    buffer.copyToChannel(float32, 0);
    const source = playbackCtx.createBufferSource();
    source.buffer = buffer;
    source.connect(playbackCtx.destination);
    const startAt = Math.max(nextPlayTime, playbackCtx.currentTime);
    source.start(startAt);
    nextPlayTime = startAt + buffer.duration;
    activeSources.push(source);
    source.onended = () => activeSources.splice(activeSources.indexOf(source), 1);
  }

  function flushAudio() {
    activeSources.forEach(s => { try { s.stop(); } catch {} });
    activeSources.length = 0;
    nextPlayTime = 0;
  }

  // 3. Open WebSocket
  const ws = new WebSocket('wss://your-tie-host/v1/voice/agent', ['tie.bearer', token]);
  ws.binaryType = 'arraybuffer';

  ws.onopen = () => {
    ws.send(JSON.stringify({
      agent_id: 'my-app',
      // voice and language omitted — TIE uses the user's stored preference.
      // Pass voiceId explicitly to override: voice: 'Aoede'
      skill_instructions: `You are currently in the MyApp experience.\nKeep responses concise and conversational.\nCurrent time: ${now}\nUser timezone: ${tz}`,
    }));
  };

  ws.onmessage = (event) => {
    if (event.data instanceof ArrayBuffer) {
      playPcm(event.data);
      return;
    }
    const msg = JSON.parse(event.data);

    if (msg.type === 'session_ready') {
      // Start streaming mic audio
      workletNode.port.onmessage = (e: MessageEvent<ArrayBuffer>) => {
        if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
      };
    }
    if (msg.type === 'interrupted') flushAudio();
    if (msg.type === 'session_resumption') localStorage.setItem('tie-voice-handle', msg.handle);
    if (msg.type === 'tool_call') {
      // Execute and reply
      executeClientTool(msg.name, msg.args).then(result => {
        ws.send(JSON.stringify({ type: 'tool_result', call_id: msg.call_id, result }));
      });
    }
    if (msg.type === 'error') console.error('TIE error:', msg.code, msg.message);
  };

  // 4. Mute: disable track + signal end of audio stream
  function mute() {
    stream.getAudioTracks()[0].enabled = false;
    ws.send(JSON.stringify({ type: 'audio_stream_end' }));
  }

  // 5. Disconnect cleanly
  function disconnect() {
    ws.send(JSON.stringify({ type: 'control', action: 'disconnect' }));
    ws.close();
    stream.getTracks().forEach(t => t.stop());
    captureCtx.close();
    playbackCtx.close();
  }

  return { mute, disconnect };
}
```

### AudioWorklet (`pcm-processor.js`)
Place this file at `/public/worklets/pcm-processor.js`:

```js
class PcmProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0]?.[0];
    if (!input?.length) return true;
    const int16 = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      int16[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
    }
    this.port.postMessage(int16.buffer, [int16.buffer]);
    return true;
  }
}
registerProcessor('pcm-processor', PcmProcessor);
```

## Voice Preferences
Section titled “Voice Preferences”Store a user's preferred voice and language server-side so every session picks them up automatically without the client having to pass them each time.
### Get preferences

```http
GET /v1/voice/preferences
Authorization: Bearer <token>
```

```json
{
  "provider": "gemini",
  "voice": "Kore",
  "language": "en"
}
```

Returns stored preferences, or defaults (`provider: "gemini"`, `voice: "Kore"`, `language: "en"`) if none have been saved yet.
### Update preferences

```http
PUT /v1/voice/preferences
Authorization: Bearer <token>
Content-Type: application/json
```

```json
{
  "provider": "gemini",
  "voice": "Aoede",
  "language": "en"
}
```

| Field | Type | Description |
|---|---|---|
| `provider` | string | TTS provider. Currently only `"gemini"` is supported |
| `voice` | string | Voice name (see Voices) |
| `language` | string | BCP-47 language code (e.g. `"en"`, `"es"`, `"ms"`) |

Responds with the saved preferences.
Preferences apply to WebSocket sessions when `voice` and `language` are omitted (or `null`) in the session config. Pass them explicitly to override for a single session.
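A thin client wrapper over these two endpoints might look like the sketch below (the `makePrefsClient` helper is illustrative; `fetchImpl` is injectable so the wrapper can be tested without a network, and the base URL is a placeholder):

```typescript
type VoicePrefs = { provider: string; voice: string; language: string };

// Minimal preferences client; baseUrl is something like 'https://your-tie-host'.
function makePrefsClient(baseUrl: string, token: string, fetchImpl: typeof fetch = fetch) {
  const headers = { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' };
  return {
    // GET /v1/voice/preferences
    get: async (): Promise<VoicePrefs> => {
      const res = await fetchImpl(`${baseUrl}/v1/voice/preferences`, { headers });
      if (!res.ok) throw new Error(`prefs GET failed: ${res.status}`);
      return res.json();
    },
    // PUT /v1/voice/preferences
    update: async (prefs: VoicePrefs): Promise<VoicePrefs> => {
      const res = await fetchImpl(`${baseUrl}/v1/voice/preferences`, {
        method: 'PUT', headers, body: JSON.stringify(prefs),
      });
      if (!res.ok) throw new Error(`prefs PUT failed: ${res.status}`);
      return res.json();
    },
  };
}
```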
## Standalone TTS

Generate speech from text without opening a voice session.
### Endpoint

`POST /v1/voice/tts`

### Parameters

```json
{
  "text": "Great job today!",
  "voice": "Kore",
  "language": "en",
  "format": "pcm"
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `text` | string | — | Text to synthesize (max 4000 characters) |
| `voice` | string | `Kore` | Voice name (see Voices) |
| `language` | string | `en` | BCP-47 language code |
| `format` | string | `pcm` | Output container: `pcm` (raw bytes) or `wav` (with 44-byte header) |
### Example

```sh
# Raw PCM
curl -X POST https://your-tie-host/v1/voice/tts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"text": "Great job today!", "voice": "Kore"}' \
  --output response.pcm

# WAV — playable by any audio player without client-side transcoding
curl -X POST https://your-tie-host/v1/voice/tts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"text": "Great job today!", "voice": "Kore", "format": "wav"}' \
  --output response.wav
```

### Response

`pcm` (default): raw PCM — 24 kHz, signed 16-bit little-endian, mono. `Content-Type: audio/pcm;rate=24000`.

`wav`: same PCM wrapped in a standard 44-byte WAV header. `Content-Type: audio/wav`. Playable directly in browsers and mobile apps without transcoding.
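If you want to wrap a raw `pcm` response in a WAV container yourself, the 44-byte header for 16-bit mono audio can be built as below. This is a sketch of the standard RIFF/WAVE layout, not TIE's exact implementation:

```typescript
// Build a standard 44-byte WAV header for 16-bit mono PCM at the given rate.
function wavHeader(pcmByteLength: number, sampleRate = 24000): Uint8Array {
  const header = new ArrayBuffer(44);
  const v = new DataView(header);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) v.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, 'RIFF');
  v.setUint32(4, 36 + pcmByteLength, true); // RIFF chunk size
  writeStr(8, 'WAVE');
  writeStr(12, 'fmt ');
  v.setUint32(16, 16, true);                // fmt chunk size
  v.setUint16(20, 1, true);                 // audio format: PCM
  v.setUint16(22, 1, true);                 // channels: mono
  v.setUint32(24, sampleRate, true);
  v.setUint32(28, sampleRate * 2, true);    // byte rate (16-bit mono)
  v.setUint16(32, 2, true);                 // block align
  v.setUint16(34, 16, true);                // bits per sample
  writeStr(36, 'data');
  v.setUint32(40, pcmByteLength, true);     // PCM payload size
  return new Uint8Array(header);
}
```

Prepend the result to the PCM bytes and the file is playable anywhere WAV is.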
## Voices

| Voice | Character |
|---|---|
| `Kore` | Firm, grounded |
| `Puck` | Playful, upbeat |
| `Charon` | Calm, measured |
| `Fenrir` | Confident, direct |
| `Aoede` | Smooth, warm |
| `Leda` | Clear, friendly |
| `Orus` | Steady, authoritative |
| `Zephyr` | Light, energetic |
Full catalog: Gemini voice options