
Real-Time Voice (Gemini Live)

TIE provides real-time voice-to-voice conversation powered by Gemini Live. The client opens a WebSocket to TIE, sends raw PCM audio, and receives audio back — TIE handles auth, persona injection, memory context, tool routing, and identity guardrails transparently.

Per-user voice and language preferences are stored server-side and applied automatically to each session. A separate HTTP endpoint provides standalone text-to-speech.


Open a persistent bidirectional audio stream between the client and Gemini Live.

WebSocket /v1/voice/agent

Auth via the WebSocket subprotocol header (browsers can't send custom headers on the handshake, so the bearer token is smuggled through Sec-WebSocket-Protocol):

new WebSocket("wss://your-tie-host/v1/voice/agent", ["tie.bearer", "<bearer-token>"])

The server reads the second subprotocol value as the token and accepts tie.bearer to complete the handshake. This keeps the token out of URL paths and proxy access logs.

sequenceDiagram
    participant Client
    participant TIE
    participant Gemini

    Client->>TIE: WebSocket connect (Sec-WebSocket-Protocol: tie.bearer, token)
    Client->>TIE: JSON session config (first message)
    TIE->>TIE: Load persona + memory + apply guardrails
    TIE->>Gemini: Open Live session
    TIE-->>Client: {"type": "session_ready", "session_id": "...", "thread_id": "..."}

    loop Bidirectional relay
        Client->>TIE: Binary frame (PCM audio chunk)
        TIE->>Gemini: Audio
        Gemini-->>TIE: Audio + transcripts + tool calls
        TIE-->>Client: Binary frame (audio) + JSON events
    end

    Client->>TIE: {"type": "control", "action": "disconnect"}
    TIE->>TIE: Flush transcripts → thread + memory pipeline

Send this as the first JSON message after connecting:

{
  "voice": null,
  "language": null,
  "agent_id": "my-app",
  "persona_id": "your-persona-id-or-null",
  "thread_id": "existing-thread-id-or-null",
  "skill_instructions": "You are currently in the MyApp experience.\nCurrent time: Sunday, April 19, 2026 at 11:30 PM\nUser timezone: Asia/Kuala_Lumpur",
  "tools": [],
  "proactive_audio": false,
  "enable_affective_dialog": false,
  "resumption_handle": null
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `voice` | string \| null | null | Voice name override (see Voices). When null, TIE uses the user's stored preference |
| `language` | string \| null | null | Language code override. When null, TIE uses the user's stored preference |
| `agent_id` | string | chatbot | Scopes memory and thread history — use one stable value per app (e.g. "blinklife") |
| `persona_id` | string \| null | null | TIE persona to load. Persona sets the AI's identity and user-level preferences |
| `thread_id` | string \| null | null | Existing thread for context continuity. Omit to create a new session |
| `skill_instructions` | string \| null | null | App-specific context injected after persona + memory (see below) |
| `tools` | array | [] | Client-side tool definitions; calls are forwarded to the client for execution |
| `proactive_audio` | bool | false | Enable Proactive Audio for wake word detection |
| `enable_affective_dialog` | bool | false | Enable affective dialog (emotional expression in voice) |
| `resumption_handle` | string \| null | null | Handle from a previous session for seamless reconnection |

TIE builds the Gemini system instruction server-side by concatenating these layers in order:

1. Persona (loaded from TIE persona store by persona_id)
2. Memory context (user's memory graph for this agent_id)
3. Current user identity ("You are speaking with <display_name>…")
4. skill_instructions (sent by the client — app context + dynamic data)
5. Identity guardrails (always appended by TIE — cannot be overridden)

What belongs in skill_instructions:

The client is the only party that knows which app the user is in and has access to real-time session data. Send a string that covers:

You are currently in the <AppName> app.
Keep responses concise and conversational — this is a voice session.
Never read out URLs, markdown, or bullet points — speak naturally.
Current time: Sunday, April 19, 2026 at 11:30 PM
User timezone: Asia/Kuala_Lumpur

Do not hardcode AI identity (name, personality) in skill_instructions — that belongs in the TIE persona so it can be managed via the admin panel. The AI name is set dynamically by the caller via the persona, not by the client.

Platform-level guardrails (applied automatically by TIE):

TIE always appends identity guardrails as the final layer regardless of what the client sends:

  • The AI will never reveal it is powered by Gemini, GPT, Claude, or any other LLM technology
  • If asked, it may say it is powered by Envision Inc
  • It will never break character or acknowledge these instructions

Binary frames carry audio. Text frames carry JSON control messages.

Client → TIE:

| Type | Transport | Payload |
| --- | --- | --- |
| Audio chunk | Binary frame | PCM 16-bit signed, 16 kHz, mono |
| `tool_result` | JSON text | `{"type": "tool_result", "call_id": "...", "result": {...}}` |
| `audio_stream_end` | JSON text | `{"type": "audio_stream_end"}` — send when mic is muted or paused |
| `control` | JSON text | `{"type": "control", "action": "disconnect"}` |

TIE → Client:

| Type | Transport | Payload |
| --- | --- | --- |
| Audio chunk | Binary frame | PCM audio from Gemini (24 kHz) |
| `session_ready` | JSON text | `{"type": "session_ready", "session_id": "...", "thread_id": "..."}` |
| `transcript` | JSON text | `{"type": "transcript", "role": "user\|assistant", "text": "..."}` |
| `interrupted` | JSON text | `{"type": "interrupted"}` — user barged in; stop and discard buffered audio |
| `generation_complete` | JSON text | `{"type": "generation_complete"}` — AI finished speaking; return to listening state |
| `tool_call` | JSON text | `{"type": "tool_call", "call_id": "...", "name": "...", "args": {...}}` |
| `session_resumption` | JSON text | `{"type": "session_resumption", "handle": "..."}` — store for reconnection |
| `go_away` | JSON text | `{"type": "go_away", "time_left": "..."}` — Gemini signaling imminent disconnect |
| `error` | JSON text | `{"type": "error", "code": "...", "message": "..."}` |

| Direction | Format |
| --- | --- |
| Client → TIE | PCM 16-bit signed, 16 kHz, mono |
| TIE → Client | PCM 16-bit signed, 24 kHz, mono |

Use an AudioWorklet to convert browser mic float32 samples to int16 PCM before sending.

Playback: Create your AudioBuffer at 24 kHz — not at the AudioContext's native sample rate. Using the wrong rate causes chipmunk-pitched or slow playback.

Interrupt handling: When you receive interrupted, stop all scheduled AudioBufferSourceNodes by calling .stop() on each tracked node and reset your playback clock. Do not close and recreate the AudioContext — creating a new context outside a user gesture may leave it in a suspended state, silently breaking all subsequent audio playback.

TIE routes tool calls into two categories:

TIE-native tools — executed server-side, invisible to the client:

  • memory_search — searches the user's memory graph
  • memory_write — stores a new memory

Client-provided tools — defined in tools at session start, forwarded to the client for execution. When Gemini calls one, TIE sends a tool_call event. The client must reply within 30 seconds with a tool_result message or TIE returns a timeout error to Gemini.

// Client receives:
{"type": "tool_call", "call_id": "abc123", "name": "get_calendar", "args": {"date": "today"}}
// Client replies:
{"type": "tool_result", "call_id": "abc123", "result": {"events": [...]}}
| Limit | Value |
| --- | --- |
| Max session duration | `VOICE_SESSION_TIMEOUT` (default 900s) |
| Gemini native limit | 15 min (TIE enables context window compression to extend) |
| Tool call timeout | 30s per call |
| Audio send timeout | 5s (slow clients are disconnected) |

Gemini periodically resets WebSocket connections. TIE enables seamless reconnection:

  1. During a session, TIE forwards session_resumption events with a handle
  2. Store the latest handle in memory or sessionStorage
  3. On reconnect, pass resumption_handle in the session config
  4. Gemini restores conversation context from the handle

At session end, TIE writes buffered transcripts to the thread and runs the memory pipeline. Pass the thread_id from session_ready to subsequent chat requests to continue in the same context.
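Putting the reconnect steps and thread continuity together, a client might track the latest handle and thread and rebuild the session config on reconnect. The helper and interface names below are illustrative, not part of TIE; only the field names come from the session config documented above:

```typescript
// Illustrative shape for the subset of the session config used on reconnect.
interface ReconnectConfig {
  agent_id: string;
  thread_id: string | null;          // from the original session_ready event
  resumption_handle: string | null;  // from the last session_resumption event
  skill_instructions: string | null;
}

// Build the first JSON message for a reconnecting session.
function buildReconnectConfig(
  agentId: string,
  threadId: string | null,
  storedHandle: string | null,
  skillInstructions: string,
): ReconnectConfig {
  return {
    agent_id: agentId,
    // Reuse the thread so transcripts land in the same conversation
    thread_id: threadId,
    // Gemini restores in-session context from the handle when present
    resumption_handle: storedHandle,
    skill_instructions: skillInstructions,
  };
}

// Usage sketch:
// ws.onopen = () => ws.send(JSON.stringify(
//   buildReconnectConfig('my-app', savedThreadId, savedHandle, skillInstructions)
// ));
```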


A production-ready hook that handles the full WebSocket lifecycle, audio pipeline, session resumption, and tool call dispatch.

import { useTieVoiceSession } from '@/hooks/use-tie-voice-session';
const {
  voiceMode, // 'idle' | 'listening' | 'processing' | 'speaking'
  interimTranscript,
  error,
  startVoice,
  stopVoice,
  interrupt,
} = useTieVoiceSession({
  agentId: 'my-app',
  personaId: companion?.tiePersonaId,
  voiceId: companion?.voiceId, // Gemini voice from companion config — overrides stored preference
  skillInstructions: buildSkillInstructions(companion?.name, user?.displayName),
  getAccessToken,
  onToolCall: async (name, args) => {
    // Handle client-side tool calls from Gemini
    if (name === 'get_calendar') return { events: await fetchCalendar(args.date) };
    return { error: 'Unknown tool' };
  },
});

Building skillInstructions:

function buildSkillInstructions(companionName?: string, userName?: string): string {
  const tz = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const now = new Date().toLocaleString('en-US', {
    timeZone: tz,
    dateStyle: 'full',
    timeStyle: 'short',
  });
  return [
    // App context — replace with your app name
    'You are currently in the MyApp experience.\nKeep responses concise and conversational — this is a voice session.\nNever read out URLs, markdown, or bullet points — speak naturally.',
    // Dynamic companion/user context
    companionName ? `Your name is ${companionName}.` : '',
    userName ? `You are speaking with ${userName}. Address them by name when natural.` : '',
    // Always include timezone so the AI gives correct times
    `Current time: ${now}\nUser timezone: ${tz}`,
  ].filter(Boolean).join('\n\n');
}

Voice mode state machine:

idle ──────── startVoice() ─────────────────────▶ listening
listening ─── user speaks ──────────────────────▶ processing   (transcript received, waiting for AI)
processing ── AI responds ──────────────────────▶ speaking     (binary audio frames arriving)
speaking ──── generation_complete / interrupted ▶ listening
any state ─── stopVoice() ──────────────────────▶ idle
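The transitions above can be expressed as a pure transition function, which makes the UI state easy to unit-test. The event names paraphrase the JSON events documented earlier; the reducer itself is an illustrative sketch, not part of the hook's API:

```typescript
type VoiceMode = 'idle' | 'listening' | 'processing' | 'speaking';

type VoiceEvent =
  | 'start'               // startVoice() called
  | 'user_transcript'     // transcript event with role "user"
  | 'audio_frame'         // binary audio frame arrived
  | 'generation_complete' // AI finished speaking
  | 'interrupted'         // user barged in
  | 'stop';               // stopVoice() called

// Pure transition function mirroring the state machine diagram.
function nextVoiceMode(mode: VoiceMode, event: VoiceEvent): VoiceMode {
  if (event === 'stop') return 'idle'; // stopVoice() exits from any state
  switch (mode) {
    case 'idle':
      return event === 'start' ? 'listening' : mode;
    case 'listening':
      return event === 'user_transcript' ? 'processing' : mode;
    case 'processing':
      return event === 'audio_frame' ? 'speaking' : mode;
    case 'speaking':
      // Both normal completion and barge-in return to listening
      return event === 'generation_complete' || event === 'interrupted'
        ? 'listening'
        : mode;
    default:
      return mode;
  }
}
```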

UI state indicator:

function VoiceStateIndicator({ voiceMode }: { voiceMode: string }) {
  if (voiceMode === 'idle') return null;
  return (
    <div className={`voice-indicator ${voiceMode}`}>
      {voiceMode === 'listening' && 'Listening…'}
      {voiceMode === 'processing' && 'Thinking…'}
      {voiceMode === 'speaking' && 'Speaking…'}
    </div>
  );
}

For apps not using the hook, here is the minimal connection pattern:

async function startVoiceSession(token: string) {
  const tz = Intl.DateTimeFormat().resolvedOptions().timeZone;
  const now = new Date().toLocaleString('en-US', { timeZone: tz, dateStyle: 'full', timeStyle: 'short' });

  // 1. Set up capture at 16 kHz
  const captureCtx = new AudioContext({ sampleRate: 16000 });
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true },
  });
  const micSource = captureCtx.createMediaStreamSource(stream);
  await captureCtx.audioWorklet.addModule('/worklets/pcm-processor.js');
  const workletNode = new AudioWorkletNode(captureCtx, 'pcm-processor');
  micSource.connect(workletNode);

  // 2. Set up playback
  const playbackCtx = new AudioContext();
  let nextPlayTime = 0;
  const activeSources: AudioBufferSourceNode[] = [];

  function playPcm(data: ArrayBuffer) {
    const int16 = new Int16Array(data);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;
    // Always 24 kHz — the rate TIE sends, not the context's native rate
    const buffer = playbackCtx.createBuffer(1, float32.length, 24000);
    buffer.copyToChannel(float32, 0);
    const source = playbackCtx.createBufferSource();
    source.buffer = buffer;
    source.connect(playbackCtx.destination);
    const startAt = Math.max(nextPlayTime, playbackCtx.currentTime);
    source.start(startAt);
    nextPlayTime = startAt + buffer.duration;
    activeSources.push(source);
    source.onended = () => activeSources.splice(activeSources.indexOf(source), 1);
  }

  function flushAudio() {
    activeSources.forEach(s => { try { s.stop(); } catch {} });
    activeSources.length = 0;
    nextPlayTime = 0;
  }

  // 3. Open WebSocket
  const ws = new WebSocket('wss://your-tie-host/v1/voice/agent', ['tie.bearer', token]);
  ws.binaryType = 'arraybuffer';

  ws.onopen = () => {
    ws.send(JSON.stringify({
      agent_id: 'my-app',
      // voice and language omitted — TIE uses the user's stored preference.
      // Pass voice explicitly to override: voice: 'Aoede'
      skill_instructions: `You are currently in the MyApp experience.\nKeep responses concise and conversational.\nCurrent time: ${now}\nUser timezone: ${tz}`,
    }));
  };

  ws.onmessage = (event) => {
    if (event.data instanceof ArrayBuffer) {
      playPcm(event.data);
      return;
    }
    const msg = JSON.parse(event.data);
    if (msg.type === 'session_ready') {
      // Start streaming mic audio
      workletNode.port.onmessage = (e: MessageEvent<ArrayBuffer>) => {
        if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
      };
    }
    if (msg.type === 'interrupted') flushAudio();
    if (msg.type === 'session_resumption') sessionStorage.setItem('tie-voice-handle', msg.handle);
    if (msg.type === 'tool_call') {
      // Execute the client-side tool, then reply with its result
      executeClientTool(msg.name, msg.args).then(result => {
        ws.send(JSON.stringify({ type: 'tool_result', call_id: msg.call_id, result }));
      });
    }
    if (msg.type === 'error') console.error('TIE error:', msg.code, msg.message);
  };

  // 4. Mute: disable track + signal end of audio stream
  function mute() {
    stream.getAudioTracks()[0].enabled = false;
    ws.send(JSON.stringify({ type: 'audio_stream_end' }));
  }

  // 5. Disconnect cleanly
  function disconnect() {
    ws.send(JSON.stringify({ type: 'control', action: 'disconnect' }));
    ws.close();
    stream.getTracks().forEach(t => t.stop());
    captureCtx.close();
    playbackCtx.close();
  }

  return { mute, disconnect };
}

Place this file at /public/worklets/pcm-processor.js:

class PcmProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0]?.[0];
    if (!input?.length) return true;
    const int16 = new Int16Array(input.length);
    for (let i = 0; i < input.length; i++) {
      int16[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
    }
    this.port.postMessage(int16.buffer, [int16.buffer]);
    return true;
  }
}
registerProcessor('pcm-processor', PcmProcessor);

Store a user's preferred voice and language server-side so every session picks them up automatically without the client having to pass them each time.

GET /v1/voice/preferences
Authorization: Bearer <token>
{
  "provider": "gemini",
  "voice": "Kore",
  "language": "en"
}

Returns stored preferences, or defaults (provider: "gemini", voice: "Kore", language: "en") if none have been saved yet.

PUT /v1/voice/preferences
Authorization: Bearer <token>
Content-Type: application/json
{
  "provider": "gemini",
  "voice": "Aoede",
  "language": "en"
}
| Field | Type | Description |
| --- | --- | --- |
| `provider` | string | TTS provider. Currently only "gemini" is supported |
| `voice` | string | Voice name (see Voices) |
| `language` | string | BCP-47 language code (e.g. "en", "es", "ms") |

Responds with the saved preferences.

Preferences apply to WebSocket sessions when voice and language are omitted (or null) in the session config. Pass them explicitly in the session config to override for a single session.
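As a sketch, reading and updating preferences from a browser or Node 18+ client looks like this. The function names are illustrative; only the endpoints, methods, and payload fields come from this page:

```typescript
interface VoicePreferences {
  provider: string; // currently always "gemini"
  voice: string;
  language: string;
}

// Assemble a valid PUT body; provider is pinned to the only supported value.
function buildPreferences(voice: string, language: string): VoicePreferences {
  return { provider: 'gemini', voice, language };
}

// GET /v1/voice/preferences — returns stored preferences or defaults.
async function getVoicePreferences(host: string, token: string): Promise<VoicePreferences> {
  const res = await fetch(`${host}/v1/voice/preferences`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`preferences fetch failed: ${res.status}`);
  return res.json();
}

// PUT /v1/voice/preferences — saves and echoes back the preferences.
async function setVoicePreferences(
  host: string,
  token: string,
  prefs: VoicePreferences,
): Promise<VoicePreferences> {
  const res = await fetch(`${host}/v1/voice/preferences`, {
    method: 'PUT',
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(prefs),
  });
  if (!res.ok) throw new Error(`preferences update failed: ${res.status}`);
  return res.json();
}

// Usage: await setVoicePreferences(host, token, buildPreferences('Aoede', 'en'));
```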


Generate speech from text without opening a voice session.

POST /v1/voice/tts
{
  "text": "Great job today!",
  "voice": "Kore",
  "language": "en",
  "format": "pcm"
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | string | (required) | Text to synthesize (max 4000 characters) |
| `voice` | string | Kore | Voice name (see Voices) |
| `language` | string | en | BCP-47 language code |
| `format` | string | pcm | Output container: pcm (raw bytes) or wav (with 44-byte header) |
# Raw PCM
curl -X POST https://your-tie-host/v1/voice/tts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"text": "Great job today!", "voice": "Kore"}' \
  --output response.pcm

# WAV — playable by any audio player without client-side transcoding
curl -X POST https://your-tie-host/v1/voice/tts \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"text": "Great job today!", "voice": "Kore", "format": "wav"}' \
  --output response.wav

pcm (default): raw PCM — 24 kHz, signed 16-bit little-endian, mono. Content-Type: audio/pcm;rate=24000.

wav: same PCM wrapped in a standard 44-byte WAV header. Content-Type: audio/wav. Playable directly in browsers and mobile apps without transcoding.
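If you fetched the raw pcm format and later need a playable file, the 44-byte header can be prepended client-side. A sketch assuming the documented 24 kHz mono 16-bit output (`pcmToWav` is an illustrative helper, not part of TIE):

```typescript
// Wrap raw 16-bit mono little-endian PCM in a minimal 44-byte WAV header.
function pcmToWav(pcm: Uint8Array, sampleRate = 24000): Uint8Array {
  const numChannels = 1;
  const bitsPerSample = 16;
  const byteRate = sampleRate * numChannels * (bitsPerSample / 8);
  const blockAlign = numChannels * (bitsPerSample / 8);
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeStr = (offset: number, s: string) => {
    for (let i = 0; i < s.length; i++) out[offset + i] = s.charCodeAt(i);
  };
  writeStr(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // RIFF chunk size
  writeStr(8, 'WAVE');
  writeStr(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format: PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeStr(36, 'data');
  view.setUint32(40, pcm.length, true);     // data chunk size
  out.set(pcm, 44);
  return out;
}

// Usage: const wavBytes = pcmToWav(new Uint8Array(await res.arrayBuffer()));
```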


| Voice | Character |
| --- | --- |
| Kore | Firm, grounded |
| Puck | Playful, upbeat |
| Charon | Calm, measured |
| Fenrir | Confident, direct |
| Aoede | Smooth, warm |
| Leda | Clear, friendly |
| Orus | Steady, authoritative |
| Zephyr | Light, energetic |

Full catalog: Gemini voice options