Building Real-time Voice Conversations with ElevenLabs WebSocket API: A Complete Development Guide

| Event Type | When Received | Required Handling | Optional Operations |
| --- | --- | --- | --- |
| conversation_initiation_metadata | After the connection is established | Save the conversation_id, start recording | Display session information |
| user_transcript | After the user speaks | Display what the user said | |
| agent_response | After the agent generates a response | Display the agent's text response | |
| agent_response_correction | When the agent corrects a response | Display the correction | |
| audio | After agent audio synthesis | Decode and play the audio | Display playback status |
| interruption | When an interruption is detected | Stop playback, clear the audio queue | Display an interruption prompt |
| ping | On server heartbeat checks | Immediately send pong | |
| client_tool_call | When the agent needs to call a tool | Execute the tool and return the result | Display tool call information |
| vad_score | During voice activity detection | | Visualize voice activity |
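
To make the event handling in the table above concrete, here is a minimal sketch of a browser-side message handler that dispatches on the event type. The event type names come from the table; the nested payload field names (for example audio_event.audio_base_64 and ping_event.event_id) are assumptions that should be verified against the current ElevenLabs WebSocket reference.

```typescript
// Minimal sketch of a browser-side handler for ElevenLabs Agents WebSocket events.
// The event type names come from the table above; the nested payload field names
// (e.g. audio_event.audio_base_64, ping_event.event_id) are assumptions to verify.

type ServerEvent = { type: string; [key: string]: any };

function handleServerEvent(ws: WebSocket, event: ServerEvent): void {
  switch (event.type) {
    case "conversation_initiation_metadata":
      // Required: persist the conversation id and start capturing microphone audio.
      console.log("conversation_id:", event.conversation_initiation_metadata_event?.conversation_id);
      break;
    case "user_transcript":
      console.log("User said:", event.user_transcription_event?.user_transcript);
      break;
    case "agent_response":
      console.log("Agent replied:", event.agent_response_event?.agent_response);
      break;
    case "audio":
      // Required: decode the base64 chunk and enqueue it for playback.
      playAudioChunk(event.audio_event?.audio_base_64);
      break;
    case "interruption":
      // Required: stop playback and clear any queued audio immediately.
      stopPlaybackAndClearQueue();
      break;
    case "ping":
      // Required: answer every ping right away so the server keeps the session alive.
      ws.send(JSON.stringify({ type: "pong", event_id: event.ping_event?.event_id }));
      break;
    case "client_tool_call":
      // Covered by the tool-calling sketch later in this guide.
      break;
    case "vad_score":
      // Optional: drive a voice-activity visualization.
      break;
  }
}

// Hypothetical playback helpers; a real client would implement them with the Web Audio API.
declare function playAudioChunk(base64Audio: string | undefined): void;
declare function stopPlaybackAndClearQueue(): void;
```

In practice this handler would be attached with `ws.onmessage = (msg) => handleServerEvent(ws, JSON.parse(msg.data));`.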
| Message Type | Send Timing | Frequency |
| --- | --- | --- |
| conversation_initiation_client_data | Immediately after the connection is established | Once |
| user_audio_chunk | Continuously during recording | High frequency (approximately every 250 ms) |
| user_message | When the user inputs text | On demand |
| user_activity | When user activity needs to be signaled | On demand |
| pong | Immediately upon receiving a ping | On demand |
| client_tool_result | After tool execution completes | On demand |
| contextual_update | When the conversation context needs updating | On demand |
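
The send side can be sketched the same way: open the socket, send conversation_initiation_client_data once, then stream base64-encoded user_audio_chunk messages roughly every 250 ms. The message type names come from the table above; the exact user_audio_chunk payload shape and the capture helper are illustrative assumptions.

```typescript
// Minimal sketch of the client-to-server message flow: send the initiation data once,
// then stream base64-encoded microphone audio roughly every 250 ms.
// The user_audio_chunk payload shape and the capture helper are illustrative assumptions.

function startConversation(ws: WebSocket): void {
  ws.onopen = () => {
    // Sent exactly once, immediately after the connection is established.
    // Optional overrides (prompt, language, voice) would be added to this object.
    ws.send(JSON.stringify({ type: "conversation_initiation_client_data" }));

    // Stream captured audio continuously while recording.
    setInterval(() => {
      const chunk = getEncodedAudioChunk(); // hypothetical: base64 PCM from the microphone buffer
      if (chunk) {
        ws.send(JSON.stringify({ user_audio_chunk: chunk }));
      }
    }, 250);
  };
}

// Hypothetical capture helper; a real client would buffer AudioWorklet or MediaRecorder output here.
declare function getEncodedAudioChunk(): string | null;
```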
| Comparison Item | ElevenLabs Agents Platform | OpenAI Realtime API |
| --- | --- | --- |
| Multimodal Support | ❌ Not supported (no camera recognition / image input) | ✅ Supported (GPT-4o) |
| Voice Selection | ✅ 100+ preset voices, supports voice cloning | ⚠️ 10 preset voices |
| LLM Models | ✅ Multi-model support (ElevenLabs, OpenAI, Google, Anthropic) | ✅ GPT-4o, GPT-4o-mini |
| Knowledge Base | ✅ Supported | ✅ Supported (via the Assistants API) |
| Function Calling | ✅ Supported | ✅ Supported |
| Text Interruption of AI Responses | ✅ Supported (sending a text message can interrupt the AI's ongoing response) | ❌ Not supported |
| Latency | ✅ Depends on model (163 ms–3.87 s) | ✅ Low (300–800 ms) |
| Pricing | 💰 Per-minute billing ($0.0033–$0.1956/minute depending on model) | 💰 Per-token billing (GPT-4o-mini is more economical) |
**Multimodal Support**

| Platform | Support Status | Detailed Information | Reference Links |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | ❌ Currently not supported | Focuses on voice conversation; does not support visual input (camera/image recognition) | ElevenLabs Agents Platform WebSocket API Documentation |
| OpenAI Realtime API | ✅ Supported (via GPT-4o) | Supports visual input; can process images and video frames, including real-time camera recognition. The GPT-4o model natively supports multimodal input | OpenAI Realtime API Documentation; OpenAI GPT-4o Vision Capabilities |
**Voice Selection**

| Platform | Voice Count | Voice Characteristics | Customization Capability | Reference Links |
| --- | --- | --- | --- | --- |
| ElevenLabs Agents Platform | 100+ preset voices | High quality, multilingual, supports emotional expression and voice cloning | Supports custom voice IDs, emotion control, tone adjustment, voice cloning | ElevenLabs Voice Library; ElevenLabs Voice Cloning |
| OpenAI Realtime API | ⚠️ Limited selection (10 voices) | Mainly relies on the TTS API, which provides 10 preset voices (alloy, echo, fable, onyx, nova, shimmer, …) | Limited voice control; does not support voice cloning | OpenAI TTS Documentation; OpenAI TTS Voice List |
**LLM Models**

| Platform | Supported Models | Model Characteristics | Reference Links |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | Multi-model support | Supports ElevenLabs' proprietary models and multiple third-party models (OpenAI, Google, Anthropic, etc.); users can choose based on their needs, and custom LLM parameters are supported | ElevenLabs Agents Documentation; ElevenLabs LLM Configuration |
| OpenAI Realtime API | GPT-4o, GPT-4o-mini | Supports GPT-4o (multimodal, more capable) and GPT-4o-mini (lightweight, faster, lower cost); models can be switched | OpenAI Realtime API Models; OpenAI Model Comparison |
**Knowledge Base**

| Platform | Knowledge Base Support | Implementation Method | Reference Links |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | Supported | Knowledge base integration is configured on the Agent: documents can be uploaded and a knowledge base set up, and the Agent can reference its content in conversations | ElevenLabs Agents Documentation; ElevenLabs Agent Configuration |
| OpenAI Realtime API | Supported (via the Assistants API or function calling) | A knowledge base can be integrated through the Assistants API (file upload, vector storage), or external data sources and APIs can be accessed through function calling | OpenAI Assistants API; OpenAI Function Calling |
**Function Calling**

| Platform | Support Status | Implementation Method | Reference Links |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | Supported | Implements tool calling through the client_tool_call and client_tool_result message types; tools are defined on the Agent (see the sketch after this table) | ElevenLabs WebSocket API – Tool Calling; ElevenLabs Agent Tool Configuration |
| OpenAI Realtime API | Supported | Implements function calling through tool_calls and tool_results events; tools are defined on the session | OpenAI Realtime API – Function Calling; OpenAI Function Calling Guide |
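
Below is a hedged sketch of the ElevenLabs side of this flow, assuming the client_tool_call payload carries a tool name, a tool_call_id, and a parameters object; those field names, and the local tool registry, are assumptions to verify against the current API reference.

```typescript
// Sketch: executing a tool requested by the agent and returning the result.
// The payload field names (tool_name, tool_call_id, parameters) are assumptions.

interface ClientToolCall {
  tool_name: string;
  tool_call_id: string;
  parameters: Record<string, unknown>;
}

// Hypothetical local tool registry keyed by tool name.
const tools: Record<string, (params: Record<string, unknown>) => Promise<string>> = {
  get_local_time: async () => new Date().toISOString(),
};

async function handleClientToolCall(ws: WebSocket, call: ClientToolCall): Promise<void> {
  const tool = tools[call.tool_name];
  let result: string;
  let isError = false;
  try {
    result = tool ? await tool(call.parameters) : `Unknown tool: ${call.tool_name}`;
    isError = !tool;
  } catch (err) {
    result = String(err);
    isError = true;
  }
  // Report the outcome back so the agent can continue the conversation.
  ws.send(JSON.stringify({
    type: "client_tool_result",
    tool_call_id: call.tool_call_id,
    result,
    is_error: isError,
  }));
}
```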
**Text Interruption of AI Responses**

| Platform | Support Status | Detailed Information | Reference Links |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | Supported | Sending a text message (user_message) can interrupt the AI's ongoing voice response, enabling more natural conversational interaction (see the sketch after this table) | ElevenLabs WebSocket API – User Message |
| OpenAI Realtime API | Not supported | Sending a text message cannot interrupt the AI's ongoing response; the client must wait for the current response to complete | OpenAI Realtime API Documentation |
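
A minimal sketch of interrupting the agent with text on the ElevenLabs side: send a user_message while audio is still streaming, then rely on the server's interruption event to stop local playback. The exact user_message payload shape (a type plus a text field) is an assumption.

```typescript
// Sketch: interrupting the agent's in-progress voice response with a text message.
// The exact user_message payload shape is an assumption; check the API reference.

function interruptWithText(ws: WebSocket, text: string): void {
  // Sending a text message while the agent is speaking triggers an interruption;
  // the server then emits an "interruption" event, at which point the client
  // should stop playback and clear its audio queue (see the event table above).
  ws.send(JSON.stringify({ type: "user_message", text }));
}
```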
**Latency**

| Platform | Latency Performance | Optimization Features | Reference Links |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | Depends on model selection | Latency ranges from 163 ms to 3.87 s depending on the selected LLM model. Low-latency models such as Qwen3-30B-A3B (~163 ms) suit real-time interaction; higher-capability models such as GPT-5 (~1.14 s) or Claude Sonnet (~1.5 s) add latency but are more capable. Supports streaming responses | ElevenLabs Agents Platform Documentation; ElevenLabs WebSocket API |
| OpenAI Realtime API | Low latency | Real-time streaming responses with latency typically 300–800 ms (depending on model and network); GPT-4o-mini is usually faster | OpenAI Realtime API Documentation; OpenAI Performance Optimization |
**Pricing**

| Platform | Billing Method | Price Details | Reference Links |
| --- | --- | --- | --- |
| ElevenLabs Agents Platform | 💰 Billed per conversation minute (based on the selected model) | The price depends on the selected LLM model and typically bundles voice synthesis, speech recognition, and LLM usage into a single fee. For per-model prices, see the "Supported LLM Models" section above | ElevenLabs Pricing Page; ElevenLabs Billing Instructions |
| OpenAI Realtime API | 💰 Billed per token plus audio duration | GPT-4o: input $2.50/1M tokens, output $10/1M tokens; GPT-4o-mini: input $0.15/1M tokens, output $0.60/1M tokens; audio input/output: $0.015/minute (prices may change over time) | OpenAI Pricing Page; OpenAI Realtime API Pricing |
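
As a rough sanity check on these numbers, the sketch below estimates the cost of a hypothetical 10-minute conversation. The rates come from the table above and may be outdated; the token volumes, the mid-range ElevenLabs per-minute rate, and the assumption that the $0.015/minute audio rate applies to each direction are all illustrative guesses.

```typescript
// Back-of-envelope cost comparison for a hypothetical 10-minute conversation.
// Rates come from the table above and may be outdated; token volumes and the
// ElevenLabs per-minute rate are illustrative guesses, not measurements.

const minutes = 10;

// ElevenLabs Agents Platform: bundled per-minute billing ($0.0033–$0.1956/min depending on model).
const elevenLabsRatePerMinute = 0.08; // illustrative mid-range assumption
const elevenLabsCost = minutes * elevenLabsRatePerMinute;

// OpenAI Realtime API with GPT-4o-mini: text tokens plus audio duration.
const inputTokensPerMinute = 500;  // illustrative assumption
const outputTokensPerMinute = 300; // illustrative assumption
const openAiCost =
  minutes * inputTokensPerMinute * (0.15 / 1_000_000) +  // text input tokens
  minutes * outputTokensPerMinute * (0.60 / 1_000_000) + // text output tokens
  minutes * 2 * 0.015;                                   // audio, assuming $0.015/min each direction

console.log(`ElevenLabs (illustrative): $${elevenLabsCost.toFixed(2)}`);  // ≈ $0.80
console.log(`OpenAI Realtime (illustrative): $${openAiCost.toFixed(2)}`); // ≈ $0.30
```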
