Digital Human Series (5): Transitioning from WebSocket + MainSource to WebRTC Video Streaming in a Real-Time Digital Human System Based on MuseTalk + Realtime API
1. Introduction: The Rise of Digital Human Technology and the Challenges of Lip Sync
With the rapid advancement of digital human technology, Lip Sync technology has reached a level where it can generate highly realistic virtual character videos, bringing digital human expressiveness to an unprecedented level. However, generating high-quality lip-synced videos is only the first step. The real challenge lies in delivering these videos to end users in real time while ensuring smooth and low-latency playback.
In the past, the WebSocket + MainSource solution was the mainstream choice for real-time video streaming. This approach maintained a persistent connection to push lip-synced video from the server to the client, where it was displayed in a front-end player. However, as user demands for real-time performance and smooth playback increased, the limitations of this approach became apparent—high latency, inefficient bandwidth usage, and synchronization difficulties, all of which significantly impacted user experience.
As a result, WebRTC (Web Real-Time Communication) technology emerged as a more efficient and stable alternative. Designed specifically for real-time audio and video communication, WebRTC enables low-latency, high-bandwidth efficiency in video transmission, making it especially suitable for streaming pre-generated lip-synced videos. With built-in audio-video synchronization mechanisms and automatic bandwidth management, WebRTC significantly improves the quality and stability of video streaming.
This article will delve into the transition from WebSocket + MainSource to the WebRTC video streaming solution, exploring how this upgrade brings a transformative improvement to real-time video streaming in digital human systems and analyzing its advantages and value in practical applications.
2. The Merits and Limitations of the WebSocket + MainSource Solution
2.1 How WebSocket + MainSource Works
In the early days of real-time audio and video transmission, WebSocket + MainSource was the undisputed “workhorse”. Its working principle is straightforward and effective:
- WebSocket: As a full-duplex communication protocol, WebSocket establishes a persistent connection between the client and server, allowing real-time bidirectional data transmission. This persistent connection eliminates the overhead of frequent HTTP request setup and teardown, significantly improving data transfer efficiency.
- MainSource: This is the audio-video data stream received by the front-end. It is transmitted from the server to the client via WebSocket and then displayed in a player in real-time. The core role of MainSource is to integrate and deliver audio-video streams to the front end, ensuring that users can see and hear the digital human’s performance in real-time.
2.2 Limitations: When Technology Hits a Bottleneck
Although the WebSocket + MainSource solution performed well in its early days, its limitations became increasingly apparent as digital human technology advanced:
- Latency Issues: WebSocket was not designed for audio-video streaming and lacks dedicated encoding, decoding, and bandwidth control mechanisms. As a result, in high-frequency audio-video data transmission, latency and packet loss are common, leading to lip-sync mismatches. For example, in a virtual livestream scenario, users may notice a delay between the voice and mouth movements, negatively impacting the viewing experience.
- Low Bandwidth Efficiency: Audio and video streaming requires high bandwidth, but WebSocket does not optimize for bandwidth usage, leading to network congestion, video stuttering, and data loss. The issue becomes even more severe when transmitting high-resolution videos, as the increased bandwidth consumption exacerbates performance problems.
- Synchronization Challenges: Audio and video frames are typically transmitted separately, requiring manual synchronization logic on the developer’s end. This process is complex and error-prone. Developers must precisely manage timestamps for both audio and video frames to ensure proper playback alignment. The additional development workload not only increases project complexity but also introduces potential synchronization errors.
- Poor Network Adaptability: In complex network environments (e.g., behind NATs or firewalls), WebSocket lacks robust penetration mechanisms, making connections unstable. For instance, in corporate intranets or public Wi-Fi networks, WebSocket connections may fail to establish, directly affecting audio-video stream transmission.
In real-world applications, these issues become especially problematic when transmitting high-frequency video streams under poor network conditions, severely degrading the user experience.
3. The Rise of WebRTC—A New Benchmark for Real-Time Audio-Video Transmission
3.1 Why Choose WebRTC?
As real-time audio-video generation technology evolves, WebRTC (Web Real-Time Communication) has gradually replaced the WebSocket + MainSource solution, thanks to its dedicated design for audio-video communication. Here are the key advantages of WebRTC:
- Low Latency & Real-Time Performance
WebRTC is built with optimized encoding, bandwidth adaptation, and stream control mechanisms, significantly reducing transmission latency. This makes it ideal for real-time interactive scenarios. For example, in live interactions, WebRTC ensures seamless, real-time audio-video transmission, delivering a smooth user experience.
- Automatic Bandwidth Management
WebRTC dynamically adjusts audio-video quality based on network conditions. For instance, in poor network conditions, it automatically lowers video resolution to prioritize smooth audio transmission. This adaptive mechanism not only improves bandwidth efficiency but also enhances streaming stability.
- Built-In Audio-Video Synchronization
WebRTC handles audio-video synchronization automatically using timestamp mechanisms, eliminating the need for manual implementation and significantly reducing development complexity. For example, in digital human applications, WebRTC automatically aligns audio and video frames, ensuring precise lip sync.
- Robust Network Adaptability
WebRTC supports STUN and TURN servers, enabling seamless NAT and firewall traversal and ensuring a stable connection across various network environments. For instance, in corporate intranets or public Wi-Fi networks, WebRTC can retrieve a device's public IP via STUN servers and relay traffic through TURN servers, guaranteeing stable audio-video streaming.
3.2 WebRTC’s Architecture: A Three-Layer Structure

3.2.1 The overall architecture of WebRTC
WebRTC is divided into three layers from top to bottom:
- WebAPI Layer (Top Layer): This is the interface exposed to developers for building WebRTC applications. It consists of JavaScript APIs, allowing developers to easily integrate real-time audio and video communication into their web applications.
- Core WebRTC Layer (Middle Layer – The Most Critical Part): This layer contains the three fundamental modules that power WebRTC: the Audio Engine (VoiceEngine), the Video Engine (VideoEngine), and Network Transport (Transport).
- Hardware & System Layer (Bottom Layer): Developed independently by different vendors, this layer handles audio-video capture and network I/O for seamless integration with hardware.
3.2.2 Core Components of WebRTC
1) The Voice Engine, responsible for audio communication in WebRTC, provides a complete audio processing framework that allows audio to be captured from an external device such as a microphone and then transmitted over the network. It is mainly divided into two modules: audio codecs and speech signal processing. The key technologies are Acoustic Echo Cancellation (AEC) and Noise Reduction (NR). Echo cancellation eliminates unwanted echoes or prevents them from occurring, producing clearer sound. Noise reduction removes background noise, improving speech clarity. iSAC and iLBC are used as the main audio codecs; iLBC (internet Low Bitrate Codec) is a narrowband codec well suited to voice communication over IP.
2) The Video Engine handles WebRTC's video processing and communication, from camera input to network transmission and display. It consists of video codecs and image processing. For video coding, VP8 is used as the default codec, as it is well suited to real-time communication. For image processing, two techniques ensure high-quality, visually pleasing images: a video jitter buffer reduces visual distortion caused by network jitter and packet loss, and image enhancement adjusts color balance, reduces noise, and sharpens frames for better visual quality.
3) Transport is responsible for secure and efficient transmission of audio-video data. It features:
- Secure Encryption: Uses SRTP (Secure Real-time Transport Protocol) for encrypted transmission, preventing unauthorized access.
- Firewall & NAT Traversal: Implements ICE (Interactive Connectivity Establishment), integrating STUN and TURN servers to overcome NAT and firewall restrictions.
3.3 Key Concepts of WebRTC Video Streaming
To better understand WebRTC’s real-time communication capabilities, let’s explore some of its core concepts and components:
1) RTCPeerConnection (Peer-to-Peer Connection)
RTCPeerConnection is used to establish a peer-to-peer, real-time communication connection. It enables the transmission of audio, video, and data streams between browsers. The two endpoints of a WebRTC communication are called peers. Peer-to-peer communication means the two clients connect directly, so data transmission does not need to pass through an intermediate server. A successfully established connection is called a PeerConnection, and a single WebRTC session can include multiple PeerConnections.
2) ICE (Interactive Connectivity Establishment)
ICE is not a protocol but a framework that integrates STUN and TURN to help WebRTC traverse NATs and firewalls. It gathers network candidates, including IP addresses, ports, and protocols, and prioritizes the best possible connection path.
3) STUN (Session Traversal Utilities for NAT)
STUN helps devices discover their public IP addresses and ports, allowing them to communicate even when behind a NAT (Network Address Translator). STUN is mainly used to discover and share network information; it does not relay data, it only assists in connection setup.
4) TURN (Traversal Using Relays around NAT)
TURN is used when a direct connection fails, acting as a relay server that forwards WebRTC media streams. TURN is designed for use in complex NAT environments and ensures that data can still be transferred even when all direct connections are unavailable.
5) Stream (MediaStream)
A MediaStream is a collection of audio, video, or other media tracks that WebRTC transmits. For example, in a video call, a MediaStream might contain both an audio track (microphone) and a video track (camera).
6) Track (MediaStreamTrack)
A MediaStreamTrack represents an individual audio or video source within a MediaStream. Each MediaStream consists of one or more tracks: an audio track carries the audio source, while a video track carries the video source.
7) Channel (RTP Channel)
In WebRTC, a channel usually refers to a Real-time Transport Protocol (RTP) channel, which is used to transfer audio and video data in real time. Each MediaStreamTrack can be transmitted over its own RTP channel.
8) Source (Media Source)
In WebRTC, MediaStreamTrack objects are often referred to as "tracks," and a MediaStreamTrack's data source is a "source." A MediaStreamTrack can be an audio track (AudioTrack) or a video track (VideoTrack), each representing an input source for audio or video, respectively.
9) Sink (Media Receiver)
In WebRTC, a sink is the endpoint that receives and consumes a media stream, such as the element rendering local or remote audio/video. On the consumer side of a MediaStream, each MediaStreamTrack is connected to a sink that receives the media data flowing from its source.
3.4 WebRTC Connection Establishment Process

WebRTC relies on SDP (Session Description Protocol) and ICE (Interactive Connectivity Establishment) to set up a peer-to-peer (P2P) media connection. Here’s how it works step by step:
1) Sending an SDP Offer
- When a WebRTC peer (e.g., Peer A) wants to initiate communication, it generates an SDP Offer containing details about the media session, including media types, codecs, and network parameters.
- Peer A sends this SDP Offer to a signaling server, which then forwards it to the other peer (Peer B).
2) Receiving the SDP Offer & Sending an SDP Answer
- Peer B, upon receiving the SDP Offer, parses its contents and generates an SDP Answer, which contains Peer B’s response information to the session.
- Peer B sends the SDP Answer back to the signaling server, which relays it to Peer A.
3) ICE Candidate Discovery
- After the exchange of the SDP Offer and SDP Answer, each peer starts to generate ICE Candidates. ICE Candidates describe possible network paths for traversing NATs and firewalls.
- Each peer sends its ICE Candidates to the other through the signaling server.
- The signaling server forwards each peer's ICE Candidates to the peer on the other side.
4) Connectivity Testing & Connection Establishment
- Once both parties have exchanged ICE Candidates, each peer performs connectivity checks to test the validity of these candidates.
- Based on the test results, both parties select the best candidate pair to establish the peer-to-peer connection.
- Once the connection is established, media streams (e.g., audio and video) can flow directly between the peers.
4. From WebSocket to WebRTC: A Practical Technical Upgrade
4.1 Frontend Implementation: Establishing a WebRTC Connection
On the frontend, WebSocket is used to establish a signaling channel with the backend, enabling the exchange of WebRTC offers, answers, and ICE candidates. The RTCPeerConnection API is then utilized to set up a peer-to-peer connection. MediaStream is used to render the video stream while ensuring audio-video synchronization. In digital human applications, RTCPeerConnection is used to receive MuseTalk-generated lip-synced video streams and display them in the player. This low-latency connection ensures that the digital human’s lip movements match the audio in real-time, significantly enhancing user experience.
4.2 Backend Implementation: Video Merging & Transmission
On the backend, WebSocket handles requests from the frontend and generates a WebRTC offer, which is then sent to the client. Meanwhile, the VideoStreamMerger component is used to merge multiple video streams. The merged stream is then transmitted to the frontend via RTCPeerConnection. In digital human use cases, the backend can combine multiple camera feeds or pre-synchronized lip-sync video streams before transmitting them via WebRTC. This method optimizes video processing efficiency while ensuring synchronized multi-stream playback, supporting complex digital human applications.
4.3 Video Stream Management: Dynamic Loading & Continuous Playback
To ensure continuous playback of the video stream, the backend dynamically loads video files from a specified folder. Once a video finishes playing, the next video is automatically loaded and played, eliminating the need for manual intervention. Applied to digital humans, this mechanism is ideal for scenarios requiring continuous video playback, such as virtual customer service or AI-driven education assistants. The backend can dynamically load pre-recorded responses or instructional videos and stream them in real time via WebRTC. This ensures a seamless user experience in which the digital human can engage in smooth, uninterrupted conversations.
5. Practical Benefits of the Technical Upgrade
1) Lower Latency: WebRTC's built-in optimizations ensure real-time audio and video transmission, significantly reducing latency. In digital human applications, this low-latency transmission ensures that the virtual character's lip movements are precisely synchronized with the audio, eliminating perceptible time lags and providing a more realistic interactive experience.
2) Higher Bandwidth Efficiency: WebRTC dynamically adjusts audio-video quality based on network conditions, preventing bandwidth waste and congestion. In digital human scenarios, this ensures smooth video streaming, even under poor network conditions, enhancing user experience stability.
3) Automatic Audio-Video Synchronization: WebRTC's native synchronization mechanism eliminates audio-video desynchronization issues, removing the need for complex manual synchronization logic. In digital human applications, this ensures frame-accurate lip-syncing, significantly improving synchronization accuracy and naturalness.
4) Stronger Network Adaptability: WebRTC leverages STUN and TURN servers to traverse NAT and firewalls, maintaining stable connections in diverse network environments. In digital human applications, this guarantees reliable video streaming, even in corporate intranets, public Wi-Fi, and other challenging network conditions, expanding the applicability of digital human technology.
6. Conclusion: Technology Drives Experience Enhancement
The transition from WebSocket + MainSource to WebRTC video streaming marks a major milestone in the evolution of digital human technology. This technical upgrade not only resolves key challenges in audio-video transmission but also lays a solid foundation for future advancements. Innovation in 5G, AI, and edge computing will continue to push digital human technology forward, unlocking new possibilities across industries. Our mission remains unchanged: to deliver a more natural, immersive digital human experience. With continuous technological breakthroughs, digital humans will redefine interactive experiences, bringing seamless and intelligent engagement to users worldwide.
Thanks for following along with the real-time digital human series! A video version of this blog is also available below. If you're interested in the implementation details of MuseTalk or the OpenAI Realtime API, or have any questions, feel free to leave a comment.
You're also welcome to explore my YouTube channel https://www.youtube.com/@frankfu007 for more exciting content. If you enjoy my videos, don't forget to like and subscribe for more insights!
Interested in more applications of the Realtime API in practical projects? Explore my articles linked below!
- Education Nano 01 – Modular Wheel-Leg Robot for STEM
- Audio-Visual Synchronization Algorithms in Digital Humans and the TIME_WAIT Challenge in WebSocket Communication
- Building a Voice-Controlled Robot Using OpenAI Realtime API: A Full-Link Implementation from RDK X5 to ES02
- Desktop Balancing Bot(ES02)-Dual-Wheel Legged Robot with High-Performance Algorithm
- Wheeled-Legged Robot ES01: Building with ESP32 & SimpleFOC