Audio-Visual Synchronization Algorithms in Digital Humans and the TIME_WAIT Challenge in WebSocket Communication

1. Introduction: The Dual Challenge of Building a “Real-Time” System
In the current wave of AI applications, digital human systems are rapidly evolving from “content generation tools” into “interactive intelligent agents.” Whether it’s virtual hosts, AI tour guides, digital customer service representatives, or companion robots, expectations have shifted from passive output to delivering interactive experiences marked by realism, continuity, and stability.
Achieving this sense of “real-time presence” goes far beyond voice synthesis, animation pipelines, or model capability. From a system engineering perspective, a successful digital human platform must resolve two core challenges:
▪️First, how to speak with precision—that is, accurate synchronization between audio and video. When the digital human speaks, its lip movements must stay aligned with the rhythm of the audio. Delays, frame mismatches, or silent visuals can all break immersion. Only when speech and facial expression are coordinated can users truly feel, “she’s really talking to me.”
▪️Second, how to stay reliably alive—that is, maintain stable and dependable connections. In high-concurrency or unstable network environments, frequent disconnections and reconnections not only degrade the user experience but also risk exhausting system resources. In particular, when using WebSocket as the real-time signaling layer, poor TCP TIME_WAIT handling can rapidly deplete available connection ports, obstructing the creation of new sessions.
These two challenges affect different but equally critical layers: one determines the perceived authenticity of the experience, the other defines the operational sustainability of the system.
In our real-world projects, both issues demanded extensive debugging and architectural revision. We started by aligning disparate audio and video timelines, leveraging the WebRTC protocol stack and Python’s asynchronous scheduling to build a frame-level synchronized media delivery model. In parallel, we analyzed the TCP state mechanics at the WebSocket layer to optimize connection handling and mitigate TIME_WAIT–induced bottlenecks.
This blog post focuses on those two fronts: (1) how to implement frame-level control to achieve WebRTC audio-video synchronization; and (2) how to optimize WebSocket connection management under high concurrency to avoid TIME_WAIT pitfalls.
It is both a technical breakdown and a reflection of our experience building the communication backbone of a digital human system. We hope it provides insight and practical value to those developing real-time AI interaction platforms.
2. Precise Synchronization: The Mechanism Behind Audio-Visual Coordination
In digital human systems, audio-video synchronization isn’t just about quality—it’s what makes the interaction feel human. If the visuals and audio fall out of sync, even with accurate speech and natural tone, the illusion breaks. This section dives into a real-time synchronization mechanism built with WebRTC and Python, from system architecture to coroutine scheduling, unpacking how we enable lip-synced, frame-accurate output.
2.1 Why Synchronization Is Hard: Technical Asymmetry at Its Core
While syncing audio and video may sound simple in theory—just match the mouth to the voice—it comes with a series of nontrivial engineering challenges:
▪️Characteristics of audio data: Typically lightweight, streamable, with a high sampling rate (e.g., 24kHz). Audio is pushed every 20ms in fine-grained chunks, making it more forgiving and reactive.
▪️Characteristics of video frames: Large in size, computationally heavier, and frame-limited (e.g., 30FPS). It’s more prone to delays or dropped frames due to processing and network constraints.
▪️The complexity of time alignment: Audio and video streams operate on different clocks and sampling bases. To make them appear synchronous, the system must enforce a unified timing reference.
These discrepancies mean that frame-level synchronization isn’t just a matter of timing—it requires deliberate design across the architecture: building a shared timeline, coordinating tasks precisely, and handling variance with resilience.
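To make the asymmetry concrete, here is the back-of-the-envelope arithmetic behind those numbers (a small illustrative sketch; the rates are the same ones used throughout this post):

SAMPLE_RATE = 24_000      # audio samples per second
AUDIO_CHUNK_S = 0.02      # one audio push every 20 ms
FPS = 30                  # video frame rate
VIDEO_CLOCK = 90_000      # WebRTC's 90 kHz video timebase

samples_per_chunk = int(SAMPLE_RATE * AUDIO_CHUNK_S)  # 480 samples per push
frame_interval_ms = 1000 / FPS                        # ~33.3 ms between frames
pts_ticks_per_frame = VIDEO_CLOCK // FPS              # 3000 ticks of the 90 kHz clock

print(samples_per_chunk, frame_interval_ms, pts_ticks_per_frame)

Audio thus advances in many small, evenly spaced steps, while video moves in larger, sparser ones; the synchronization mechanism has to reconcile the two against one shared clock.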
2.2 System Architecture and Core Components
To enable low-latency, high-precision transmission, the system employs the following core architecture:
▪️Media Transport Layer: Real-time delivery is handled through WebRTC, ensuring minimal transmission delay.
▪️Track Implementation via aiortc: Custom AudioStreamTrack and VideoStreamTrack classes are defined to control media flow manually.
▪️Stream Design:
▪️SingleFrameVideoStreamTrack: On each recv() call, it returns the current frame, assigning a PTS (presentation timestamp) based on time.monotonic(), using a 90kHz clock base to align with WebRTC standards.
▪️SingleFrameAudioStreamTrack: Also polled via recv(), this stream sends 20ms PCM audio blocks, accumulating the count of audio samples sent to calculate timing.
Both tracks are launched using a unified media dispatch task, push_av_segment(), and they share a common start time as their reference anchor. This design enables synchronized coroutine scheduling across audio and video, ensuring tightly coupled media playback in real time. The detailed construction is as follows:
2.2.1 SingleFrameVideoStreamTrack: Frame-Level Controlled Video Stream
import fractions
import time

import numpy as np
from aiortc.mediastreams import VideoStreamTrack
from av import VideoFrame


class SingleFrameVideoStreamTrack(VideoStreamTrack):
    def __init__(self, frame=None, fps=30):
        super().__init__()
        # Fall back to a black 4K frame until the first real frame arrives
        self._current_frame = frame if frame is not None else np.zeros((2160, 3840, 3), dtype=np.uint8)
        self._start_time = time.monotonic()
        self._time_base = fractions.Fraction(1, 90000)  # 90kHz timebase
        self._fps = fps

    async def recv(self):
        # PTS is derived from monotonic elapsed time since the track was created
        elapsed = time.monotonic() - self._start_time
        pts = int(elapsed * 90000)  # Timestamp in 90kHz units
        if isinstance(self._current_frame, VideoFrame):
            frame = self._current_frame
        else:
            frame = VideoFrame.from_ndarray(self._current_frame, format="bgr24")
        frame.pts = pts
        frame.time_base = self._time_base
        return frame

    async def update_frame(self, new_frame):
        # Accept either a raw ndarray or an av.VideoFrame; store as ndarray
        if isinstance(new_frame, VideoFrame):
            arr = new_frame.to_ndarray(format="bgr24")
        else:
            arr = new_frame
        self._current_frame = arr
Key Characteristics:
▪️The recv() method is the core interface through which WebRTC pulls the next video frame. Each call returns a single frame.
▪️Elapsed time is calculated using time.monotonic() and multiplied by 90,000 to generate a WebRTC-compatible PTS (Presentation Timestamp).
▪️The current frame is updated externally via update_frame(), which sets the next frame that recv() will return.
▪️time_base is set to 1/90000, a standard time unit used in video protocols for precision.
▪️Since frames are passively pulled rather than actively pushed, playback timing is ultimately governed by WebRTC’s internal rhythm, not the application itself.
2.2.2 SingleFrameAudioStreamTrack: Finely Sliced Audio Stream
import asyncio
import fractions
from collections import deque

import numpy as np
from aiortc.mediastreams import AudioStreamTrack, MediaStreamError
from av import AudioFrame


class SingleFrameAudioStreamTrack(AudioStreamTrack):
    kind = "audio"

    def __init__(self, sample_rate=24000, channels=1):
        super().__init__()
        self.sample_rate = sample_rate
        self.channels = channels
        self._time_base = fractions.Fraction(1, sample_rate)
        self.audio_queue = deque(maxlen=100)
        self._samples_sent = 0

    async def recv(self):
        if self.readyState != "live":
            raise MediaStreamError
        # Wait until the producer has pushed the next 20ms PCM block
        while not self.audio_queue:
            await asyncio.sleep(0.001)
        pcm = self.audio_queue.popleft()
        samples = pcm.shape[0]
        frame = AudioFrame(format="s16", layout="mono", samples=samples)
        frame.sample_rate = self.sample_rate
        frame.time_base = self._time_base
        frame.planes[0].update(pcm.tobytes())
        # PTS counts cumulative samples sent since the stream started
        frame.pts = self._samples_sent
        self._samples_sent += samples
        return frame

    def push_audio_data(self, pcm_int16: np.ndarray):
        self.audio_queue.append(pcm_int16)
Key Characteristics:
▪️A deque-based buffer holds PCM audio chunks, each representing 20ms of sound.
▪️On each recv() call, one audio block is dequeued, its sample count calculated, and a corresponding PTS assigned.
▪️pts = self._samples_sent, which tracks the cumulative number of samples sent since the stream began.
▪️time_base is set to 1/24000, matching the 24kHz audio sampling rate.
▪️New audio blocks are continually pushed into the queue via push_audio_data(pcm), typically called within an audio_task() coroutine.
▪️Unlike video, audio is constantly pulled by WebRTC, so the system must maintain a non-empty buffer; otherwise, it risks an audio underrun.
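To show where these two tracks meet the transport layer, here is a minimal wiring sketch using aiortc’s RTCPeerConnection. It assumes the signaling exchange (how the browser’s offer reaches the server) is handled elsewhere, and the attach_tracks helper is our own illustrative name, not part of the system described above.

from aiortc import RTCPeerConnection

async def attach_tracks(offer):
    # `offer` is the RTCSessionDescription received over the signaling channel
    pc = RTCPeerConnection()
    video_track = SingleFrameVideoStreamTrack(fps=30)
    audio_track = SingleFrameAudioStreamTrack(sample_rate=24000)
    pc.addTrack(video_track)  # from here on, WebRTC pulls frames via recv()
    pc.addTrack(audio_track)
    await pc.setRemoteDescription(offer)
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return pc, video_track, audio_track

Once the peer connection is established, the dispatch task described in the next section feeds both tracks from a shared timeline.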
2.3 Timeline Alignment Mechanism
In media synchronization, the core is not about “matching frame rates” but about a shared start time with independently paced streams. This alignment is implemented through the push_av_segment() method.
2.3.1 Unified Time Anchor: start_time = time.monotonic()
start_time = time.monotonic()
The system captures a unified starting reference using Python’s time.monotonic(), which provides a steadily increasing clock immune to system time changes. This start_time serves as the shared baseline for both the audio and video coroutines, ensuring both streams “start running from the same moment.”
2.3.2 Audio Coroutine: Pushing One Audio Chunk Precisely Every 20ms
sample_rate = 24000
audio_chunk_duration = 0.02  # 20 ms
chunk_samples = int(sample_rate * audio_chunk_duration)  # 480 samples per chunk

async def audio_task():
    # `waveform`, `total_samples`, `audio_track`, `share_state`, and
    # `start_time` come from the enclosing push_av_segment() scope
    pos = 0
    idx = 0
    while pos < total_samples and not share_state.in_break and not share_state.should_stop:
        # Calculate when to push the next audio chunk
        target = start_time + idx * audio_chunk_duration
        now = time.monotonic()
        if now < target:
            await asyncio.sleep(target - now)
        # Slice and pad the PCM samples for this 20 ms window
        end = min(pos + chunk_samples, total_samples)
        block = waveform[pos:end]
        if len(block) < chunk_samples:
            block = np.pad(block, (0, chunk_samples - len(block)))
        pcm = (block * 32767).astype(np.int16)  # Convert float32 to int16 PCM
        audio_track.push_audio_data(pcm)  # Send to aiortc AudioStreamTrack
        pos = end
        idx += 1
Key Points:
▪️Audio data is pushed every 20 milliseconds to ensure precise temporal pacing.
▪️The audio waveform, originally in float format, is normalized and converted into 16-bit PCM integers.
▪️Each chunk is delivered using audio_track.push_audio_data(pcm), feeding the audio queue consumed by recv().
▪️Time alignment is achieved via target = start_time + idx * 0.02, precisely calculating when each chunk should be played.
2.3.3 Video Coroutine: Frame-by-Frame Updates with Accurate PTS Stamping
fps = 30  # Video frame rate

async def video_task():
    # `frames`, `frame_count`, `track`, `share_state`, and `start_time`
    # come from the enclosing push_av_segment() scope
    for i in range(frame_count):
        if share_state.in_break or share_state.should_stop:
            break
        # Calculate when to show the i-th frame
        target = start_time + i / fps
        now = time.monotonic()
        if now < target:
            await asyncio.sleep(target - now)
        img = frames[i]
        vf = VideoFrame.from_ndarray(img, format="bgr24")
        # Stamp pts using elapsed time in 90kHz units
        t_sec = time.monotonic() - start_time
        vf.pts = int(t_sec * 90000)
        vf.time_base = fractions.Fraction(1, 90000)
        await track.update_frame(vf)
Key Points:
▪️Each video frame is withheld until its scheduled display time arrives.
▪️The elapsed time since start_time is calculated via time.monotonic().
▪️The frame’s presentation timestamp (PTS) is then derived as pts = t_sec * 90000, conforming to WebRTC’s 90kHz timing convention.
▪️update_frame(vf) injects the frame into the video track, allowing recv() to deliver it on demand.
2.3.4 Parallel Execution: Asynchronous Launch of Audio and Video Coroutines
task_a = asyncio.create_task(audio_task())
task_v = asyncio.create_task(video_task())
await task_a
task_v.cancel() # Stop video task once audio ends
Explanation:
Audio and video run on separate asynchronous tasks, operating independently and without blocking each other.
Once the audio stream completes, the system explicitly cancels the video task to keep the overall session aligned.
This coroutine-based design provides resilient and fine-grained control over frame-level synchronization.
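For context, here is a condensed sketch of how a dispatch task along the lines of push_av_segment() might tie the pieces together; the parameter names and the elided task bodies are our own shorthand for the session state and decoded media the real system supplies.

import asyncio
import time

async def push_av_segment(audio_track, video_track, waveform, frames, share_state):
    # Shared anchor: both coroutines measure their schedules from this instant
    start_time = time.monotonic()
    total_samples = len(waveform)   # values the inner tasks close over
    frame_count = len(frames)

    async def audio_task():
        ...  # paces 20ms PCM chunks against start_time (see 2.3.2)

    async def video_task():
        ...  # paces frames at 1/fps against start_time (see 2.3.3)

    task_a = asyncio.create_task(audio_task())
    task_v = asyncio.create_task(video_task())
    await task_a      # the audio stream defines the segment's duration
    task_v.cancel()   # stop video once audio ends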
2.3.5 Summary: Core Principles of the Timeline Alignment Mechanism
▪️A single anchor, start_time = time.monotonic(), is shared by the audio and video coroutines.
▪️Audio paces itself in 20ms chunks; its PTS is the cumulative sample count on a 1/24000 timebase.
▪️Video paces itself at 1/fps; its PTS is the elapsed time expressed on the 90kHz timebase.
▪️Both tracks are passively pulled by WebRTC through recv(); the coroutines only decide when data becomes available.
▪️When the audio coroutine finishes, the video coroutine is cancelled so the two timelines end together.
3. In-Depth Look at WebSocket and the TIME_WAIT State
If audio-video synchronization is the technical foundation for “speaking precisely,” then connection management is the systemic foundation for “staying alive.” In real-world deployments, we’ve identified a frequently overlooked issue in real-time systems: the accumulation of TIME_WAIT states, which significantly undermines the stability of WebSocket connections. This chapter addresses the problem in depth—beginning with TCP fundamentals, then analyzing the risks WebSocket faces under high concurrency, and concluding with practical mitigation strategies at both system and architectural levels.
3.1 The Intended Role of TIME_WAIT
TIME_WAIT is not a failure state—it’s an essential safety mechanism built into the TCP protocol. Its origin lies in the standard connection termination process, known as the TCP four-way handshake:
1. The client sends a FIN, indicating it will no longer transmit data;
2. The server replies with an ACK, confirming receipt;
3. The server then sends its own FIN, signaling that it, too, is done;
4. Finally, the client sends an ACK, formally closing the connection.
After this, the client enters TIME_WAIT, remaining there for 2×MSL (Maximum Segment Lifetime)—typically 30 to 120 seconds. This ensures the final ACK is reliably delivered and that no stray packets from this connection interfere with future sessions.
TIME_WAIT is designed to:
▪️Prevent stale packets from being misinterpreted by new connections;
▪️Guarantee that the remote peer receives the final ACK;
▪️Avoid port collisions that could cause session ambiguity.
From a protocol design perspective, TIME_WAIT is a deliberate safeguard, not a flaw. But in WebSocket scenarios that involve frequent reconnects, this same mechanism becomes a critical bottleneck.
3.2 How TIME_WAIT Accumulates in WebSocket Use
3.2.1 Problem 1: Frequent Disconnects Cause TIME_WAIT Overload
In many typical development or runtime workflows, TIME_WAIT states can pile up rapidly:
▪️During debugging, connections are repeatedly opened and torn down;
▪️Malfunctioning heartbeat mechanisms cause frequent disconnect-reconnect cycles;
▪️Server or middleware crashes trigger aggressive client reconnection attempts;
▪️Clients proactively reconnect on intervals to avoid idle disconnections.
Since WebSocket runs on top of TCP, TIME_WAIT accrues on whichever side actively closes the connection; in practice that is usually the client that initiated the session or the API gateway that terminates it, so that is where the states pile up. The scale of the problem is easy to underestimate, as the rough calculation below shows.
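A back-of-the-envelope estimate (the reconnect rate and TIME_WAIT window are illustrative assumptions; the port range is Linux’s default net.ipv4.ip_local_port_range):

reconnects_per_sec = 200             # assumed client churn toward one backend
time_wait_seconds = 60               # a typical 2 x MSL window
ephemeral_ports = 60999 - 32768 + 1  # Linux default ip_local_port_range (~28k ports)

# Steady-state sockets stuck in TIME_WAIT, and the share of the port pool they hold
in_time_wait = reconnects_per_sec * time_wait_seconds  # 12,000 sockets
print(in_time_wait, in_time_wait / ephemeral_ports)    # ~0.42 of the usable range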
3.2.2 Problem 2: High Concurrency Consumes All Available Ports
When TIME_WAIT states grow unchecked, they cause:
▪️Rapid exhaustion of local ephemeral ports (at most 65,535 per local IP, and the range the OS actually hands out is far smaller);
▪️connect() calls failing with “Address already in use” errors;
▪️New connections that cannot be established, causing business-level request failures;
▪️Hard-to-reproduce symptoms such as “intermittent dropouts” or “clients unable to connect.”
A quick way to confirm the diagnosis is to count TIME_WAIT sockets directly, as in the sketch below.
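A rough Linux-only diagnostic (a minimal sketch; it relies on the kernel’s /proc/net/tcp format, where state code 06 means TIME_WAIT):

TIME_WAIT_STATE = "06"  # hex state code for TIME_WAIT in /proc/net/tcp

def count_time_wait(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    total = 0
    for path in paths:
        try:
            with open(path) as f:
                next(f)  # skip the header row
                # column 3 ("st") holds the hex-encoded TCP state
                total += sum(1 for line in f if line.split()[3] == TIME_WAIT_STATE)
        except FileNotFoundError:
            continue  # e.g., IPv6 disabled
    return total

print(f"sockets in TIME_WAIT: {count_time_wait()}")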
3.3 Mitigation Strategies: From Kernel to Architecture
To alleviate these issues, solutions are available at both the operating system and socket levels.
3.3.1 Kernel-Level Tuning (Linux)
Linux systems can be tuned via sysctl to reduce the negative impact of TIME_WAIT:
# Enable TCP TIME-WAIT socket reuse (Recommended)
net.ipv4.tcp_tw_reuse = 1
# Shorten how long orphaned connections linger in FIN-WAIT-2
# (often conflated with the TIME-WAIT timer, which is fixed in the Linux kernel)
net.ipv4.tcp_fin_timeout = 30
# TCP Keepalive settings (Prevent zombie connections)
net.ipv4.tcp_keepalive_time = 600 # Start keepalive probes after 600s idle
net.ipv4.tcp_keepalive_intvl = 30 # Interval between keepalive probes
net.ipv4.tcp_keepalive_probes = 3 # Number of probes before declaring dead
In Windows, the following command can be used to reduce the TIME_WAIT duration to 30 seconds: netsh int tcp set global TcpTimedWaitDelay=30. This adjusts the system-wide TIME_WAIT timeout value to help reclaim local ports more aggressively in high-traffic environments.
3.3.2 Socket-Level Configuration Recommendations
For both WebSocket server and client implementations, the following socket options are highly recommended:
import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
▪️SO_REUSEADDR: Allows reuse of local addresses even if they are still in the TIME_WAIT state, facilitating quicker socket binding.
▪️SO_REUSEPORT: Enables multiple processes or threads to bind to the same port simultaneously, improving scalability and load distribution.
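As an illustration of how these options can be applied in practice, here is a minimal asyncio sketch; it assumes the WebSocket handshake itself is handled by whatever library sits on top of the stream, and the handle_client placeholder is ours:

import asyncio
import socket

async def handle_client(reader, writer):
    # Placeholder handler; a real server would perform the WebSocket
    # upgrade here (e.g., via a library built on asyncio streams).
    writer.close()
    await writer.wait_closed()

async def main():
    # Pre-configure the listening socket so a restarted server can rebind
    # immediately, even while old connections still sit in TIME_WAIT.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):  # not available on every platform
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("0.0.0.0", 8765))

    server = await asyncio.start_server(handle_client, sock=sock)
    async with server:
        await server.serve_forever()

asyncio.run(main())

Passing a pre-bound socket keeps the option handling explicit instead of relying on library defaults.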
4. Conclusion: A Stable System Is the Foundation of Experience
In building digital human systems, whether or not “the lips move with the voice” may sound like a UX detail—but in truth, it reflects the maturity of the underlying technical stack. And whether or not the system can “stay online” is a litmus test for the resilience of its engineering foundation.
Audio-video synchronization and connection management, though seemingly separate, each determine a core dimension:
▪️The realism of the experience — Only when lip movement and speech align do users believe they’re talking to a real “person.”
▪️The system’s sustainability — Only when connections are steady and uninterrupted does the system feel truly “alive.”
A digital human is an interaction modality that demands real-time fidelity. You may think you’re building AI, but you’re actually managing live media. You may think you’re writing WebSocket code, but you’re actually crafting a fault-tolerant interface at the edge of operating systems and network protocols.
Stability is not a bonus—it is the precondition for trust. It is the bedrock upon which a believable experience is built.