Digital Human Tech Upgrade: Performance Breakthrough with Python & WebRTC
In our project, we successfully migrated the audio-video processing functionality from Java to Python. This transformation significantly optimized the overall system architecture, particularly for video streaming. One of the most important improvements was the transition from the WebSocket protocol to WebRTC, which gives us more efficient, low-latency audio-video data transmission. Additionally, by loading all audio and image data directly into memory, we eliminated disk I/O operations, a major improvement for systems with strict real-time requirements.
1. Framework Refactoring
1.1 Challenges and Solutions in Java to Python Migration
During the migration of audio-video processing from Java to Python, we encountered several technical challenges, particularly regarding performance optimization and library support differences. While Java offers robust support for audio-video processing, we chose Python for its greater flexibility in handling high concurrency, data processing, and integration with WebRTC. Python’s coroutines and asynchronous programming model (asyncio) significantly simplified our real-time data stream management, especially for meeting real-time requirements in audio-video processing.
During migration, we focused in particular on how to manage audio and video data more efficiently in Python. By loading all data into memory and avoiding disk I/O operations, we removed the latency introduced by filesystem interactions. Although Python’s multithreading and multiprocessing support differs somewhat from Java’s, careful thread-pool and asynchronous programming design kept overall system performance intact.
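As a rough illustration of this in-memory approach (not our exact loader), the sketch below preloads each segment’s video frames and audio waveform into the global_frame_map and global_audio_frame_map dictionaries used later by the push code. The use of OpenCV and soundfile for decoding, and the video.mp4 / audio.wav file layout under segments_dir, are assumptions made for the example.
import cv2                   # Assumed decoder for this sketch; any frame source works
import soundfile as sf       # Assumed audio reader for this sketch
from pathlib import Path

global_frame_map = {}        # segment index -> list of BGR frames (numpy arrays)
global_audio_frame_map = {}  # segment index -> float waveform in [-1, 1]

def preload_segments(segments_dir):
    """Read every segment from disk once at startup so the push path never touches disk."""
    for idx, seg in enumerate(sorted(Path(segments_dir).iterdir())):
        cap = cv2.VideoCapture(str(seg / "video.mp4"))   # Hypothetical per-segment file name
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)                         # Keep decoded BGR frames in RAM
        cap.release()
        waveform, _ = sf.read(str(seg / "audio.wav"), dtype="float32")  # Hypothetical file name
        global_frame_map[idx] = frames
        global_audio_frame_map[idx] = waveform           # Pushed later in 20 ms chunks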
1.2 WebRTC vs WebSocket: Why Choose WebRTC?
WebSocket was initially designed for real-time communication and was widely used in our previous audio-video streaming architecture. However, as audio-video data volume and real-time requirements increased, we observed that WebSocket introduced significant latency during audio-video streaming, particularly in poor network conditions where it couldn’t guarantee real-time performance or multi-stream synchronization.
Compared to WebSocket, WebRTC is specifically designed for low-latency, high-quality audio-video communication, maintaining stable connections even in complex network conditions. WebRTC incorporates built-in fault tolerance mechanisms for network jitter, packet loss, and latency, supporting more efficient data transmission – making it ideal for real-time audio-video applications. Therefore, we decided to replace WebSocket with WebRTC to ensure smooth audio-video streaming with minimal latency.
WebRTC’s connectivity stack relies on three components: STUN, TURN, and ICE. STUN handles NAT traversal, TURN relays data through an intermediary server when a direct path is unavailable, and ICE coordinates the gathered candidates to select the best transmission path for the current network environment.
2. Real-Time Audio-Video Processing
WebRTC uses VideoStreamTrack and AudioStreamTrack objects to handle audio-video streams. These objects are responsible for receiving and sending video frames and audio data respectively. The RTCPeerConnection encapsulates them in sessions to handle media negotiation, encoding, decoding, and other tasks.
- Audio-Video Stream Transmission and Reception: In Python, we implemented video and audio streaming through VideoStreamTrack and AudioStreamTrack. These streams are managed by RTCPeerConnection objects to ensure temporally ordered data delivery and synchronization.
- Negotiation and Data Exchange: In WebRTC connections, two endpoints need to exchange media negotiation information via SDP (Session Description Protocol). This allows WebRTC to determine parameters like codecs, resolution, and frame rates for both parties.
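A minimal sketch of this offer/answer exchange, assuming the aiortc library (which matches the class names used throughout this post); the signaling_send and signaling_receive helpers stand in for whatever signaling channel carries the SDP and are not part of our code:
from aiortc import RTCPeerConnection, RTCSessionDescription

async def negotiate(pc, signaling_send, signaling_receive):
    # Create the local SDP offer describing our codecs, media directions, and tracks
    offer = await pc.createOffer()
    await pc.setLocalDescription(offer)
    # Send the offer to the remote peer over any signaling channel (HTTP, WebSocket, ...)
    await signaling_send({"type": pc.localDescription.type, "sdp": pc.localDescription.sdp})
    # Apply the remote answer; afterwards both sides agree on codecs, resolution, and frame rates
    answer = await signaling_receive()
    await pc.setRemoteDescription(RTCSessionDescription(sdp=answer["sdp"], type=answer["type"]))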
2.1 Audio Processing
Real-time audio processing is equally critical. To ensure synchronized audio-video transmission, we used the SingleFrameAudioStreamTrack class to manage audio streams. Whenever new audio data is received, audio frames are added to a queue and pushed in real-time via the recv() method.
Unlike a traditional disk-backed pipeline, our audio data is held and transmitted entirely in memory. It is segmented and pushed every 20 milliseconds, ensuring efficient transmission and low latency. Furthermore, audio frames are synchronized with video frames through timestamps, guaranteeing synchronized playback.
Below is the code implementation for audio data pushing:
import asyncio
import fractions
from collections import deque
from av import AudioFrame
from aiortc.mediastreams import AudioStreamTrack

class SingleFrameAudioStreamTrack(AudioStreamTrack):
    def __init__(self, sample_rate=24000):
        super().__init__()
        self.sample_rate = sample_rate
        self.audio_queue = deque()  # Chunks appended by push_audio_data()
        self._timestamp = 0

    def push_audio_data(self, audio_data):
        self.audio_queue.append(audio_data)  # Queue one chunk of int16 samples, shape (n, 1)

    async def recv(self):
        while not self.audio_queue:              # Wait until there's audio data in the queue
            await asyncio.sleep(0.005)           # Sleep briefly and retry if no data
        audio_data = self.audio_queue.popleft()  # Get the next chunk of audio data
        samples = audio_data.shape[0]            # Number of samples in this chunk
        # Create an audio frame to send over the WebRTC stream
        frame = AudioFrame(format="s16", layout="mono", samples=samples)
        frame.sample_rate = self.sample_rate                        # Set the sample rate
        frame.time_base = fractions.Fraction(1, self.sample_rate)   # One tick per sample
        # Add the audio data into the frame and stamp it for playback
        frame.planes[0].update(audio_data.tobytes())  # Convert data to bytes and store in frame
        frame.pts = self._timestamp                   # Set the timestamp (presentation time)
        self._timestamp += samples                    # Increment timestamp for the next frame
        return frame                                  # Return the audio frame to be transmitted
2.2 Video Processing
For video processing, we use the SingleFrameVideoStreamTrack class to process video data frame by frame. The timestamp of each video frame is controlled by the frame rate, ensuring video continuity. During the pushing process, video frames are transmitted directly from memory to the WebRTC stream, avoiding the performance bottleneck of disk operations.
Audio-video synchronization is one of the core optimizations of this architecture. Whenever a video frame is pushed, we synchronously push audio data based on the video frame’s timestamp. Below is the key synchronization push code:
import asyncio
import fractions
from av import VideoFrame
from aiortc.mediastreams import VideoStreamTrack

class SingleFrameVideoStreamTrack(VideoStreamTrack):
    def __init__(self):
        super().__init__()
        self._lock = asyncio.Lock()      # Guards access to the current frame
        self._current_frame = None       # Latest frame: VideoFrame or BGR ndarray
        self._timestamp = 0
        self._time_base = fractions.Fraction(1, 100000)  # Assumed clock: 3300 ticks ≈ 33 ms

    async def update_frame(self, frame):
        async with self._lock:
            self._current_frame = frame  # Swap in the newest frame to be sent

    async def recv(self):
        async with self._lock:  # Ensure thread-safe access to the frame
            if isinstance(self._current_frame, VideoFrame):
                frame = self._current_frame  # If the current frame is a VideoFrame, use it
            else:
                # Otherwise, convert the current frame data (numpy array) to a VideoFrame
                frame = VideoFrame.from_ndarray(self._current_frame, format='bgr24')
            frame.pts = self._timestamp        # Set the presentation timestamp (PTS)
            frame.time_base = self._time_base  # Time base for the frame
            self._timestamp += 3300            # ~33 ms per frame at 30 fps
            return frame                       # Return the video frame to be transmitted
2.3 Audio-Video Synchronization Implementation
Audio-video synchronization is a core issue that needs to be addressed when pushing audio and video data. In practical applications, the pushing of audio and video must be maintained within the same time window to ensure synchronized playback. In the code, we achieve audio-video synchronization through the following methods:
- Synchronized Pushing of Audio and Video Data:
  - In the push_av_segment() method, we push the audio data that corresponds to each video frame’s timestamp; the frame’s timestamp tells us which time segment of audio to push next.
  - We use await asyncio.sleep(0.02) to control the interval between audio segment pushes, so audio data is pushed every 20 milliseconds. This interval works in tandem with the video frame duration (roughly 33 milliseconds) to maintain synchronization.
- Controlling the Audio Frame Push Rate:
  - Based on the video frame’s timestamp and the length of the audio data, we dynamically adjust the amount of audio pushed each time, preventing audio-video desynchronization caused by network latency or data loss.
Below is the complete code for synchronized audio-video pushing:
import asyncio
import time

import numpy as np
from av import VideoFrame

# `track` (video), `audio_track`, `global_frame_map`, and `global_audio_frame_map`
# are module-level objects created when the WebRTC connection is set up.

async def push_av_segment(segment_index):
    """Synchronously push one audio-video segment"""
    try:
        frames = global_frame_map[segment_index]          # Pre-decoded BGR video frames
        waveform = global_audio_frame_map[segment_index]  # Float audio samples in [-1, 1]
        sample_rate = 24000  # Audio sample rate (24 kHz)
        fps = 33             # Video frame rate used by the pipeline

        # Calculate the audio duration in seconds
        audio_duration = len(waveform) / sample_rate
        # Calculate the total number of video frames required for this audio duration
        video_frame_count = min(len(frames), int(audio_duration * fps))

        # Define the chunk size for audio (20 ms per chunk)
        chunk_samples = int(0.02 * sample_rate)
        audio_pos = 0

        # Define the duration of a single video frame (in seconds)
        frame_duration = 1 / fps
        start_time = time.time()  # Start timing to ensure accurate frame pacing

        # Loop through the video frames and keep the audio in step
        for frame_idx in range(video_frame_count):
            # Convert the video frame to WebRTC format and update the track
            video_frame = VideoFrame.from_ndarray(frames[frame_idx], format='bgr24')
            await track.update_frame(video_frame)

            # Calculate how far the audio should have advanced by this video frame
            expected_audio_pos = int(frame_idx * frame_duration * sample_rate)

            # Push audio chunks until the audio position catches up with the video
            while audio_pos < expected_audio_pos and audio_pos < len(waveform):
                chunk_end = min(audio_pos + chunk_samples, len(waveform))
                chunk = waveform[audio_pos:chunk_end]
                # If the chunk is shorter than expected, pad it to keep a consistent size
                if len(chunk) < chunk_samples:
                    chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
                # Push the audio data (converted to int16 format) to the audio track
                audio_track.push_audio_data((chunk * 32767).astype(np.int16).reshape(-1, 1))
                audio_pos = chunk_end
                # Sleep to maintain audio pacing (20 ms per chunk)
                await asyncio.sleep(0.02)

            # Control the video frame rate by comparing elapsed time with the expected frame time
            elapsed = time.time() - start_time
            expected_time = (frame_idx + 1) * frame_duration
            if elapsed < expected_time:
                await asyncio.sleep(expected_time - elapsed)
    except Exception as e:
        print(f"❌ Segment {segment_index} push failed: {str(e)}")
3. Python and WebRTC Collaboration
The advantage of the WebRTC protocol lies in its design specifically for real-time data transmission. We establish WebRTC connections via RTCPeerConnection and utilize VideoStreamTrack and AudioStreamTrack for sending and receiving audio/video streams. Below is the core WebRTC configuration code:
from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection

ice_servers = [RTCIceServer(
    urls="turn:freestun.net:3478",  # TURN server address
    username="free",                # Username
    credential="free"               # Password
)]
configuration = RTCConfiguration(ice_servers)
pc = RTCPeerConnection(configuration=configuration)
The WebRTC protocol stack includes STUN, TURN, and ICE, each serving the following purposes:
- STUN (Session Traversal Utilities for NAT): Used to traverse NAT (Network Address Translation) and firewalls to determine the client’s public IP address.
- TURN (Traversal Using Relays around NAT): When STUN fails to traverse firewalls or NAT, TURN acts as a relay server to forward packets, ensuring stable communication.
- ICE (Interactive Connectivity Establishment): Through ICE, WebRTC evaluates the available candidate connections and selects the best network path for data transmission.
By configuring TURN servers, WebRTC can maintain stable operation even in complex network environments. The use of RTCPeerConnection enables smooth data transmission while ensuring audio-video synchronization.
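To round out the setup, here is a hedged sketch of how the custom tracks from Section 2 could be attached to this connection before negotiation; the variable names mirror the ones used in the push code above:
# Create the custom tracks from Section 2 and register them on the peer connection,
# so they are included in the SDP offer and streamed once negotiation completes.
track = SingleFrameVideoStreamTrack()
audio_track = SingleFrameAudioStreamTrack(sample_rate=24000)

pc.addTrack(track)
pc.addTrack(audio_track)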
4. System Optimization and Future Prospects
Other Performance Optimization Methods
After eliminating disk I/O operations, our system has significantly improved real-time performance, but we can further enhance it. For example, using memory pools and efficient data structures to reduce memory allocation overhead, or employing asynchronous I/O to optimize network request processing speed. Additionally, parallel processing of multiple video streams or distributed processing can further reduce system load and improve response times.
Complexity of Audio-Video Synchronization
Audio-video synchronization is not just a simple timestamp-based operation. It also requires consideration of network latency, buffer management, data loss, and other factors. For instance, we may need to handle bursty video frame data through appropriate buffering mechanisms during the pushing process. More sophisticated algorithms can also be employed to adjust the rate of audio and video frames, ensuring smooth playback without data loss.
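As one possible direction rather than something we currently ship, a small bounded buffer in front of the video track can absorb bursty frame arrivals by dropping the oldest frame when the producer runs ahead of the consumer:
from collections import deque

class FrameBuffer:
    """Bounded frame buffer: absorbs short bursts and drops the oldest frame when full."""
    def __init__(self, max_frames=3):
        self._frames = deque(maxlen=max_frames)  # Oldest frames are evicted automatically

    def put(self, frame):
        self._frames.append(frame)               # Producer side: frames may arrive in bursts

    def get_latest(self):
        return self._frames[-1] if self._frames else None  # Consumer takes the freshest frame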
System Scalability
As audio-video processing demands grow, we need to consider how to make the system more scalable. Through horizontal scaling (e.g., load balancing and distributed deployment), we can support larger-scale concurrent audio-video stream processing. For large-scale real-time applications, cloud deployment and a microservices architecture will help better manage and allocate resources, improving system stability and scalability.
Thanks for following along with the real-time digital human series! A video version of this blog is also available below. If you’re interested in the implementation details of this digital human tech upgrade, or have any questions, feel free to leave a comment.
And welcome to explore my YouTube channel https://www.youtube.com/@frankfu007 for more exciting content. If you enjoy the videos, don’t forget to like and subscribe for more insights!
Interested in more applications of the Realtime API in practical projects? Explore my articles linked below!
- Education Nano 01 – Modular Wheel-Leg Robot for STEM
- Audio-Visual Synchronization Algorithms in Digital Humans and the TIME_WAIT Challenge in WebSocket Communication
- Building a Voice-Controlled Robot Using OpenAI Realtime API: A Full-Link Implementation from RDK X5 to ES02
- Desktop Balancing Bot(ES02)-Dual-Wheel Legged Robot with High-Performance Algorithm
- Wheeled-Legged Robot ES01: Building with ESP32 & SimpleFOC