Building a Voice-Controlled Robot Using OpenAI Realtime API: A Full-Link Implementation from RDK X5 to ES02

1. Introduction: Enabling Robots to Understand Human Language
Today, consumer-grade robots are no longer exclusive research equipment; they are increasingly being integrated into practical scenarios such as education, entertainment, interaction, and even companionship. Development kits of all kinds, including wheeled chassis, quadrupedal robots, and robotic arms, have emerged, significantly lowering the barrier at the hardware level.
However, challenges arise: How can we communicate more naturally with robots?
1.1 Limitations of Traditional Control Methods
Currently, the vast majority of robots still rely on the following control methods:

Each of these solutions has its merits and drawbacks, but they all face a common issue:
Human–robot communication still faces a "language barrier": robots cannot truly understand the meaning of human natural language.
What we desire is not to “press a button to make the robot move,” but rather to “speak a command, and the robot knows what to do.”
1.2 Why Voice + AI + Real-time?
“Getting robots to understand what you say” requires three key capabilities:
1. Accurately recognizing what the human has said (speech recognition).
2. Understanding the semantic meaning or intended action of the phrase (natural language understanding).
3. Translating the intention into executable action controls (hardware driving).
The challenges have typically been:
▪️Low-quality speech recognition (especially for multilingual input).
▪️No direct way to convert semantics into executable functions (lack of structured output).
▪️Excessive latency or complex deployment, making solutions unsuitable for embedded devices.
1.3 How Does This Project Break Through the Bottleneck?
In this project, we propose a complete closed-loop solution that integrates "speech recognition → action understanding → actual execution" into an automated process:
▪️Utilizing OpenAI's latest Realtime API to achieve real-time voice transmission and action function calls.
▪️Based on the RDK X5 development board: a compact Linux computing platform supporting audio capture, network connectivity, and Python.
▪️Coupled with the ES02 gait robot, which can perform actions such as moving forward, turning, standing up, and squatting through serial control.
The entire system implements:
🎤 You say a phrase → 🧠 Cloud recognition and understanding → 🔧 Serial command issued → 🤖 Robot action executed.
No training or pre-set voice command vocabulary is required; the system supports mixed Chinese and English, automatic multilingual recognition, and low-latency interaction, and keeps hardware costs under $100, offering excellent value for money.
1.4 This Article Will Guide You Through…
This project offers a fully reproducible voice robot solution; this article will delve into:
▪️Demonstration of the project's actual effectiveness (real-time control in Chinese and English).
▪️Introduction to the overall system architecture and hardware/software modules.
▪️Reasons for development board selection and cost analysis.
▪️Every step from deployment to operation (no complex configurations needed).
▪️A complete code breakdown and explanation for developers.
Whether you are:
▪️Looking to create your own voice robot at a low cost.
▪️Interested in understanding the practical implementation of the OpenAI Realtime API.
▪️Or eager to learn about integrating embedded AI applications.
You can find relevant content in this project.
Quick Start Tip: If you want to jump right into the experience and skip the theoretical explanations and system analysis, please directly go to Chapter 4 “Deployment and Operation: From GitHub to Power Control,” where you will find the complete installation steps, dependency configurations, and operating methods.
2. Project Demonstration Effects and Basic Principles
Before delving into the technical solutions, let's take a look at the effects achieved by this project. By integrating voice recognition, function calling, and real-time control, we have successfully implemented a robot voice control system that supports natural language interaction, recognizes both Chinese and English, and reacts in real time. Whether it's "Turn left" or "Stand up", the robot can accurately understand and execute the corresponding action, achieving a human-like interaction experience. The entire system requires no screen and no remote control; it relies solely on a spoken command.
2.1 Demonstration Effects
The core demonstration video of this project is as follows:
In the project video, we control the robot to complete the following key actions through voice commands:

From the demonstration, it is evident that the system not only achieves high accuracy in voice recognition but also parses semantics and correctly maps them to machine instructions. More importantly, all commands are spoken in natural language, entirely eliminating the need for a remote control or screen interaction. Additionally, it showcases the ability to switch languages within the same session, with the system still able to accurately recognize and respond, demonstrating its multilingual processing capability.
2.2 Overall System Flowchart
To achieve such a natural control experience, we have constructed an entire set of collaborative software and hardware systems. Below is a comprehensive overview of the project's technical process:
【Voice Input】 → 【RDK X5 captures audio】 → 【Send to OpenAI Realtime API】
↓
【OpenAI real-time transcription + intent recognition + function call】
↓
【Structured action command generated】 → 【Parsed by local Python module】
↓
【Control values generated for serial output】 → 【Sent via SBUS to ES02 control board】
↓
【Robot executes physical movement】 → 【Feedback completed】
After you say a command, the entire system will complete the following tasks in a short period of time:
1. Voice Collection → Upload to Cloud
2. Real-time Recognition + Intent Understanding
3. Determine if an Action Should Be Triggered (Function Calling)
4. Generate Execution Parameters, such as Direction, Angle, Height, etc.
5. Send Commands via Serial Port → Control Robot Actions.
3. Overview of System Architecture (Hardware-Software Collaboration)
To create a robot that can genuinely “understand human speech”, one must rely not only on a powerful AI model but also on the efficient collaboration between hardware and software. This section will delve into the architecture of this project from two major aspects:
▪️Hardware Structure Design: Why did we choose RDK X5? How does the robot respond to actions?
▪️Software Module Responsibilities: The functionality and collaboration processes of each segment of code.
3.1 Hardware Structure: RDK X5 + ES02 Creates a Hardware-Integrated Closed-Loop System
3.1.1 Project Background and Requirement Analysis
A voice-controlled robot system must meet the following hardware requirements at a minimum:

3.1.2 Why Ultimately Choose RDK X5?
We tested and compared three mainstream development platforms on the market, and the results are as follows:

Considering cost, audio capabilities, AI API support, and ease of deployment, RDK X5 emerged as the optimal choice.
The system hardware consists of four main parts: RDK X5 main control board, ES02 robot chassis, audio interface, and serial communication via /dev/ttyS1.
3.2 Software Module Division: Three-Layer Decoupling with Clear Responsibilities
To ensure good maintainability and scalability of the system, the project adopts a three-layer architecture, corresponding to input processing, semantic translation, and low-level execution.
3.2.1 Overview of Module Structure

3.2.2 Detailed Explanation of Module Responsibilities
💡 Realtime.py – The Smart Control Brain
▪️Calls OpenAI's Realtime API, uploads audio streams, and listens for returned function calls.
▪️Parses recognition results and determines whether each one is a function call.
▪️If a result matches a predefined action function, calls move_robot(action, value).
▪️Supports multi-threaded audio processing, WebSocket bi-directional communication, and real-time logging.
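The audio-upload side of this module can be sketched as follows. The helper names, sample rate, and chunk size here are illustrative assumptions, but the base64-wrapped `input_audio_buffer.append` event is the format the Realtime API expects for streamed PCM16 audio:

```python
import base64
import json

def audio_chunk_to_event(pcm_bytes):
    # The Realtime API expects base64-encoded PCM16 audio inside an
    # input_audio_buffer.append event; this helper builds that JSON.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def stream_microphone(ws, rate=24000, chunk=512):
    # Capture loop (requires pyaudio and a live WebSocket connection);
    # the sample rate and chunk size are illustrative defaults.
    import pyaudio
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                     input=True, frames_per_buffer=chunk)
    while True:
        data = stream.read(chunk, exception_on_overflow=False)
        ws.send(audio_chunk_to_event(data))
```

Keeping the event construction separate from the capture loop makes the encoding step easy to test without a microphone.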
⚙️ ES02_def_function.py – Action Intermediate Layer
▪️Encapsulates the robot's action control logic:
▪️advance(), retreat(): forward and backward movement.
▪️left_rotation(), right_rotation(): minor directional adjustments.
▪️rotate(): continuous rotation in place.
▪️leg_length(): adjusts height for standing and squatting.
▪️Control logic follows the pattern "set channel → wait X seconds → automatically revert to neutral".
▪️Starts a background thread that periodically reverts timed-out channels to neutral (action termination).
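The "set channel → wait → revert to neutral" pattern can be sketched in its simplest, blocking form (the real module uses a background thread instead of sleeping; the channel index 2 and value 1400 below are hypothetical, not the project's actual mapping):

```python
import time

NEUTRAL = 1000                # neutral channel value, per the article
channels = [NEUTRAL] * 16     # simplified stand-in for the shared channel state

def advance(duration_s=1.0, ch=2, value=1400):
    # Set the channel, hold it for the requested time, then revert
    # to neutral so the robot does not keep moving forever.
    channels[ch] = value
    time.sleep(duration_s)
    channels[ch] = NEUTRAL
```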
📡 sbus_out.py – Low-Level Serial Transmitter
▪️Initializes the serial port /dev/ttyS1 at 100000 bps.
▪️Encodes the control values of all 16 channels into a byte stream using the encode_sbus() function.
▪️Sends the byte stream over the serial port 42 times per second to drive the ES02 motion control module.
3.2.3 Why Use the OpenAI Realtime API?
Compared to traditional voice recognition + control systems, the introduction of the OpenAI Realtime API has led to a simplification of the core structure and a transformative user experience.
🚫 The Complex Chain of Traditional Solutions
Voice Input → Speech Recognition (Whisper / Google) → Text Output → NLP Analysis → Match Control Command → Call Action Function
In this process, semantic understanding and control logic are completely separated, requiring multiple intermediary steps to be manually glued together, resulting in:

✅ Innovations Brought by the OpenAI Realtime API
Voice input → OpenAI Realtime API → Real-time recognition + Function Calling → Automatically triggered action functions
With GPT's function-calling capability, the entire voice control pipeline is highly integrated. Core advantages include:
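The key to this integration is registering the action functions as tools when the session starts. A sketch of such a registration event is below; the event and tool field names follow the Realtime API's function-calling format, while the `advance` definition and instruction text are illustrative stand-ins for this project's own declarations:

```python
import json

def build_session_update():
    # session.update event registering one action function as a tool.
    return {
        "type": "session.update",
        "session": {
            "instructions": ("You are a robot controller. Map spoken "
                             "commands to the registered action functions."),
            "tools": [{
                "type": "function",
                "name": "advance",
                "description": "Move the robot forward.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "value": {"type": "number",
                                  "description": "distance or step count"},
                    },
                },
            }],
            "tool_choice": "auto",
        },
    }

# ws.send(json.dumps(build_session_update()))  # sent once after connecting
```

Once registered, the model can answer a spoken command with a structured call to `advance` instead of free-form text.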

4. Deployment and Operation: From GitHub to Power Control (Suitable for General Users)
This section will guide you through the complete deployment process, from obtaining the project source code and configuring the environment, to ensuring that the robot responds correctly to voice commands. Even if you are not a professional developer, you can quickly enable the robot to “understand you” by following these steps.
4.1 Clone the Code / Prepare the Run Files
First, deploy the project source code to the RDK X5 (or another compatible device).
Method 1: Git Clone (Recommended)
git clone https://github.com/fuwei007/Navbot-ES02.git
cd Navbot-ES02/src/RDK_X5
Method 2: Manual Copy
Download the source code zip file from GitHub, extract it, and use SCP or a USB drive to copy it to a directory on the RDK X5.
4.2 Install Required Environment Dependencies
In the terminal of the RDK X5, run the following command to install the necessary dependencies:
pip install openai websocket-client pyaudio python-dotenv pyserial
4.3 Configure the .env Environment Variable File
Create a .env file in the project directory with the following contents:
OPENAI_API_KEY=sk-proj-xxx
ADVANCE_DEFAULT_VALUE=10
RETREAT_DEFAULT_VALUE=10
LEFT_ROTATION_DEFAULT_VALUE=90
RIGHT_ROTATION_DEFAULT_VALUE=90
LEG_LENGTH_DEFAULT_VALUE=5
4.4 Start the Main Program
After confirming that the microphone is properly connected, start the main program:
python Realtime.py
Once started, the program will automatically complete:
1. Initialize microphone input
2. Establish WebSocket connection with the OpenAI Realtime API
3. Actively listen for voice → Automatically recognize commands → Execute actions
🎉 At this point, the deployment is complete. You can tell the robot to "move forward," "turn around," or "squat," and it will respond accurately! If you wish to learn more about the code implementation details, continue reading.
5. Complete Code Implementation Structure and Logic Explanation
5.1 Module One: Realtime.py (Speech Recognition and API Dispatching)
Realtime.py is the brain of the system, responsible for audio input, real-time recognition, event parsing, and function calls. It establishes continuous communication with the OpenAI Realtime API through WebSocket, enabling simultaneous speech recognition and control.
Main Functions and Logic:

Function Execution Flow:

1. Initialize Connection:
▪️connect_to_openai() creates a WebSocket connection to the OpenAI Realtime API.
▪️It simultaneously starts two threads:
▪️Audio upload thread: send_mic_audio_to_websocket(ws)
▪️Result reception thread: receive_audio_from_websocket(ws)
2. Send Initial Session Configuration:
▪️send_fc_session_update(ws) registers the supported speech roles, languages, and function declarations (such as advance, rotate, etc.) for the current session with OpenAI.
3. Real-time Upload of Audio Data:
▪️send_mic_audio_to_websocket() captures microphone input through PyAudio, compresses and encodes it in real time, and pushes it to the API via WebSocket.
4. Receive Speech Recognition & Action Requests:
▪️receive_audio_from_websocket() continuously listens for responses from the API, which include normal speech recognition text and function calls (e.g., advance(dist=0.3)).
5. Dispatch Commands to Action Functions:
▪️Once an action request is parsed, handle_function_call() is called to extract parameters and trigger move_robot(action, value).
6. Forward Actions to the Control Layer:
▪️move_robot() is the only function in the entire module that directly "executes actions," transmitting semantic actions to the ES02 control layer (action middleware).
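Steps 5–6 can be sketched as a small dispatcher. The event shape (a function name plus JSON-encoded arguments) matches the Realtime API's function-call output; `move_robot` is reduced to a stub here, since the real one hands off to the ES02 control layer:

```python
import json

KNOWN_ACTIONS = {"advance", "retreat", "left_rotation",
                 "right_rotation", "rotate", "leg_length"}

def move_robot(action, value):
    # Stand-in: the real move_robot() forwards the action to the
    # ES02 control layer over the serial link.
    return (action, value)

def handle_function_call(event):
    # Pull the function name and JSON-encoded arguments out of a
    # completed function-call event and dispatch to move_robot().
    name = event.get("name")
    args = json.loads(event.get("arguments") or "{}")
    if name in KNOWN_ACTIONS:
        return move_robot(name, args.get("value"))
    return None  # unknown function: ignore rather than crash
```

Rejecting unknown names keeps a mis-parsed model response from reaching the hardware.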
5.2 Module Two: ES02_def_function.py (Action Control Logic)
This module is responsible for mapping high-level semantics (e.g., “move forward 1 meter”) to low-level control channel changes, completing the encapsulation of action functions and timed restoration. It serves as a bridge between semantics and physical control.
Overview of Action Functions:

Auxiliary Threads and Mechanisms:
▪️start_ES02_ch_timing_processing_thread(): starts a channel auto-reset thread to prevent actions from sticking.
▪️ch_timing_thread(): checks each channel's hold time every 100 ms and automatically resets expired channels to the neutral value (1000).
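The auto-reset mechanism can be sketched as below. The function names and the lock-protected state are illustrative (the project's own thread may track timing differently), but the shape is the same: actions record a deadline, and a daemon thread polls every 100 ms to revert expired channels:

```python
import threading
import time

NEUTRAL = 1000
channels = [NEUTRAL] * 16
deadlines = {}                 # channel index -> absolute revert time
lock = threading.Lock()

def hold_channel(ch, value, hold_s):
    # Non-blocking action: set the value and record when the
    # background thread should put it back to neutral.
    with lock:
        channels[ch] = value
        deadlines[ch] = time.time() + hold_s

def ch_timing_loop(poll_s=0.1):
    # Runs in a daemon thread: every 100 ms, revert expired channels.
    while True:
        now = time.time()
        with lock:
            for ch in list(deadlines):
                if now >= deadlines[ch]:
                    channels[ch] = NEUTRAL
                    del deadlines[ch]
        time.sleep(poll_s)

def start_timing_thread():
    t = threading.Thread(target=ch_timing_loop, daemon=True)
    t.start()
    return t
```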
5.3 Module Three: sbus_out.py (Low-Level Serial Output)
This module is responsible for encoding the channel values set by the upper layer using the SBUS protocol, and sending them to the robot’s control board through the serial port /dev/ttyS1 at a frequency of 42 Hz.
Core Functions:

Protocol Description:
▪️Serial Port: 100000 bps, EVEN parity, 2 stop bits.
▪️Frame Rate: 42 Hz (control commands updated 42 times per second).
▪️Channel Initialization Values: [333, 333, ...], with the control channels' neutral value set to 1000.
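An encoder matching this description might look as follows. The bit packing follows the widely used SBUS frame layout (25 bytes: 0x0F header, 22 data bytes holding 16 × 11-bit channels LSB-first, a flags byte, and a 0x00 footer); the project's own encode_sbus() may differ in detail, and open_sbus_port() only runs on the target board with pyserial installed:

```python
def encode_sbus(channels, flags=0x00):
    # Pack 16 channels (11 bits each) into a 25-byte SBUS frame.
    frame = bytearray(25)
    frame[0] = 0x0F                    # SBUS start byte
    bits, nbits, i = 0, 0, 1
    for ch in channels[:16]:
        bits |= (ch & 0x7FF) << nbits  # append 11 bits, LSB-first
        nbits += 11
        while nbits >= 8:              # flush whole bytes
            frame[i] = bits & 0xFF
            bits >>= 8
            nbits -= 8
            i += 1
    frame[23] = flags
    frame[24] = 0x00                   # footer
    return bytes(frame)

def open_sbus_port():
    # Matches the serial settings described above; requires pyserial.
    import serial
    return serial.Serial("/dev/ttyS1", baudrate=100000,
                         parity=serial.PARITY_EVEN,
                         stopbits=serial.STOPBITS_TWO)

# Send loop at ~42 Hz:
#   port.write(encode_sbus(channels)); time.sleep(1 / 42)
```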
6. Frequently Asked Questions and Optimization Suggestions
In the process of developing and running a voice-controlled robot system, you may encounter some technical issues or performance bottlenecks. This chapter organizes common fault scenarios and response strategies, while also proposing some advanced optimization directions worth exploring to help you improve the system’s stability and interaction quality.
6.1 Connection Failure or Timeout
Problem Manifestations:
▪️The program gets stuck in the WebSocket connection stage.
▪️Errors such as "handshake failed" or "Temporary failure in name resolution."
▪️The server does not respond, resulting in a connection timeout.
Cause Analysis:
▪️Network environment restrictions prevent access to the OpenAI API.
▪️Using an IPv6 network while the server does not support it.
▪️Incorrect or missing API key configuration.
▪️Forgetting to set the Authorization field in the request header.
Resolution Suggestions:
▪️Confirm that the OPENAI_API_KEY in the .env file is configured correctly, with the format starting with sk-.
▪️Check whether a proxy or VPN is being used, and set the http_proxy/socks_proxy environment variables if necessary.
▪️Use create_connection_with_ipv4() to force WebSocket to use IPv4 connections.
▪️Implement a retry mechanism to handle occasional network failures.
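The last two suggestions can be combined in one helper. This is an illustrative reimplementation of the `create_connection_with_ipv4()` named above, not the project's exact code: it temporarily patches `socket.getaddrinfo` so DNS resolution returns only IPv4 addresses, then connects with websocket-client, retrying with simple exponential backoff:

```python
import socket
import time

def ipv4_only(addrinfo_results):
    # Keep only IPv4 (AF_INET) entries from a getaddrinfo result list.
    return [ai for ai in addrinfo_results if ai[0] == socket.AF_INET]

def create_connection_with_ipv4(url, headers=None, retries=3):
    # Force IPv4 resolution for the duration of the connection attempt.
    from websocket import create_connection  # needs websocket-client
    orig = socket.getaddrinfo
    socket.getaddrinfo = lambda *a, **kw: ipv4_only(orig(*a, **kw))
    try:
        for attempt in range(retries):
            try:
                return create_connection(url, header=headers or [])
            except OSError:
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # backoff: 1 s, 2 s, ...
    finally:
        socket.getaddrinfo = orig  # always restore the original resolver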
6.2 Audio Stuttering or Delay
Problem Manifestations:
▪️The playback of the voice response is intermittent.
▪️User speech is not fully recognized or gets cut off.
▪️Control actions respond slowly, or commands may be lost.
Cause Analysis:
▪️Queue congestion causes processing speed to lag.
▪️frames_per_buffer is too small or too large, leading to abnormal audio chunking.
▪️The playback buffer (audio_buffer) is not processed in a timely manner.
Resolution Suggestions:
▪️Set a reasonable CHUNK_SIZE, typically either 320 (20 ms) or 512.
▪️Ensure that the sending thread and playback thread are daemon threads (daemon=True) to avoid blocking.
▪️Use performance monitoring tools to observe thread states and identify bottlenecks promptly.
6.3 Inaccurate Action Execution
Problem Manifestations:
▪️The user says "move forward three steps," but the robot executes the wrong command.
▪️Semantic recognition failures or missing parameters lead to a NoneType error.
▪️The value in the function call is empty.
Cause Analysis:
▪️The prompt design is unclear, causing the model to misinterpret semantic intent.
▪️Lack of context or clear instruction formats.
▪️Missing default parameter values for functions lead to parsing errors when arguments are omitted.
Resolution Suggestions:
▪️Enhance the clarity of system_instruction, for example: "You are a robot control expert. Please generate action commands based on voice instructions, such as moving forward, backward, or turning left."
▪️Set default values in the .env file, like ADVANCE_DEFAULT_VALUE=2, to increase fault tolerance.
▪️Increase log output to print received parameter content for easier debugging.
▪️Consider designing function calls as a structured schema to clarify the range of each field.
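The last two suggestions can be sketched together: a tighter tool schema that marks `value` as required and bounds its range, plus a fallback to the .env default when the model still omits it. The numeric limits here are illustrative, not values from the project:

```python
import os

def advance_schema():
    # Tool schema for "advance" with a required, bounded value field,
    # so the model cannot return an empty or out-of-range argument.
    return {
        "type": "function",
        "name": "advance",
        "description": "Move the robot forward by a number of steps.",
        "parameters": {
            "type": "object",
            "properties": {
                "value": {"type": "number", "minimum": 1, "maximum": 10,
                          "description": "steps forward"},
            },
            "required": ["value"],
        },
    }

def value_or_default(args):
    # Fall back to the .env default (ADVANCE_DEFAULT_VALUE) when the
    # model omits "value", instead of crashing on None.
    return args.get("value", float(os.getenv("ADVANCE_DEFAULT_VALUE", "2")))
```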
A high-quality system prompt typically consists of the following five parts:

7. Conclusion
Through this project, we have built a real-time voice-controlled robot system based on the OpenAI Realtime API from scratch. The system not only receives user voice commands and performs semantic recognition in real time but also triggers structured function calls to drive local action control logic, ultimately forming a complete closed loop of "Speak → Understand → Execute → Feedback." Real-time voice interaction is a significant direction in the field of human-computer interaction. By combining the OpenAI Realtime API with Function Calling, you have a highly promising development platform. We hope this tutorial not only helps you complete an interesting project but also inspires you to create more innovative applications based on voice and AI.
Now, pick up the microphone and let your robot "listen to you and act accordingly."
📂 GitHub Repository (Open Source): https://github.com/fuwei007/Navbot-ES02/tree/main/src/RDK_X5
📽️ Video Demonstration: https://www.youtube.com/watch?v=YUhkF7lPQ0k&t=80s









