Building a Voice-Controlled Robot Using OpenAI Realtime API: A Full-Link Implementation from RDK X5 to ES02

1. Introduction: Enabling Robots to Understand Human Language
Today, consumer-grade robots are no longer exclusive research equipment; they are increasingly being integrated into practical scenarios such as education, entertainment, interaction, and even companionship. Development kits of all kinds, including wheeled chassis, quadrupedal robots, and robotic arms, have emerged, significantly lowering the barrier at the hardware level.
However, challenges arise: How can we communicate more naturally with robots?
1.1 Limitations of Traditional Control Methods
Currently, the vast majority of robots still rely on the following control methods:

Each of these solutions has its merits and drawbacks, but they all face a common issue:
Human–robot communication still faces a "language barrier": robots cannot truly understand the meaning of human natural language.
What we desire is not to “press a button to make the robot move,” but rather to “speak a command, and the robot knows what to do.”
1.2 Why Voice + AI + Real-time?
“Getting robots to understand what you say” requires three key capabilities:
1. Accurately recognizing what the human has said (speech recognition).
2. Understanding the semantic meaning or intended action of the phrase (natural language understanding).
3. Translating the intention into executable action controls (hardware driving).
The challenges have typically been:
▪️Low-quality speech recognition (especially for multilingual input).
▪️No direct way to convert semantics into executable functions (lack of structured output).
▪️Excessive latency or complex deployment, making solutions unsuitable for embedded devices.
1.3 How Does This Project Break Through the Bottleneck?
In this project, we propose a complete closed-loop solution that integrates "speech recognition → action understanding → actual execution" into an automated process:
▪️Utilizing OpenAI's latest Realtime API to achieve real-time voice transmission and action function calls.
▪️Based on the RDK X5 development board: a compact Linux computing platform supporting audio capture, network connectivity, and Python.
▪️Coupled with the ES02 gait robot, which can perform actions such as moving forward, turning, standing up, and squatting through serial control.
The entire system implements:
🎤 You say a phrase → 🧠 Cloud recognition and understanding → 🔧 Serial command issued → 🤖 Robot action executed.
No training or pre-set voice command vocabulary is required; the system supports mixed Chinese and English, automatic multilingual recognition, and low-latency interaction, and keeps hardware costs under $100, offering excellent value for money.
1.4 This Article Will Guide You Through…
This project offers a fully reproducible voice robot solution; this article will delve into:
▪️Demonstration of the project's actual effectiveness (real-time control in Chinese and English).
▪️Introduction to the overall system architecture and hardware/software modules.
▪️Reasons for development board selection and cost analysis.
▪️Every step from deployment to operation (no complex configurations needed).
▪️A complete code breakdown and explanation for developers.
Whether you are:
▪️Looking to create your own voice robot at a low cost.
▪️Interested in understanding the practical implementation of the OpenAI Realtime API.
▪️Or eager to learn about integrating embedded AI applications.
You can find relevant content in this project.
Quick Start Tip: If you want to jump right into the experience and skip the theoretical explanations and system analysis, please directly go to Chapter 4 “Deployment and Operation: From GitHub to Power Control,” where you will find the complete installation steps, dependency configurations, and operating methods.
2. Project Demonstration Effects and Basic Principles
Before delving into the technical solutions, let's take a look at the effects achieved by this project. By integrating voice recognition, function calling, and real-time control, we have successfully implemented a robot voice control system that supports natural language interaction, recognizes both Chinese and English, and reacts in real time. Whether it's "Turn left" or "Stand up", the robot can accurately understand and execute the corresponding action, achieving a human-like interaction experience. The entire system requires no screen and no remote control; it relies solely on a spoken command.
2.1 Demonstration Effects
The core demonstration video of this project is as follows:
In the project video, we control the robot to complete the following key actions through voice commands:

From the demonstration, it is evident that the system not only achieves high accuracy in voice recognition but also parses semantics and correctly maps them to machine instructions. More importantly, all commands are spoken in natural language, entirely eliminating the need for a remote control or screen interaction. Additionally, it showcases the ability to switch languages within the same session, with the system still able to accurately recognize and respond, demonstrating its multilingual processing capability.
2.2 Overall System Flowchart
To achieve such a natural control experience, we have constructed an entire set of collaborative software and hardware systems. Below is a comprehensive overview of the project's technical process:
【Voice Input】 → 【RDK X5 captures audio】 → 【Send to OpenAI Realtime API】
↓
【OpenAI real-time transcription + intent recognition + function call】
↓
【Structured action command generated】 → 【Parsed by local Python module】
↓
【Control values generated for serial output】 → 【Sent via SBUS to ES02 control board】
↓
【Robot executes physical movement】 → 【Feedback completed】
After you say a command, the entire system will complete the following tasks in a short period of time:
1. Voice Collection → Upload to Cloud
2. Real-time Recognition + Intent Understanding
3. Determine if an Action Should Be Triggered (Function Calling)
4. Generate Execution Parameters, such as Direction, Angle, Height, etc.
5. Send Commands via Serial Port → Control Robot Actions.
3. Overview of System Architecture (Hardware-Software Collaboration)
To create a robot that can genuinely “understand human speech”, one must rely not only on a powerful AI model but also on the efficient collaboration between hardware and software. This section will delve into the architecture of this project from two major aspects:
▪️Hardware Structure Design: Why did we choose RDK X5? How does the robot respond to actions?
▪️Software Module Responsibilities: The functionality and collaboration processes of each segment of code.
3.1 Hardware Structure: RDK X5 + ES02 Creates a Hardware-Integrated Closed-Loop System
3.1.1 Project Background and Requirement Analysis
A voice-controlled robot system must meet the following hardware requirements at a minimum:

3.1.2 Why Ultimately Choose RDK X5?
We tested and compared three mainstream development platforms on the market, and the results are as follows:

Considering cost, audio capabilities, AI API support, and ease of deployment, RDK X5 emerged as the optimal choice.
The system hardware consists of four main parts: RDK X5 main control board, ES02 robot chassis, audio interface, and serial communication via /dev/ttyS1.
3.2 Software Module Division: Three-Layer Decoupling with Clear Responsibilities
To ensure good maintainability and scalability of the system, the project adopts a three-layer architecture, corresponding to input processing, semantic translation, and low-level execution.
3.2.1 Overview of Module Structure

3.2.2 Detailed Explanation of Module Responsibilities
💡 Realtime.py – The Smart Control Brain
▪️Calls OpenAI's Realtime API, uploads audio streams, and listens for returned function calls.
▪️Parses recognition results and determines whether each one is a function call.
▪️If a result matches a predefined action function, calls move_robot(action, value).
▪️Supports multi-threaded audio processing, WebSocket bi-directional communication, and real-time logging.
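The audio-upload side of this module can be sketched as follows. The helper names, sample rate, and chunk size here are illustrative assumptions, but the base64-wrapped `input_audio_buffer.append` event is the format the Realtime API expects for streamed PCM16 audio:

```python
import base64
import json

def audio_chunk_to_event(pcm_bytes):
    # The Realtime API expects base64-encoded PCM16 audio inside an
    # input_audio_buffer.append event; this helper builds that JSON.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def stream_microphone(ws, rate=24000, chunk=512):
    # Capture loop (requires pyaudio and a live WebSocket connection);
    # the sample rate and chunk size are illustrative defaults.
    import pyaudio
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                     input=True, frames_per_buffer=chunk)
    while True:
        data = stream.read(chunk, exception_on_overflow=False)
        ws.send(audio_chunk_to_event(data))
```

Keeping the event construction separate from the capture loop makes the encoding step easy to test without a microphone.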
⚙️ ES02_def_function.py – Action Intermediate Layer
▪️Encapsulates the robot's action control logic:
▪️advance(), retreat(): forward and backward movement.
▪️left_rotation(), right_rotation(): minor directional adjustments.
▪️rotate(): continuous rotation in place.
▪️leg_length(): adjusts height for standing and squatting.
▪️Control logic follows the pattern "set channel → wait X seconds → automatically revert to neutral".
▪️Starts a background thread that periodically reverts timed-out channels to neutral (action termination).
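The "set channel → wait → revert to neutral" pattern can be sketched in its simplest, blocking form (the real module uses a background thread instead of sleeping; the channel index 2 and value 1400 below are hypothetical, not the project's actual mapping):

```python
import time

NEUTRAL = 1000                # neutral channel value, per the article
channels = [NEUTRAL] * 16     # simplified stand-in for the shared channel state

def advance(duration_s=1.0, ch=2, value=1400):
    # Set the channel, hold it for the requested time, then revert
    # to neutral so the robot does not keep moving forever.
    channels[ch] = value
    time.sleep(duration_s)
    channels[ch] = NEUTRAL
```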
📡 sbus_out.py – Low-Level Serial Transmitter
▪️Initializes the serial port /dev/ttyS1 at 100000 bps.
▪️Encodes the control values of all 16 channels into a byte stream using the encode_sbus() function.
▪️Sends the byte stream over the serial port 42 times per second to drive the ES02 motion control module.
3.2.3 Why Use the OpenAI Realtime API?
Compared to traditional voice recognition + control systems, the introduction of the OpenAI Realtime API has led to a simplification of the core structure and a transformative user experience.
🚫 The Complex Chain of Traditional Solutions
Voice Input → Speech Recognition (Whisper / Google) → Text Output → NLP Analysis → Match Control Command → Call Action Function
In this process, semantic understanding and control logic are completely separated, requiring multiple intermediary steps to be manually glued together, resulting in:

✅ Innovations Brought by the OpenAI Realtime API
Voice input → OpenAI Realtime API → Real-time recognition + Function Calling → Automatically triggered action functions
With GPT's function-calling capability, the entire voice control pipeline is highly integrated. Core advantages include:
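The key to this integration is registering the action functions as tools when the session starts. A sketch of such a registration event is below; the event and tool field names follow the Realtime API's function-calling format, while the `advance` definition and instruction text are illustrative stand-ins for this project's own declarations:

```python
import json

def build_session_update():
    # session.update event registering one action function as a tool.
    return {
        "type": "session.update",
        "session": {
            "instructions": ("You are a robot controller. Map spoken "
                             "commands to the registered action functions."),
            "tools": [{
                "type": "function",
                "name": "advance",
                "description": "Move the robot forward.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "value": {"type": "number",
                                  "description": "distance or step count"},
                    },
                },
            }],
            "tool_choice": "auto",
        },
    }

# ws.send(json.dumps(build_session_update()))  # sent once after connecting
```

Once registered, the model can answer a spoken command with a structured call to `advance` instead of free-form text.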

4. Deployment and Operation: From GitHub to Power Control (Suitable for General Users)
This section will guide you through the complete deployment process, from obtaining the project source code and configuring the environment, to ensuring that the robot responds correctly to voice commands. Even if you are not a professional developer, you can quickly enable the robot to “understand you” by following these steps.
4.1 Clone the Code / Prepare the Run Files
First, deploy the project source code to the RDK X5 (or another compatible device).
Method 1: Git Clone (Recommended)
git clone https://github.com/fuwei007/Navbot-ES02.git
cd Navbot-ES02/src/RDK_X5
Method 2: Manual Copy
Download the source code zip file from GitHub, extract it, and use SCP or a USB drive to copy it to a directory on the RDK X5.
4.2 Install Required Environment Dependencies
In the terminal of the RDK X5, run the following command to install the necessary dependencies:
pip install openai websocket-client pyaudio python-dotenv pyserial
4.3 Configure the .env Environment Variable File
Create a .env file in the project directory with the following contents:
OPENAI_API_KEY=sk-proj-xxx
ADVANCE_DEFAULT_VALUE=10
RETREAT_DEFAULT_VALUE=10
LEFT_ROTATION_DEFAULT_VALUE=90
RIGHT_ROTATION_DEFAULT_VALUE=90
LEG_LENGTH_DEFAULT_VALUE=5
4.4 Start the Main Program
After confirming that the microphone is properly connected, start the main program:
python Realtime.py
Once started, the program will automatically complete:
1. Initialize microphone input
2. Establish WebSocket connection with the OpenAI Realtime API
3. Actively listen for voice → Automatically recognize commands → Execute actions
🎉 At this point, the deployment is complete. You can tell the robot to "move forward," "turn around," or "squat," and it will respond accurately! If you wish to learn more about the code implementation details, continue reading.
5. Complete Code Implementation Structure and Logic Explanation
5.1 Module One: Realtime.py (Speech Recognition and API Dispatching)
Realtime.py is the brain of the system, responsible for audio input, real-time recognition, event parsing, and function calls. It establishes continuous communication with the OpenAI Realtime API through WebSocket, enabling simultaneous speech recognition and control.
Main Functions and Logic:

Function Execution Flow:

1. Initialize Connection:
▪️connect_to_openai() creates a WebSocket connection to the OpenAI Realtime API.
▪️It simultaneously starts two threads:
▪️Audio upload thread: send_mic_audio_to_websocket(ws)
▪️Result reception thread: receive_audio_from_websocket(ws)
2. Send Initial Session Configuration:
▪️send_fc_session_update(ws) registers the supported speech roles, languages, and function declarations (such as advance, rotate, etc.) for the current session with OpenAI.
3. Real-time Upload of Audio Data:
▪️send_mic_audio_to_websocket() captures microphone input through PyAudio, compresses and encodes it in real time, and pushes it to the API via WebSocket.
4. Receive Speech Recognition & Action Requests:
▪️receive_audio_from_websocket() continuously listens for responses from the API, which include normal speech recognition text and function calls (e.g., advance(dist=0.3)).
5. Dispatch Commands to Action Functions:
▪️Once an action request is parsed, handle_function_call() is called to extract parameters and trigger move_robot(action, value).
6. Forward Actions to the Control Layer:
▪️move_robot() is the only function in the entire module that directly "executes actions," transmitting semantic actions to the ES02 control layer (action middleware).
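Steps 5–6 can be sketched as a small dispatcher. The event shape (a function name plus JSON-encoded arguments) matches the Realtime API's function-call output; `move_robot` is reduced to a stub here, since the real one hands off to the ES02 control layer:

```python
import json

KNOWN_ACTIONS = {"advance", "retreat", "left_rotation",
                 "right_rotation", "rotate", "leg_length"}

def move_robot(action, value):
    # Stand-in: the real move_robot() forwards the action to the
    # ES02 control layer over the serial link.
    return (action, value)

def handle_function_call(event):
    # Pull the function name and JSON-encoded arguments out of a
    # completed function-call event and dispatch to move_robot().
    name = event.get("name")
    args = json.loads(event.get("arguments") or "{}")
    if name in KNOWN_ACTIONS:
        return move_robot(name, args.get("value"))
    return None  # unknown function: ignore rather than crash
```

Rejecting unknown names keeps a mis-parsed model response from reaching the hardware.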
5.2 Module Two: ES02_def_function.py (Action Control Logic)
This module is responsible for mapping high-level semantics (e.g., “move forward 1 meter”) to low-level control channel changes, completing the encapsulation of action functions and timed restoration. It serves as a bridge between semantics and physical control.
Overview of Action Functions:

Auxiliary Threads and Mechanisms:
▪️start_ES02_ch_timing_processing_thread(): starts a channel auto-reset thread to prevent actions from sticking.
▪️ch_timing_thread(): checks each channel's hold time every 100 ms and automatically resets expired channels to the neutral value (1000).
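The auto-reset mechanism can be sketched as below. The function names and the lock-protected state are illustrative (the project's own thread may track timing differently), but the shape is the same: actions record a deadline, and a daemon thread polls every 100 ms to revert expired channels:

```python
import threading
import time

NEUTRAL = 1000
channels = [NEUTRAL] * 16
deadlines = {}                 # channel index -> absolute revert time
lock = threading.Lock()

def hold_channel(ch, value, hold_s):
    # Non-blocking action: set the value and record when the
    # background thread should put it back to neutral.
    with lock:
        channels[ch] = value
        deadlines[ch] = time.time() + hold_s

def ch_timing_loop(poll_s=0.1):
    # Runs in a daemon thread: every 100 ms, revert expired channels.
    while True:
        now = time.time()
        with lock:
            for ch in list(deadlines):
                if now >= deadlines[ch]:
                    channels[ch] = NEUTRAL
                    del deadlines[ch]
        time.sleep(poll_s)

def start_timing_thread():
    t = threading.Thread(target=ch_timing_loop, daemon=True)
    t.start()
    return t
```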
5.3 Module Three: sbus_out.py (Low-Level Serial Output)
This module is responsible for encoding the channel values set by the upper layer using the SBUS protocol, and sending them to the robot’s control board through the serial port /dev/ttyS1 at a frequency of 42 Hz.
Core Functions:

Protocol Description:
▪️Serial Port: 100000 bps, EVEN parity, 2 stop bits.
▪️Frame Rate: 42 Hz (control commands updated 42 times per second).
▪️Channel Initialization Values: [333, 333, ...], with the control channels' neutral value set to 1000.
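An encoder matching this description might look as follows. The bit packing follows the widely used SBUS frame layout (25 bytes: 0x0F header, 22 data bytes holding 16 × 11-bit channels LSB-first, a flags byte, and a 0x00 footer); the project's own encode_sbus() may differ in detail, and open_sbus_port() only runs on the target board with pyserial installed:

```python
def encode_sbus(channels, flags=0x00):
    # Pack 16 channels (11 bits each) into a 25-byte SBUS frame.
    frame = bytearray(25)
    frame[0] = 0x0F                    # SBUS start byte
    bits, nbits, i = 0, 0, 1
    for ch in channels[:16]:
        bits |= (ch & 0x7FF) << nbits  # append 11 bits, LSB-first
        nbits += 11
        while nbits >= 8:              # flush whole bytes
            frame[i] = bits & 0xFF
            bits >>= 8
            nbits -= 8
            i += 1
    frame[23] = flags
    frame[24] = 0x00                   # footer
    return bytes(frame)

def open_sbus_port():
    # Matches the serial settings described above; requires pyserial.
    import serial
    return serial.Serial("/dev/ttyS1", baudrate=100000,
                         parity=serial.PARITY_EVEN,
                         stopbits=serial.STOPBITS_TWO)

# Send loop at ~42 Hz:
#   port.write(encode_sbus(channels)); time.sleep(1 / 42)
```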
6. Frequently Asked Questions and Optimization Suggestions
In the process of developing and running a voice-controlled robot system, you may encounter some technical issues or performance bottlenecks. This chapter organizes common fault scenarios and response strategies, while also proposing some advanced optimization directions worth exploring to help you improve the system’s stability and interaction quality.
6.1 Connection Failure or Timeout
Problem Manifestations:
▪️The program gets stuck in the WebSocket connection stage.
▪️Errors such as "handshake failed" or "Temporary failure in name resolution."
▪️The server does not respond, resulting in a connection timeout.
Cause Analysis:
▪️Network environment restrictions prevent access to the OpenAI API.
▪️Using an IPv6 network while the server does not support it.
▪️Incorrect or missing API key configuration.
▪️Forgetting to set the Authorization field in the request header.
Resolution Suggestions:
▪️Confirm that the OPENAI_API_KEY in the .env file is configured correctly, with the format starting with sk-.
▪️Check whether a proxy or VPN is being used, and set the http_proxy/socks_proxy environment variables if necessary.
▪️Use create_connection_with_ipv4() to force WebSocket to use IPv4 connections.
▪️Implement a retry mechanism to handle occasional network failures.
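The last two suggestions can be combined in one helper. This is an illustrative reimplementation of the `create_connection_with_ipv4()` named above, not the project's exact code: it temporarily patches `socket.getaddrinfo` so DNS resolution returns only IPv4 addresses, then connects with websocket-client, retrying with simple exponential backoff:

```python
import socket
import time

def ipv4_only(addrinfo_results):
    # Keep only IPv4 (AF_INET) entries from a getaddrinfo result list.
    return [ai for ai in addrinfo_results if ai[0] == socket.AF_INET]

def create_connection_with_ipv4(url, headers=None, retries=3):
    # Force IPv4 resolution for the duration of the connection attempt.
    from websocket import create_connection  # needs websocket-client
    orig = socket.getaddrinfo
    socket.getaddrinfo = lambda *a, **kw: ipv4_only(orig(*a, **kw))
    try:
        for attempt in range(retries):
            try:
                return create_connection(url, header=headers or [])
            except OSError:
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # backoff: 1 s, 2 s, ...
    finally:
        socket.getaddrinfo = orig  # always restore the original resolver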
6.2 Audio Stuttering or Delay
Problem Manifestations:
▪️The playback of the voice response is intermittent.
▪️User speech is not fully recognized or gets cut off.
▪️Control actions respond slowly, or commands may be lost.
Cause Analysis:
▪️Queue congestion causes processing speed to lag.
▪️frames_per_buffer is too small or too large, leading to abnormal audio chunking.
▪️The playback buffer (audio_buffer) is not processed in a timely manner.
Resolution Suggestions:
▪️Set a reasonable CHUNK_SIZE, typically either 320 (20 ms) or 512.
▪️Ensure that the sending thread and playback thread are daemon threads (daemon=True) to avoid blocking.
▪️Use performance monitoring tools to observe thread states and identify bottlenecks promptly.
6.3 Inaccurate Action Execution
Problem Manifestations:
▪️The user says "move forward three steps," but the robot executes the wrong command.
▪️Semantic recognition failures or missing parameters lead to a NoneType error.
▪️The value in the function call is empty.
Cause Analysis:
▪️The prompt design is unclear, causing the model to misinterpret semantic intent.
▪️Lack of context or clear instruction formats.
▪️Missing default parameter values for functions lead to parsing errors when arguments are omitted.
Resolution Suggestions:
▪️Enhance the clarity of system_instruction, for example: "You are a robot control expert. Please generate action commands based on voice instructions, such as moving forward, backward, or turning left."
▪️Set default values in the .env file, like ADVANCE_DEFAULT_VALUE=2, to increase fault tolerance.
▪️Increase log output to print received parameter content for easier debugging.
▪️Consider designing function calls as a structured schema to clarify the range of each field.
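The last two suggestions can be sketched together: a tighter tool schema that marks `value` as required and bounds its range, plus a fallback to the .env default when the model still omits it. The numeric limits here are illustrative, not values from the project:

```python
import os

def advance_schema():
    # Tool schema for "advance" with a required, bounded value field,
    # so the model cannot return an empty or out-of-range argument.
    return {
        "type": "function",
        "name": "advance",
        "description": "Move the robot forward by a number of steps.",
        "parameters": {
            "type": "object",
            "properties": {
                "value": {"type": "number", "minimum": 1, "maximum": 10,
                          "description": "steps forward"},
            },
            "required": ["value"],
        },
    }

def value_or_default(args):
    # Fall back to the .env default (ADVANCE_DEFAULT_VALUE) when the
    # model omits "value", instead of crashing on None.
    return args.get("value", float(os.getenv("ADVANCE_DEFAULT_VALUE", "2")))
```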
A high-quality system prompt typically consists of the following five parts:

7. Conclusion
Through this project, we have built a real-time voice-controlled robot system based on the OpenAI Realtime API from scratch. The system not only receives user voice commands and performs semantic recognition in real time but also triggers structured function calls to drive local action control logic, ultimately forming a complete closed loop of "Speak → Understand → Execute → Feedback." Real-time voice interaction is a significant direction in the field of human-computer interaction. By combining the OpenAI Realtime API with Function Calling, you have a highly promising development platform. We hope this tutorial not only helps you complete an interesting project but also inspires you to create more innovative applications based on voice and AI.
Now, pick up the microphone and let your robot "listen to you and act accordingly."
📂 GitHub Repository (Open Source): https://github.com/fuwei007/Navbot-ES02/tree/main/src/RDK_X5
📽️ Video Demonstration: https://www.youtube.com/watch?v=YUhkF7lPQ0k&t=80s









