IMTalker: Efficient Audio-Driven Talking Face Generation
1. Real-time / High-speed Output (Inference Speed)
▪️ The IMTalker paper explicitly states that it reaches 40 FPS in video-driven mode (reenactment from a driving video) and 42 FPS in audio-driven mode (audio-to-video), provided an RTX 4090 GPU is used.
▪️ This means that, given a sufficiently powerful GPU, IMTalker is technically capable of real-time or near real-time output (over 40 frames per second, comfortably above typical playback frame rates of 24–30 FPS).
In the IMTalker Hugging Face Space by chenxie95, I tested a 10-second audio clip, and the total processing time was also about 10 seconds. Given this, although the official project does not provide a dedicated “streaming real-time inference” script, splitting long audio into ~10-second segments and processing them sequentially should still achieve roughly real-time output overall.
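Below is a minimal sketch of that chunking idea, assuming a wrapper function `generate_talking_face` around whatever inference entry point the IMTalker repo or Space exposes; the function name and signature are hypothetical, not the project’s actual API.

```python
# Sketch: approximate near real-time output by rendering long audio in ~10 s chunks.
# `generate_talking_face` is a hypothetical wrapper around the IMTalker inference call.
import numpy as np
import soundfile as sf

CHUNK_SECONDS = 10

def generate_talking_face(reference_image: str, audio: np.ndarray, sr: int) -> str:
    """Placeholder: run IMTalker on one audio chunk and return the rendered clip's path."""
    raise NotImplementedError("replace with the actual IMTalker inference call")

def process_in_chunks(reference_image: str, audio_path: str) -> list[str]:
    audio, sr = sf.read(audio_path)              # load the full waveform
    step = CHUNK_SECONDS * sr                    # samples per ~10 s chunk
    clips = []
    for start in range(0, len(audio), step):
        chunk = audio[start:start + step]
        clips.append(generate_talking_face(reference_image, chunk, sr))
    return clips                                 # concatenate afterwards, e.g. with ffmpeg
```

Each ~10-second chunk finishes in roughly the time it takes to play, so the pipeline stays only one chunk behind the audio rather than waiting for the whole clip.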
2. Support for “Custom Character Training / New Identity / Custom Avatar / Character”
▪️ IMTalker is designed for the task: single image (a static face) + audio (or driving video) → talking face video.
▪️ In other words, it is inherently designed to generate a talking-face video for any given reference image, which in effect provides custom-character support (single-image identity).
▪️ The paper also highlights an identity-adaptive module, which maps motion latents into a personalized subspace, allowing the model to preserve the target identity even under cross-identity reenactment (audio/motion from one person, face from another); a rough illustration of this idea follows after this list.
▪️ Therefore, IMTalker does support arbitrary / custom identities (at least at the single-image → talking-video level). You can give it any clear, largely unoccluded human face photo, and it can generate a talking video without any per-identity training or large identity-specific dataset.
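To make the identity-adaptive idea above concrete, here is a rough illustrative sketch; it is not the module from the IMTalker paper, just one common way (FiLM-style conditioning) to re-express a shared motion latent for a specific identity.

```python
# Illustration only: a plausible shape for an "identity-adaptive" mapping, NOT the
# exact IMTalker module. It projects a generic motion latent into a subspace
# conditioned on an identity embedding extracted from the reference image.
import torch
import torch.nn as nn

class IdentityAdaptiveMapper(nn.Module):
    def __init__(self, motion_dim: int = 256, id_dim: int = 512):
        super().__init__()
        # The identity embedding predicts a per-identity scale and shift (FiLM-style).
        self.to_scale = nn.Linear(id_dim, motion_dim)
        self.to_shift = nn.Linear(id_dim, motion_dim)

    def forward(self, motion_latent: torch.Tensor, id_embedding: torch.Tensor) -> torch.Tensor:
        scale = self.to_scale(id_embedding)
        shift = self.to_shift(id_embedding)
        # Same motion, re-expressed for the target identity: this is what lets
        # cross-identity reenactment keep the reference person's appearance.
        return motion_latent * (1 + scale) + shift
```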
3. Limitations / Implicit Assumptions
▪️ Although arbitrary faces are supported, the reported results rely on a high-end GPU (RTX 4090). On consumer-grade GPUs, inference speed will fall below the reported figures, and overall quality (identity preservation, motion fidelity, lip-sync accuracy) may also fall short of the paper, a typical gap between a research setup and a production environment.
▪️ If the goal involves complex movements, extreme head-pose changes, expressive facial dynamics, lighting changes, or non-standard inputs (profile views, partial occlusions), single-image → video methods in general struggle; this is not unique to IMTalker. Although the paper claims improvements over prior work in motion accuracy, identity preservation, and audio-lip sync, real-world generalization under such extreme conditions still needs testing.
Summary (IMTalker): It supports custom characters (single-image identities), and on high-end GPUs it can achieve over 40 FPS, giving it real-time or near real-time output potential. However, on weaker GPUs or regular devices, both quality and speed may degrade, and this requires testing.
LatentSync
1. Real-time Output Capability / Inference Speed
▪️ This is LatentSync’s weak point, or at least its uncertain aspect: the paper does not report explicit FPS or real-time inference-speed numbers.
▪️ In practice, its pipeline of latent diffusion + per-frame U-Net denoising + image-to-image generation + decoding is inherently compute-heavy and high-latency. This makes LatentSync better suited to offline or batch-render scenarios, and far less likely to produce real-time (live, streaming) output the way IMTalker can. In the LatentSync Hugging Face Space by fffiloni, I tested a 20-second audio sample, and the processing time was significantly longer than 20 seconds (a rough cost estimate follows below).
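A back-of-envelope estimate shows why per-frame diffusion is hard to run in real time. Every number below (frame rate, denoising step count, per-step cost) is an assumption for illustration, not a measured LatentSync figure.

```python
# Rough latency estimate for per-frame diffusion rendering (all numbers are assumptions).
duration_s    = 20    # length of the test clip
fps           = 25    # target video frame rate
denoise_steps = 20    # assumed DDIM-style step count per frame
step_ms       = 30    # assumed per-step U-Net + decode cost on a mid-range GPU

frames  = duration_s * fps                              # 500 frames
total_s = frames * denoise_steps * step_ms / 1000       # ~300 s of compute
print(f"{frames} frames -> ~{total_s:.0f} s ({total_s / duration_s:.1f}x slower than real time)")
```

Even with these generous assumptions, the clip takes an order of magnitude longer than its own duration, which matches the Hugging Face Space observation above.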
2. Model Positioning / Core Design
▪️ LatentSync is an end-to-end lip-sync framework based on an audio-conditioned latent diffusion model. It does not rely on explicit motion representations such as 3D models, 2D landmarks, or optical flow.
▪️ The system generates talking-face video frames by combining audio + reference image / masked frame + diffusion + U-Net + cross-attention.
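The sketch below shows what such an audio-conditioned denoising loop typically looks like in a latent-diffusion lip-sync pipeline. `vae`, `unet`, `scheduler`, and `audio_encoder` are placeholders for illustration and do not correspond to LatentSync’s actual class or function names; the scheduler interface is simplified.

```python
# Conceptual sketch of generating one audio-conditioned, lip-synced frame.
# All module objects are placeholders; interfaces are simplified for illustration.
import torch

@torch.no_grad()
def denoise_frame(vae, unet, scheduler, audio_encoder,
                  ref_frame, masked_frame, audio_window, steps=20):
    ref_latent    = vae.encode(ref_frame)        # identity / appearance conditioning
    masked_latent = vae.encode(masked_frame)     # frame with the mouth region masked out
    audio_embed   = audio_encoder(audio_window)  # cross-attention context from audio

    latent = torch.randn_like(ref_latent)        # start from pure noise
    for t in scheduler.timesteps[:steps]:
        # Concatenate the noisy latent with the conditioning latents along channels.
        unet_in = torch.cat([latent, ref_latent, masked_latent], dim=1)
        noise_pred = unet(unet_in, t, context=audio_embed)   # audio enters via cross-attention
        latent = scheduler.step(noise_pred, t, latent)       # one reverse-diffusion update
    return vae.decode(latent)                    # decode the lip-synced frame
```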
3. Regarding “Custom Characters (Arbitrary Identity / New Character / Reference Image)”
▪️ LatentSync’s input design follows the pattern: reference image + masked frame/image + audio → output frame.
▪️ This means you can provide any face reference image (or the first frame) as the identity input. Therefore, by design, it supports arbitrary references / new characters.
▪️ Since it does not depend on predefined motion templates, 3D face models, or a fixed identity embedding space, it is theoretically generalizable to new faces/characters. Many blog posts also describe it as a one-stop lip-sync solution for arbitrary characters + audio → video.
Summary (LatentSync): LatentSync supports arbitrary references / custom characters, but it is not suitable for real-time or streaming output. It is better suited for offline / batch generation (pre-render) scenarios. While its GPU requirements are relatively modest (6–8 GB VRAM is enough to run), achieving high quality or high resolution may require stronger GPUs, more memory, and longer generation times.