How Soul AI's New Open-Source Model Solves the Real-Time Digital Human Problem

Soul AI Lab has released SoulX-LiveAct, an open-source model designed to generate stable digital human video in real time for extended periods. The model achieves 20 frames per second (FPS) streaming inference at 512x512 resolution on dual H100/H200 GPUs, with end-to-end latency of approximately 0.94 seconds. This breakthrough addresses a fundamental challenge that has plagued digital human technology: traditional AI video generation models struggle to maintain consistent performance when operating for minutes or hours, often suffering from identity drift, detail degradation, and frame flickering.

What Makes Long-Duration Digital Human Generation So Difficult?

Creating convincing digital humans that can stream continuously is far more complex than generating a single image or short video clip. When video generation extends beyond a few seconds, inference costs rise dramatically, and models begin to lose track of the character's identity and facial features. The problem compounds over time: as a digital human speaks for an hour-long podcast or conducts a customer service interaction, the accumulated computational burden and information drift make it nearly impossible to maintain visual consistency without prohibitive resource costs.

SoulX-LiveAct tackles this challenge through two core architectural innovations. The model uses autoregressive diffusion, a technique that generates video segment by segment while maintaining contextual continuity between chunks. Within each segment, the diffusion model handles fine-grained detail modeling, while condition information flows between chunks to preserve consistent motion and identity.
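The chunk-by-chunk flow can be sketched in a few lines. This is a structural illustration only, not the SoulX-LiveAct implementation: `denoise_chunk`, the chunk length, and the latent size are all hypothetical stand-ins.

```python
import numpy as np

CHUNK_FRAMES = 8   # hypothetical number of frames per segment
LATENT_DIM = 16    # hypothetical latent size per frame

def denoise_chunk(noise, context):
    """Stand-in for the per-chunk diffusion denoiser: here we simply
    blend the conditioning context into the noise so the demo runs."""
    return 0.9 * context + 0.1 * noise

def generate_stream(num_chunks, seed=0):
    rng = np.random.default_rng(seed)
    # Conditioning carried between chunks (identity / motion state).
    context = np.zeros((CHUNK_FRAMES, LATENT_DIM))
    video = []
    for _ in range(num_chunks):
        noise = rng.standard_normal((CHUNK_FRAMES, LATENT_DIM))
        chunk = denoise_chunk(noise, context)
        context = chunk  # condition the next segment on this one
        video.append(chunk)
    return np.concatenate(video, axis=0)

frames = generate_stream(num_chunks=5)
print(frames.shape)  # (40, 16): 5 chunks x 8 frames
```

In the real model, the denoiser would be a diffusion network and the carried context would encode identity and motion conditioning rather than the raw previous chunk.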

How Does SoulX-LiveAct Maintain Stability During Extended Generation?

  • Neighbor Forcing: This mechanism propagates latent information from adjacent frames within the same diffusion step, enabling the model to make predictions in a unified noise semantic space and reducing instability caused by distributional inconsistencies between training and inference.
  • ConvKV Memory: This structural innovation compresses historical information by transforming the traditionally linearly growing cache into a "short-term precise plus long-term compressed" format, where recent information retains high precision for local detail while older information is compressed through lightweight convolution.
  • RoPE Reset: This technique aligns position encodings to further mitigate positional drift during long-sequence generation, ensuring the model maintains spatial awareness throughout extended video sequences.
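To make the RoPE Reset idea concrete, here is a toy sketch in which position indices are reset at each chunk boundary, so the rotary encodings seen during a long stream stay within a bounded range. The exact reset scheme is not specified in the article; the chunk length and embedding size here are hypothetical.

```python
import numpy as np

CHUNK = 16  # hypothetical chunk length in frames

def rope_angles(positions, dim=8, base=10000.0):
    """Rotary position embedding angles for a vector of position indices."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # shape (len(positions), dim // 2)

def positions_with_reset(total_frames, chunk=CHUNK):
    """Reset the position index at every chunk boundary, so the absolute
    position the model sees never exceeds the chunk length."""
    return np.arange(total_frames) % chunk

pos = positions_with_reset(100)
angles = rope_angles(pos)
print(pos.max())     # 15: bounded even for an arbitrarily long stream
print(angles.shape)  # (100, 4)
```

Without the reset, position indices would grow without bound over an hour-long stream, pushing the encodings outside the range the model saw during training.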

The practical result is remarkable: through ConvKV Memory, historical information no longer grows linearly over time, keeping memory usage within a fixed range regardless of video length. This design ensures that computational and communication costs remain stable during prolonged operation, without significant increases as video duration extends.
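A toy illustration of this bounded-memory idea, assuming a sliding window of recent entries plus pooled summaries of older ones (average pooling stands in for the lightweight convolution, and all sizes are hypothetical):

```python
import numpy as np

RECENT = 8    # recent entries kept at full precision (hypothetical)
STRIDE = 4    # how many old entries are pooled into one summary slot
MAX_OLD = 16  # cap on compressed slots, keeping total memory bounded

class ConvKVMemory:
    """Toy 'short-term precise + long-term compressed' cache.
    Average pooling stands in for the lightweight convolution."""

    def __init__(self):
        self.recent = []      # full-precision entries
        self.compressed = []  # pooled summaries of older entries

    def append(self, kv):
        self.recent.append(kv)
        if len(self.recent) > RECENT + STRIDE:
            # Fold the oldest STRIDE entries into a single summary slot.
            old = np.stack(self.recent[:STRIDE])
            self.compressed.append(old.mean(axis=0))
            self.recent = self.recent[STRIDE:]
        if len(self.compressed) > MAX_OLD:
            self.compressed = self.compressed[-MAX_OLD:]

    def size(self):
        return len(self.recent) + len(self.compressed)

mem = ConvKVMemory()
for t in range(1000):
    mem.append(np.full(16, float(t)))
print(mem.size())  # stays bounded no matter how many frames were appended
```

After 1,000 appended frames the cache holds at most RECENT + STRIDE recent entries plus MAX_OLD summaries, whereas a plain KV cache would hold all 1,000.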

What Real-World Performance Metrics Does SoulX-LiveAct Achieve?

The model demonstrates strong performance across multiple evaluation benchmarks. On the HDTF dataset, SoulX-LiveAct achieves a Sync-C score of 9.40 and a Sync-D of 6.76, reflecting lip-sync accuracy, along with distribution-similarity scores of 10.05 FID and 69.43 FVD. In VBench evaluations, the model scores 97.6 for Temporal Quality and 63.0 for Image Quality, and its VBench-2.0 Human Fidelity reaches 99.9, indicating near-perfect visual stability.

On the EMTD dataset, the model maintains leading performance with 8.61 Sync-C and 7.29 Sync-D, achieving 97.3 Temporal Quality and 65.7 Image Quality in VBench, with Human Fidelity at 98.9. These results demonstrate the model's strong capabilities in maintaining visual consistency and realistic motion throughout extended generation sessions.

The computational efficiency is equally important: the system operates at a computational cost of 27.2 TFLOPs per frame, demonstrating a well-balanced trade-off between real-time capability and resource efficiency. For context, this means the model can generate video frames at a rate that feels natural to viewers while remaining practical for deployment on enterprise-grade hardware.
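Those published figures can be sanity-checked with simple arithmetic: 27.2 TFLOPs per frame at 20 FPS implies roughly 544 TFLOP/s of sustained compute across the two GPUs, and the 20 FPS target leaves a 50 ms wall-clock budget per frame.

```python
# Back-of-envelope check of the published throughput figures.
tflops_per_frame = 27.2   # computational cost per frame, from the article
fps = 20                  # streaming rate, from the article

sustained_tflops = tflops_per_frame * fps   # required sustained TFLOP/s
frame_budget_ms = 1000 / fps                # wall-clock budget per frame

print(sustained_tflops)   # 544.0 TFLOP/s across the dual-GPU setup
print(frame_budget_ms)    # 50.0 ms per frame
```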

What Applications Could Benefit From This Technology?

SoulX-LiveAct is designed to support a wide range of scenarios requiring sustained online operation. These include digital human live streaming, AI-powered education platforms, smart service kiosks, and knowledge content production. In open-world interactive settings, digital characters must maintain consistent expressiveness throughout extended interactions, a capability supported by SoulX-LiveAct's strong results on full-body motion datasets and its real-time streaming inference.

The release of SoulX-LiveAct extends the Soul AI team's broader technical roadmap in real-time digital humans. Previously, the team open-sourced SoulX-FlashTalk and SoulX-FlashHead, exploring ultra-low latency and lightweight deployment, respectively. The team has also released models and modules for speech and interaction, including SoulX-Podcast, SoulX-Singer, and SoulX-Duplug, progressively building a multimodal technology ecosystem centered on real-time interaction.

By continuing to open-source models and technical solutions, the Soul AI team is not only driving the iteration of its own AI capabilities but also providing the developer community with reusable technical foundations. This approach fosters the exploration and deployment of more application scenarios across industries that depend on realistic, stable digital human interactions.