Ant Group's New 3D Mapping Model Brings Real-Time Spatial Understanding to Robots and AR Devices

Ant Group's Robbyant division has open-sourced LingBot-Map, a breakthrough streaming 3D reconstruction model that lets robots, autonomous vehicles, and augmented reality devices perceive their three-dimensional surroundings in real time using only a standard RGB camera. The model represents a significant leap forward in spatial understanding technology, achieving accuracy improvements that could reshape how embodied AI systems navigate and interact with the physical world.

How Does LingBot-Map Outperform Existing 3D Reconstruction Methods?

Unlike traditional 3D reconstruction approaches that process complete sets of images offline after recording, LingBot-Map operates on a "see-as-you-go" principle: it continuously estimates the camera's position and reconstructs the scene's 3D structure frame by frame as video is captured in real time. This streaming approach is critical for applications like robot navigation and obstacle avoidance, where waiting for offline processing would be impractical.
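To make the contrast with offline pipelines concrete, here is a minimal, hypothetical sketch of a streaming loop. None of these class or function names come from LingBot-Map's actual API; the point is only that each incoming frame updates both the pose estimate and the accumulated map before the next frame arrives, so the reconstruction is usable at any moment in the stream.

```python
# Toy sketch of "see-as-you-go" streaming reconstruction.
# StreamingReconstructor is a hypothetical stand-in, NOT LingBot-Map's API.

class StreamingReconstructor:
    """Tracks one estimated pose per frame and a running map size."""
    def __init__(self):
        self.poses = []      # one estimated camera pose per frame
        self.points = 0      # size of the accumulated point map

    def step(self, frame):
        # A real model would run pose estimation + geometry prediction here.
        pose = (len(self.poses), 0.0, 0.0)   # placeholder pose
        self.poses.append(pose)
        self.points += len(frame)            # pretend each pixel adds a point
        return pose, self.points

def process_stream(frames):
    recon = StreamingReconstructor()
    for frame in frames:                     # frames arrive one at a time
        pose, n_points = recon.step(frame)   # map is valid after every frame
    return recon

recon = process_stream([[0] * 4 for _ in range(3)])  # three tiny 4-pixel "frames"
print(len(recon.poses), recon.points)  # 3 12
```

An offline method, by contrast, would only return poses and geometry after the entire `frames` list had been collected.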

The performance improvements are substantial. On the Oxford Spires dataset, known for its large scale and challenging lighting conditions, LingBot-Map achieved an Absolute Trajectory Error of just 6.42 meters. This represents a roughly 2.8x reduction in trajectory error compared with the previous best streaming method, and it significantly outperforms offline methods such as DA3 (12.87 meters) and VIPE (10.52 meters). On the ETH3D benchmark, LingBot-Map achieved a reconstruction F1 score of 98.98, more than 21 percentage points higher than the second-place method.
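For readers unfamiliar with the trajectory metric: Absolute Trajectory Error compares estimated camera positions against ground truth along a recorded path. The sketch below shows the core computation; note that the standard benchmark protocol first rigidly aligns the two trajectories (e.g. Umeyama alignment), which this toy version skips, and the example numbers here are made up, not benchmark data.

```python
import numpy as np

def ate_rmse(estimated, ground_truth):
    """RMSE of per-frame position error (alignment step omitted)."""
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    errors = np.linalg.norm(est - gt, axis=1)    # per-frame Euclidean error
    return float(np.sqrt(np.mean(errors ** 2)))  # root-mean-square error

# Illustrative 3-frame trajectories (meters), not real benchmark data.
est = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [2.0, 0.2, 0.0]]
gt  = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
print(round(ate_rmse(est, gt), 3))  # 0.129
```

A lower ATE means the estimated camera path stays closer to the true one, which is why the drop to 6.42 meters on a large-scale outdoor dataset is notable.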

What Makes LingBot-Map Technically Different?

The core challenge in streaming 3D reconstruction lies in balancing geometric accuracy, temporal consistency, and computational efficiency. LingBot-Map addresses this through a novel pure auto-regressive modeling approach built on a Geometric Context Transformer. The model's key innovation is its Geometric Context Attention mechanism, which efficiently organizes and utilizes geometric information across frames, allowing the model to retain crucial historical context while minimizing redundant computation.
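The exact attention design is detailed in the technical report and is not reproduced here; the toy sketch below only illustrates the general pattern the description suggests: each new frame attends over a bounded cache of historical geometric tokens, so per-frame cost stays constant instead of growing with video length. All names and the eviction policy are illustrative assumptions.

```python
import numpy as np

def attend(query, keys, values):
    """Scaled dot-product attention of one query over cached tokens."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values          # weighted sum of cached values

class GeometricContextCache:
    """Fixed-capacity cache of per-frame tokens (illustrative only)."""
    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def add(self, key, value):
        # Keep only the most recent `capacity` tokens (toy eviction rule).
        self.keys = np.vstack([self.keys, key])[-self.capacity:]
        self.values = np.vstack([self.values, value])[-self.capacity:]

rng = np.random.default_rng(0)
cache = GeometricContextCache(capacity=8, dim=4)
for t in range(100):                 # 100 frames; the cache never grows past 8
    token = rng.normal(size=4)
    if len(cache.keys):
        context = attend(token, cache.keys, cache.values)
    cache.add(token, token)
print(cache.keys.shape)  # (8, 4)
```

Because the cache is bounded, attention cost per frame does not depend on how many frames have already been processed, which is the property that makes very long streams tractable.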

Beyond precision, LingBot-Map delivers both real-time performance and long-term stability: it runs at an inference speed of approximately 20 frames per second and supports continuous inference on long video sequences exceeding 10,000 frames with almost unchanged accuracy. This capability is fundamental for applications requiring continuous, online spatial awareness.

Steps to Access and Implement LingBot-Map

  • Access the Model: LingBot-Map is available on Hugging Face at huggingface.co/robbyant/lingbot-map, making it freely accessible to researchers and developers worldwide.
  • Review Technical Documentation: The technical report is available on arXiv at arxiv.org/abs/2604.14141, providing detailed information about the model's architecture and performance metrics.
  • Explore Code and Demos: The complete code and interactive demonstrations are available on GitHub at github.com/Robbyant/lingbot-map, allowing developers to experiment with the model before integration.

The launch of LingBot-Map marks a significant step in Robbyant's mission to build a comprehensive intelligent foundation for embodied AI. It follows the recent open-sourcing of several other major models that form a complete technology stack for robotics and spatial AI applications.

What Other Models Support Robbyant's Embodied AI Vision?

LingBot-Map is part of a broader ecosystem of open-source models that Robbyant has released to advance embodied artificial intelligence. These complementary models address different aspects of robotic perception and control:

  • LingBot-Depth: A high-precision spatial perception model that provides detailed depth information for obstacle detection and scene understanding.
  • LingBot-VLA: A general-purpose Vision-Language-Action model that enables robots to understand visual scenes, process natural language instructions, and execute corresponding physical actions.
  • LingBot-World: A world model for environmental simulation that allows robots to predict how their actions will affect their surroundings.
  • LingBot-VA: An auto-regressive video-action model specifically designed for robot control and motion planning.

Together, these models create a unified foundation for embodied AI applications. By providing robust solutions for real-time spatial understanding and online 3D mapping, Robbyant has strengthened its technology stack for robotics, autonomous vehicles, and augmented reality systems. The open-source approach on Hugging Face democratizes access to these advanced capabilities, enabling researchers and developers to build next-generation intelligent devices without requiring proprietary tools or expensive licensing agreements.

Robbyant, the embodied intelligence company within Ant Group, is dedicated to advancing embodied intelligence through cutting-edge software and hardware technologies. The company independently develops foundational large models for embodied AI and actively explores next-generation intelligent devices, aiming to create robotic companions and caregivers that truly understand and enhance people's everyday lives across key use cases such as elderly care, medical assistance, and household tasks.