Self-Driving Cars Are Failing a New Reality Test, and Waymo's Dominance May Be Hiding a Bigger Problem

A new benchmark test is exposing a troubling gap in autonomous vehicle safety: self-driving cars trained on standard scenarios perform dramatically worse when faced with truly unexpected obstacles. Researchers at the University of Tübingen in Germany unveiled a testing framework called Fail2Drive that introduces out-of-distribution scenarios into autonomous vehicle simulations, revealing that even the most advanced models drop in performance by an average of 22.8 percent.
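A figure like the reported 22.8 percent can be read as an average relative decline in driving score between familiar and out-of-distribution runs. A minimal sketch of that calculation, using entirely hypothetical model names and scores (Fail2Drive's actual metrics and numbers are not reproduced here):

```python
# Hypothetical benchmark scores; not Fail2Drive's actual data.
# Each entry: (in-distribution score, out-of-distribution score)
scores = {
    "model_a": (85.0, 60.0),
    "model_b": (78.0, 65.0),
    "model_c": (90.0, 72.0),
}

def relative_drop(in_dist: float, ood: float) -> float:
    """Percent decline from the in-distribution score to the OOD score."""
    return 100.0 * (in_dist - ood) / in_dist

drops = [relative_drop(i, o) for i, o in scores.values()]
average_drop = sum(drops) / len(drops)
print(f"average drop: {average_drop:.1f}%")
```

The point of averaging relative rather than absolute drops is that it weights each model's decline against its own baseline, so a strong and a weak model can be compared on the same scale.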

Why Are Self-Driving Cars Struggling With Unexpected Obstacles?

The core issue isn't that autonomous vehicles can't handle elephants crossing city streets, though that's one of the test scenarios. Rather, the problem is far more fundamental: most self-driving car models are trained and evaluated on similar types of data, which means they may be memorizing patterns rather than truly learning to drive safely. Andreas Geiger, head of the Autonomous Vision Group at the University of Tübingen, explained the stakes clearly.

"There's a relatively quiet but serious problem in autonomous driving research: most models are trained and evaluated not on the same exact data, but on the same scenarios. What looks like strong benchmark performance may just be strong memorization," stated Andreas Geiger.


In Fail2Drive testing, autonomous vehicles encountered scenarios that ranged from absurd to genuinely dangerous. Simulated AVs mowed down elephants lumbering across city streets, crashed into playground slides sitting in the middle of roads, and slammed at full speed into fire trucks parked in traffic lanes. One test even featured a Looney Tunes-style painted wall designed to look like an open road, a visual trick that has confounded real-world self-driving cars.

What Does This Mean for Waymo and the Robotaxi Industry?

Waymo currently dominates the commercial autonomous vehicle space in the United States. The Alphabet Inc. subsidiary operates more than 800 autonomous vehicles across a 260-square-mile section of the San Francisco Bay Area alone, with hundreds more deployed in Phoenix, Los Angeles, Miami, Atlanta, and Austin. The company has submitted 449 incident reports to the National Highway Traffic Safety Administration (NHTSA), and Waymo claims its data indicates that autonomous vehicles are safer drivers than humans, with over 200 million autonomously driven miles logged.

However, the Fail2Drive research raises a critical question: are these impressive statistics a reflection of genuine safety, or are they partly the result of testing in familiar, well-mapped environments? Waymo's vehicles operate in controlled urban areas where the company has extensive mapping data and has tested repeatedly. The real-world scenarios that cause accidents, however, are often the unexpected ones.

Zoox, Amazon's autonomous vehicle subsidiary, has submitted far fewer incident reports to NHTSA, with 22 reports in the second half of 2025 compared to Waymo's 449. Zoox operates in fewer markets, covering only a few square miles in northeast San Francisco and a portion of Las Vegas. Tesla's Robotaxi service, meanwhile, operates in parts of Texas without human operators available to take over control.

How Are Autonomous Vehicle Companies Addressing These Robustness Concerns?

The challenge for robotaxi operators is that real-world driving involves countless edge cases that are difficult to predict or simulate. Both Waymo and Zoox rely on sensor suites that include multiple cameras, radar units, and lidar systems to detect objects and navigate their environments. Zoox, which developed a purpose-built autonomous vehicle rather than retrofitting existing cars, has invested heavily in maneuverability and redundancy. Each end of the Zoox vehicle houses its own 67-kilowatt-hour battery pack, 134-horsepower drive motor, and steering system, allowing the vehicle to continue operating even if one system fails.

Waymo has taken a different approach, partnering with manufacturers like Jaguar and the Chinese brand Zeekr to install its sensor packages on existing vehicles. The company operates a factory in Mesa, Arizona, where it customizes vehicles for autonomous operation, with capacity to produce at least 10,000 units annually. Zoox operates a similar manufacturing facility in Hayward, California.

The Fail2Drive benchmark suggests that companies need to expand their testing beyond familiar scenarios. Geiger's team designed the framework to introduce a wide range of out-of-distribution scenarios into CARLA, an open-source simulator widely used in autonomous driving research. The goal is to identify weaknesses before they cause real-world accidents.
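One way to picture such a framework is a test-suite sampler that deliberately mixes out-of-distribution scenarios in with standard ones. The sketch below is simulator-agnostic and uses hypothetical scenario names drawn from the examples in this article; it is not the actual Fail2Drive or CARLA API:

```python
import random

# Hypothetical scenario catalogues; the real Fail2Drive scenario set
# and its CARLA integration are not reproduced here.
STANDARD = ["pedestrian_crossing", "lead_vehicle_braking", "lane_merge"]
OUT_OF_DISTRIBUTION = [
    "elephant_crossing",
    "playground_slide_in_lane",
    "parked_fire_truck",
    "painted_wall_illusion",
]

def sample_scenarios(n: int, ood_fraction: float, seed: int = 0) -> list[str]:
    """Draw a test suite mixing familiar and out-of-distribution scenarios."""
    rng = random.Random(seed)  # fixed seed keeps test runs reproducible
    n_ood = round(n * ood_fraction)
    suite = [rng.choice(OUT_OF_DISTRIBUTION) for _ in range(n_ood)]
    suite += [rng.choice(STANDARD) for _ in range(n - n_ood)]
    rng.shuffle(suite)
    return suite

suite = sample_scenarios(n=10, ood_fraction=0.3)
print(suite)
```

Seeding the sampler matters in practice: a reproducible suite lets different models be scored on exactly the same mix of familiar and unfamiliar situations.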

Steps to Improve Autonomous Vehicle Safety Testing

  • Expand Scenario Diversity: Test autonomous vehicles on unexpected obstacles and edge cases that fall outside standard training data, such as unusual animals, painted road illusions, and parked emergency vehicles blocking traffic lanes.
  • Implement Independent Evaluation: Conduct testing by third parties rather than relying solely on data shared by autonomous vehicle companies, which may not fully represent real-world performance or incident frequency.
  • Validate Real-World Performance: Compare simulation results with actual performance metrics from deployed fleets, ensuring that high benchmark scores translate to genuine safety improvements on public roads.
  • Test Sensor Redundancy: Verify that autonomous vehicles can safely operate when individual sensors fail or provide conflicting information, particularly in low-light conditions where cameras are less reliable than radar or lidar.
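The sensor-redundancy point above can be sketched as a simple fusion rule: combine the range estimates from the surviving sensors and fall back to a safe stop when too few remain to cross-check one another. This is an illustrative toy, with hypothetical function names, not how Waymo or Zoox actually fuse sensor data:

```python
from statistics import median

def fused_obstacle_distance(camera, radar, lidar):
    """Fuse redundant range estimates (meters), ignoring failed sensors.

    Each argument is a distance reading or None if that sensor has
    failed or lost signal. Returns the median of the healthy readings,
    or None when fewer than two sensors remain to cross-check each
    other, signaling that the vehicle should execute a fail-safe stop.
    """
    readings = [r for r in (camera, radar, lidar) if r is not None]
    if len(readings) < 2:
        return None  # insufficient redundancy: hand off to fail-safe stop
    return median(readings)

print(fused_obstacle_distance(30.2, 29.8, 30.0))  # all sensors healthy
print(fused_obstacle_distance(None, 29.8, 30.0))  # camera failed in low light
print(fused_obstacle_distance(None, None, 30.0))  # only one sensor left
```

The median is a deliberate choice over the mean: a single wildly wrong reading (say, a camera fooled by a painted wall) cannot drag the fused estimate far when the other two sensors agree.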

The robotaxi industry is moving rapidly toward commercialization. Waymo is charging customers for rides in multiple cities, while Zoox holds an NHTSA waiver allowing it to operate demonstration vehicles without steering wheels or pedals, though it cannot yet charge for commercial rides.

Yet the Fail2Drive research underscores that speed to market may come at a cost. When autonomous vehicles encounter scenarios they haven't been explicitly trained on, their performance drops significantly. This suggests that the current generation of robotaxis may be optimized for the specific cities and conditions where they operate, rather than being genuinely robust to the chaos of real-world driving.

As these services expand to new cities and encounter new road conditions, the gap between benchmark performance and real-world safety could widen. The question facing regulators and the industry is whether current testing frameworks are sufficient to catch these vulnerabilities before they result in accidents. The Fail2Drive benchmark offers one answer: they're not.