Artificial intelligence weather models can generate global forecasts twice as fast as physics-based systems, but researchers have discovered a critical blind spot: these AI systems tend to underestimate the intensity and frequency of record-breaking heat, cold, and storms compared to traditional forecasting methods. As weather agencies worldwide begin adopting AI into their operational systems, scientists are calling for standardized testing protocols to ensure these faster forecasts don't sacrifice accuracy when it matters most.

Why Are Weather Agencies Switching to AI Forecasts?

For decades, meteorological agencies have relied on physics-based numerical weather prediction models that simulate the atmosphere by feeding global observational data into equations grounded in the fundamental laws of motion and thermodynamics. These systems work by solving complex physical equations step by step across millions of grid points, a process that takes considerable computing time.

AI weather models operate differently. Instead of solving equations, they use algorithms trained on historical weather data to map current conditions directly to a likely future state. Most of the heavy computing happens during the training phase, so generating an actual forecast mainly involves passing observational data through layers of simple arithmetic operations like multiplication and addition. Modern computers can perform these calculations rapidly.

The speed advantage is substantial: a 14-day global AI weather forecast can be produced two hours earlier than one generated by a physics-based system. For meteorologists coordinating evacuations ahead of hurricanes or cyclones, that two-hour margin could be crucial for saving lives. This efficiency advantage, combined with lower operational costs, has already prompted major agencies like the European Centre for Medium-Range Weather Forecasts in Reading, UK, to begin integrating AI into their operational forecasting systems.
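To make the "layers of simple arithmetic" point concrete, here is a toy sketch of what one inference step looks like. The weights, state values, and function name are purely illustrative assumptions, not any real model's parameters; production systems use millions of learned parameters, but the forward pass is the same kind of multiply-and-add.

```python
# Toy illustration: generating an AI forecast is mostly multiplication and
# addition. A single "layer" maps the current state vector to the next one.

def forecast_step(state, weights, bias):
    """Map the current atmospheric state vector to a predicted next state
    using one linear layer (weighted sums plus a bias term)."""
    return [
        sum(w * x for w, x in zip(row, state)) + b
        for row, b in zip(weights, bias)
    ]

# Hypothetical current conditions at three grid points (e.g., temperature anomalies).
state = [1.0, 0.5, -0.25]

# In a real model these weights are learned during training; here they are placeholders.
weights = [[0.9, 0.1, 0.0],
           [0.0, 0.8, 0.2],
           [0.1, 0.0, 0.9]]
bias = [0.0, 0.0, 0.0]

next_state = forecast_step(state, weights, bias)
```

Because the expensive work (learning the weights) happens once during training, each forecast afterwards is just this cheap arithmetic repeated across layers, which is where the speed advantage over step-by-step equation solving comes from.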
What's the Problem With AI Weather Predictions?

The catch is that scientists do not yet know how reliable AI-based predictions are when forecasting rare, extreme weather events. Physics-based forecasts should remain valid even as the climate changes because they are grounded in fundamental physical laws. AI systems, by contrast, are trained on historical data and could falter when confronted with events that differ radically from anything they have seen previously.

Research examining AI model performance on specific hazards reveals concerning patterns. Leading AI models forecast the tracks and, to some extent, the intensity of typical tropical cyclones well, but their skill drops significantly for storms with no precedent in the training set. For temperature extremes, some AI and hybrid models can broadly reproduce the frequency and spatial patterns of historical heatwaves and cold spells that occurred outside the period on which they were trained, though with regional biases. Most troublingly, AI systems tend to underestimate the intensity and frequency of record-breaking heat, cold, and wind events compared with leading physics-based models.

How Can Meteorologists Test AI Forecasts Reliably?

The meteorological community currently lacks an agreed-upon method for systematically evaluating how well AI forecasting systems perform compared with physics-based counterparts. National meteorological services around the world face a dilemma: AI systems are cheaper to run, but there is no standardized way to assess their reliability on extreme events.

To address this gap, researchers are proposing a new framework called the AI Retraining Without Iconic Events (AIRWIE) protocol. This approach would require the meteorological community to agree on which high-impact events constitute a rigorous benchmark for testing. The protocol would deliberately withhold a designated set of "iconic" extreme events from the training data, reserving them solely for testing purposes.
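The hold-out idea at the heart of the proposed protocol can be sketched in a few lines. The event records, the particular events named as "iconic," and the function name below are illustrative assumptions for this sketch, not details from the published proposal.

```python
# Sketch of the hold-out split in the spirit of the proposed AIRWIE protocol:
# designated "iconic" extreme events are removed from the training data and
# reserved exclusively for out-of-sample testing.

# Hypothetical benchmark set the community might agree on.
ICONIC_EVENTS = {"2003 European heatwave", "Typhoon Haiyan 2013"}

def split_training_data(events):
    """Withhold iconic extremes from training; reserve them for testing."""
    train = [e for e in events if e["name"] not in ICONIC_EVENTS]
    test = [e for e in events if e["name"] in ICONIC_EVENTS]
    return train, test

# Toy event catalogue; a real one would span decades of global records.
events = [
    {"name": "2003 European heatwave", "hazard": "heatwave"},
    {"name": "Typhoon Haiyan 2013", "hazard": "storm"},
    {"name": "ordinary frontal rain", "hazard": "rainfall"},
]

train, test = split_training_data(events)
```

The design point is that every model retrained under the protocol sees exactly the same reduced training set, so agencies can compare skill on the same withheld extremes rather than on benchmarks each group defines for itself.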
Before weather agencies adopt AI models operationally, the predictive skill of such models on a range of hazardous events must pass a defined minimum standard. These hazards would include:

- Heatwaves: Extreme temperature events that can cause widespread health emergencies and infrastructure failures
- Heavy Rainfall: Precipitation events that trigger flooding and landslides in vulnerable regions
- Major Storms: Tropical cyclones, hurricanes, and severe thunderstorms that pose direct threats to life and property

The AIRWIE protocol would ensure that any model is evaluated against the same out-of-sample extremes before being deployed operationally by a public forecasting agency. This consensus-driven, standardized evaluation approach addresses a fundamental problem: conclusions about AI performance in weather forecasting remain highly sensitive to how extremes are defined, which hazards are considered, and where the extreme events occur.

What Does This Mean for Weather Forecasting's Future?

The tension between speed and reliability reflects a broader challenge in meteorology. Improvements in weather forecasting rank among science's success stories of the twentieth century. Back in the 1970s, tropical cyclones killed tens of thousands or even hundreds of thousands of people, whereas today these storms rarely cause more than a few dozen deaths. This dramatic improvement came largely from the adoption of physics-based numerical weather prediction models that enabled timely evacuation and adequate preparation.

The arrival of AI weather models promises to accelerate forecasting further, but the scientific community is rightfully cautious about trading proven accuracy for speed without rigorous validation. The challenge is particularly acute because climate change is making extreme weather events more frequent and potentially more intense, meaning AI models will increasingly encounter conditions outside their training data.
"Establishing the accuracy and reliability of AI-based models is becoming more urgent because several agencies, including the European Centre for Medium-Range Weather Forecasts based in Reading, UK, have already begun integrating AI into their operational forecasting systems," the researchers noted in Nature.

The path forward requires the weather and climate community to set clear standards, starting with agreed data sets for testing out-of-sample extreme-event predictions objectively. Only with such standardized protocols can meteorologists confidently adopt AI forecasting while maintaining the life-saving accuracy that modern weather prediction has achieved over the past 50 years.