AWS's Custom Inference Chips Are Selling Out Faster Than They Can Build Them
AWS's custom-designed inference chips are becoming the hottest commodity in AI infrastructure, with demand so intense that major customers are attempting to lock up entire years of production capacity. In Amazon's 2025 shareholder letter, CEO Andy Jassy revealed that two large customers asked to purchase all available 2026 instance capacity for AWS's Graviton chip, its Arm-based general-purpose processor, a request the company declined because of competing customer needs.
Why Is Inference Hardware Suddenly So Valuable?
Inference, the process of running a trained AI model to generate outputs, has become the fastest-growing and most cost-sensitive workload in enterprise artificial intelligence (AI). Unlike training, which happens once per model, inference runs continuously as users interact with AI systems, so its economics directly affect a company's profitability at scale.
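To make the scale effect concrete, here is a back-of-the-envelope sketch in Python. The token volume and per-token prices are illustrative assumptions, not AWS figures; only the 30 percent price-performance gap comes from the claims discussed below.

```python
# Back-of-the-envelope inference economics. All numbers are
# illustrative assumptions, not published AWS pricing.

TOKENS_PER_DAY = 5_000_000_000   # assumed production traffic
COST_PER_M_TOKENS_GPU = 2.00     # assumed $ per 1M tokens on GPUs

# A 30% price-performance advantage reads as ~30% lower cost per token.
cost_per_m_tokens_custom = COST_PER_M_TOKENS_GPU * (1 - 0.30)

def annual_cost(cost_per_m_tokens: float) -> float:
    """Annual spend in dollars for a fixed daily token volume."""
    return TOKENS_PER_DAY / 1_000_000 * cost_per_m_tokens * 365

gpu = annual_cost(COST_PER_M_TOKENS_GPU)
custom = annual_cost(cost_per_m_tokens_custom)
print(f"GPU:    ${gpu:,.0f}/yr")
print(f"Custom: ${custom:,.0f}/yr (saves ${gpu - custom:,.0f}/yr)")
```

Because the same per-token gap is paid on every request, forever, even a modest percentage advantage compounds into a material line item at production volumes.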
AWS's Trainium chips are designed specifically to handle this workload efficiently. Trainium2, released in late 2024, offers roughly 30 percent better price-performance than comparable graphics processing units (GPUs), according to Jassy. The newer Trainium3, which has just begun shipping, delivers 30 to 40 percent better price-performance than Trainium2 and is already nearly fully subscribed. Even Trainium4, still about 18 months from broad availability, has had a significant portion of its capacity reserved.
"There's so much demand for our chips that it's quite possible we'll sell racks of them to third parties in the future," said Andy Jassy, CEO at Amazon Web Services.
This demand reflects a fundamental shift in how enterprises approach AI economics. Rather than simply shopping for compute capacity, companies are trying to lock up resources before competitors do, creating what analysts call a "strategic dependency" story.
How Is AWS Competing With NVIDIA in the Inference Market?
AWS isn't trying to displace NVIDIA outright; instead, it is reducing its dependence on the chip leader in areas where AWS can win on economics. AWS brings what analysts describe as a "holistic package": tight integration with Bedrock (AWS's generative AI service), AWS-designed interconnects, more efficient token economics, and a software stack built on standard PyTorch, JAX, and vLLM workflows.
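For a sense of what "standard workflows" means in practice, the sketch below uses vLLM's ordinary Python API. The model name is illustrative, and on Trainium instances this would run through the Neuron-enabled vLLM build rather than the stock GPU one; the application code itself stays the same.

```python
# Minimal vLLM serving sketch. This is the kind of off-the-shelf
# workflow AWS's pitch refers to; on Trainium it runs via the AWS
# Neuron SDK's vLLM integration. Model choice here is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Summarize the benefits of custom inference silicon."], params
)
for out in outputs:
    print(out.outputs[0].text)
```

The strategic point is portability: if the serving layer is plain vLLM or PyTorch, moving a workload between GPU and Trainium capacity is a deployment decision, not a rewrite.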
A particularly clever strategy involves partnering with Cerebras, a specialized AI chip company. Trainium is optimized for the "prefill" phase of inference, where the model processes the user's input, while Cerebras's CS-3 system is optimized for the "decode" phase, where the model generates output tokens one at a time. Together, they deliver what AWS claims is the best inference performance, with no user intervention required.
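A toy sketch of why the split makes sense: prefill handles the whole prompt in one parallel, compute-heavy pass, while decode emits one token per step and is dominated by memory traffic. The "model" below is a trivial stand-in written for illustration, not anything AWS or Cerebras ships.

```python
# Toy illustration of the two inference phases. ToyModel is a
# stand-in (it just does arithmetic on token ids); the point is the
# control flow that serving stacks can split across hardware.

class ToyModel:
    eos_token = 0

    def forward_all(self, prompt_tokens):
        # Prefill: one parallel pass over the entire prompt.
        # Compute-bound in real models (every token attends to all
        # earlier ones), which suits high-throughput matrix engines.
        kv_cache = list(prompt_tokens)         # stand-in for the KV cache
        first_token = sum(prompt_tokens) % 11  # stand-in for sampling
        return kv_cache, first_token

    def forward_one(self, token, kv_cache):
        # Decode: one token per step. Little compute per step, but the
        # whole cache is re-read each time, so memory bandwidth wins.
        kv_cache.append(token)
        next_token = (token + len(kv_cache)) % 11
        return next_token, kv_cache

model = ToyModel()
kv_cache, token = model.forward_all([3, 1, 4, 1, 5])  # prefill phase
generated = [token]
for _ in range(16):                                    # decode phase
    token, kv_cache = model.forward_one(token, kv_cache)
    generated.append(token)
    if token == model.eos_token:
        break
print(generated)
```

Routing each phase to hardware that matches its bottleneck, transparently to the caller, is the substance of the AWS-Cerebras claim.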
- Trainium2 Performance: Offers approximately 30 percent better price-performance than comparable GPUs and is largely sold out
- Trainium3 Availability: Delivers 30 to 40 percent better price-performance than Trainium2 and is already nearly fully subscribed despite just beginning to ship
- Trainium4 Demand: A significant portion of capacity for this chip, still 18 months from broad availability, has already been reserved by customers
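Taking those figures at face value, the generational gains compound. A quick sanity check, reading "X percent better price-performance" as X percent more work per dollar against a GPU baseline normalized to 1.0:

```python
# Compounding the price-performance figures quoted above.
gpu = 1.0
trainium2 = gpu * 1.30             # ~30% better than comparable GPUs
trainium3_low = trainium2 * 1.30   # low end of the 30-40% range
trainium3_high = trainium2 * 1.40  # high end of the range

for name, work_per_dollar in [("Trainium2", trainium2),
                              ("Trainium3 (low)", trainium3_low),
                              ("Trainium3 (high)", trainium3_high)]:
    # Equivalently: cost per unit of work relative to the GPU baseline.
    print(f"{name:17s} {work_per_dollar:.2f}x work/$ "
          f"-> {1 / work_per_dollar:.0%} of GPU cost per token")
```

On these stated numbers, Trainium3 would land at roughly 1.7x to 1.8x the work per dollar of a comparable GPU, which goes a long way toward explaining why capacity is being reserved 18 months out.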
High-profile customers such as Anthropic and Uber are testing AWS's efficiency claims in production environments. Others, including Cohere and Stability AI, continue to prefer NVIDIA's mature tooling and established ecosystem, citing concerns about AWS service and availability.
What Does This Shortage Mean for the Broader AI Industry?
The inference chip shortage reflects a deeper capacity crisis across AWS infrastructure. Jassy noted that AWS added 3.9 gigawatts of new power capacity in 2025 and expects to double its total power capacity by the end of 2027, yet "we still have capacity constraints that yield unserved demand." This suggests that even with massive infrastructure investments, demand is outpacing supply.
The risk for AWS isn't failing to build fast enough; it's that constrained customers might hedge toward competitors like Microsoft Azure or Google Cloud Platform (GCP) to secure alternative capacity. Microsoft's Azure Cobalt and Google's Axion processors, which use the Arm architecture rather than traditional x86, will likely see similar demand patterns as they mature, creating an "interesting market dynamic" between competing processor architectures.
"Two large customers asking to buy all of AWS's Graviton capacity for 2026 says everything we need to know about where the market is," noted Matt Kimball, VP and principal analyst at Moor Insights and Strategy.
Industry analysts emphasize that this is not merely a supply chain story. Scott Bickley, advisory fellow at Info-Tech Research Group, observed that "everything is sold out across the board," even amid reports that 50 percent of planned AI data center capacity will not materialize in 2026. In other words, demand destruction from price increases or market saturation hasn't yet occurred.
How AWS Built a Better Inference Engine in 76 Days
Beyond hardware, AWS is also innovating on the software side. When the Bedrock team realized their initial architecture couldn't handle the inference workload efficiently, they didn't settle for incremental improvements. Instead, a small team of six engineers used AWS's agentic coding service, Kiro, to build an entirely new inference engine, called Mantle, from scratch, completing the rebuild in just 76 days.
Mantle has since become the backbone of Bedrock, processing more tokens in the first quarter of 2026 than had been processed in all prior years combined. The engine includes features such as stateful conversation management, asynchronous inference, and higher default quotas. This rapid development demonstrates how AI-assisted development tools are changing what's possible in production environments.
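Mantle's internals aren't public, but multi-turn conversation handling surfaces through Bedrock's existing client interface. Here is a minimal sketch using boto3's Converse API, threading message history across turns; the model ID and region are illustrative assumptions, and nothing here should be read as Mantle's actual implementation.

```python
# Multi-turn conversation against Bedrock via boto3's Converse API.
# Mantle's internal state handling isn't public; this shows only the
# client-visible pattern of carrying history between calls. Model ID
# and region are illustrative.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # example model

messages = []

def ask(text: str) -> str:
    """Append a user turn, call the model, keep the reply in history."""
    messages.append({"role": "user", "content": [{"text": text}]})
    response = client.converse(modelId=MODEL_ID, messages=messages)
    reply = response["output"]["message"]
    messages.append(reply)  # retain the assistant turn for the next call
    return reply["content"][0]["text"]

print(ask("What is inference in machine learning?"))
print(ask("And why is it more cost-sensitive than training?"))
```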
"If six engineers with agentic tools can do what 40 couldn't have done faster, the calculus on team size, project timelines, and build-versus-buy decisions shifts fundamentally," explained Matt Kimball.
The Mantle story illustrates a broader trend: inference isn't just about hardware anymore. It's about the entire stack, from power infrastructure to custom silicon to optimized software. Companies that can control multiple layers of this stack, as AWS is attempting to do, gain significant competitive advantages in cost and performance.
As enterprise AI adoption accelerates, the inference chip market will likely remain supply-constrained throughout 2026 and beyond. The companies that can deliver both efficient hardware and optimized software will capture the most value in what is shaping up to be the fastest-growing segment of the AI infrastructure market.