The Indexing Bottleneck That's Slowing Down AI: How One Technique Cuts Processing Time in Half

A new optimization technique called IndexCache cuts up to 75% of wasted computation in advanced AI models, delivering 1.82x faster processing on long documents while maintaining reasoning accuracy. Researchers at Tsinghua University and Z.ai identified a hidden inefficiency in sparse attention models like DeepSeek and GLM that causes inference to slow dramatically as context length grows. By reusing token selection decisions across model layers instead of recalculating them repeatedly, IndexCache solves a bottleneck that has plagued long-context AI applications.

Why Do Long Documents Slow Down AI Models So Much?

Large language models (LLMs) work by calculating relationships between tokens, or small chunks of text. When processing a document, the model must determine which previous tokens matter most for predicting the next one. This process, called self-attention, normally requires the model to check every token against every other token, creating a computational cost that grows quadratically with document length: doubling the document roughly quadruples the work.
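The quadratic cost is easy to see in a minimal (illustrative, not production) implementation of self-attention: for n tokens, the intermediate score matrix has n x n entries, one per token pair.

```python
import numpy as np

def self_attention(q, k, v):
    """Naive self-attention sketch: every token attends to every other
    token, so the score matrix is n x n and cost grows quadratically."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all n tokens
    return weights @ v                               # (n, d) weighted mix

n, d = 6, 4
x = np.random.default_rng(0).normal(size=(n, d))
out = self_attention(x, x, x)
print(out.shape)  # (6, 4) -- but the intermediate score matrix was 6 x 6
```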

To solve this problem, DeepSeek introduced sparse attention, which lets the model focus only on the most relevant tokens instead of all of them. This cuts the main attention computation from quadratic to near-linear growth. However, researchers discovered a critical flaw: the system that decides which tokens are relevant, called the indexer, still operates at quadratic complexity at every single layer of the model. As context length increases, the time spent on indexing skyrockets, especially during the initial processing phase.
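A minimal sketch makes the asymmetry concrete: the indexer still scores every token pair (the quadratic part), even though the attention that follows only touches the top-k selected tokens per query. The function name and interface below are illustrative assumptions, not DeepSeek's actual code.

```python
import numpy as np

def sparse_attention_indices(q, k, top_k):
    """Illustrative indexer: score all previous tokens (still quadratic
    in context length), then keep only the top_k most relevant keys so
    the main attention step costs only top_k per query token."""
    scores = q @ k.T                               # (n, n): the quadratic bottleneck
    return np.argsort(scores, axis=-1)[:, -top_k:]  # top_k key indices per query

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 4))
idx = sparse_attention_indices(x, x, top_k=3)
print(idx.shape)  # (8, 3): each of 8 tokens attends to just 3 selected tokens
```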

What Makes IndexCache Different From Other Optimization Approaches?

The breakthrough came when researchers noticed something unexpected: the tokens selected as important by one layer remain remarkably stable in the next layer. Testing on DeepSeek sparse attention models revealed that adjacent layers share between 70% and 100% of their selected tokens. This cross-layer redundancy was the key to solving the indexing bottleneck.
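The overlap statistic is simple to compute: compare the sets of token indices two adjacent layers select and take the shared fraction. The index values below are made up for illustration.

```python
def index_overlap(indices_a, indices_b):
    """Fraction of selected token indices shared by two adjacent layers
    (the researchers report 70-100% overlap on DeepSeek sparse attention)."""
    a, b = set(indices_a), set(indices_b)
    return len(a & b) / len(a)

layer_5 = [2, 7, 11, 19, 23]   # illustrative top-k picks at one layer
layer_6 = [2, 7, 11, 19, 31]   # the next layer's picks
print(index_overlap(layer_5, layer_6))  # 0.8 -> 80% shared
```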

IndexCache exploits this redundancy by dividing model layers into two types. A small number of "full" layers actively calculate and cache which tokens matter most. The remaining "shared" layers skip this calculation entirely and reuse the cached decisions from the nearest full layer. During inference, the model simply checks the layer type and either computes fresh indices or copies cached ones.
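The dispatch logic described above can be sketched in a few lines. The function, layer layout, and string placeholders here are illustrative assumptions, not the paper's implementation: "full" layers pay for the indexer call, "shared" layers just copy the most recent cache.

```python
def run_layers(layer_types, compute_indices):
    """IndexCache-style dispatch sketch: 'full' layers compute and cache
    token indices; 'shared' layers reuse the latest full layer's cache
    instead of recomputing them."""
    cache = None
    per_layer = []
    for i, kind in enumerate(layer_types):
        if kind == "full":
            cache = compute_indices(i)   # the expensive indexer call
        per_layer.append(cache)          # shared layers reuse the cache as-is
    return per_layer

# Hypothetical 8-layer model with full layers at positions 0 and 4.
types = ["full", "shared", "shared", "shared", "full", "shared", "shared", "shared"]
indices = run_layers(types, compute_indices=lambda i: f"indices@layer{i}")
print(indices)  # layers 1-3 reuse layer 0's indices; layers 5-7 reuse layer 4's
```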

"IndexCache is not a traditional KV cache compression or sharing technique. It eliminates this redundancy by reusing indices across layers, thereby reducing computation rather than just memory footprint. It is complementary to existing approaches and can be combined with them," explained Yushi Bai, co-author of the paper.

Yushi Bai, Co-author, Tsinghua University and Z.ai

Unlike other optimization methods that shrink memory usage, IndexCache directly attacks the computation problem. This makes it complementary to existing techniques and allows teams to combine it with other efficiency improvements.

How to Implement IndexCache for Your AI Models

  • Training-Free Approach: For teams using existing DeepSeek or GLM models, a greedy algorithm automatically determines the optimal placement of full and shared layers by running a small calibration dataset through the model, requiring no retraining or weight updates.
  • Training-Aware Method: For organizations pre-training or heavily fine-tuning their own foundation models, a training-aware version optimizes network parameters during training using multi-layer distillation loss to ensure shared layers select consensus tokens relevant across all subsequent layers.
  • Domain-Specific Calibration: The quality of layer configuration depends on the calibration data used, so teams should use domain-specific data matching their actual use cases for optimal results.
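The training-free placement step can be sketched as a generic greedy search. The actual objective the researchers optimize on the calibration set is not specified here, so the `coverage` scorer below is a toy stand-in, and the function names and interface are assumptions for illustration only.

```python
def greedy_full_layer_placement(num_layers, budget, quality_of):
    """Greedy sketch of training-free calibration (assumed interface):
    start with every layer shared, then repeatedly promote to 'full' the
    layer whose promotion most improves a calibration score."""
    full = set()
    while len(full) < budget:
        best = max((l for l in range(num_layers) if l not in full),
                   key=lambda l: quality_of(full | {l}))
        full.add(best)
    return sorted(full)

def coverage(full, num_layers=12):
    """Toy calibration score: how many layers sit within one hop of a
    full layer (a real score would measure index quality on held-out data)."""
    return sum(any(abs(l - f) <= 1 for f in full) for l in range(num_layers))

# With a budget of 3 full layers out of 12, the greedy search spreads them out.
print(greedy_full_layer_placement(12, budget=3, quality_of=coverage))
```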

What Are the Real-World Performance Gains?

Testing on the 30-billion-parameter GLM-4.7 Flash model at 200,000 token context length showed dramatic improvements. The prefill stage, where the model first processes the input prompt, dropped from 19.5 seconds to 10.7 seconds, a 1.82x speedup. During the decoding phase, where the model generates its response, throughput increased from 58 tokens per second to 86 tokens per second, a 1.48x improvement. When servers handle multiple requests simultaneously, total decode throughput jumped by up to 51%.

For enterprise teams, these gains translate directly to cost savings. According to the research team, IndexCache provides at least a 20% reduction in deployment costs for long-context workloads like document analysis, retrieval-augmented generation (RAG), and agentic workflows. For shorter documents, benefits hover around 5%.

Remarkably, these efficiency gains did not compromise the model's reasoning abilities. When 75% of indexers were removed using the training-free approach, the 30-billion-parameter model matched the original baseline on long-context benchmarks, scoring 49.9 compared to the original 50.2. On the highly complex AIME 2025 math reasoning benchmark, the optimized model actually outperformed the original, scoring 92.6 compared to 91.0.

Preliminary experiments on the production-scale 744-billion-parameter GLM-5 model showed similar promise. Eliminating 75% of indexers with the training-free method yielded at least a 1.3x speedup on contexts exceeding 100,000 tokens while maintaining nearly identical quality on long-context tasks.

Which AI Models Can Use IndexCache Today?

IndexCache applies specifically to models using the DeepSeek Sparse Attention architecture, first introduced in DeepSeek-V3.2. This includes the latest DeepSeek models and the current GLM family of models. Teams using other model architectures would need different optimization approaches. The technique is particularly valuable for applications requiring extended context windows, such as processing large documents, running multi-step agentic workflows, or executing long chain-of-thought reasoning tasks.

The development of IndexCache represents a shift in how researchers approach AI efficiency. Rather than focusing solely on memory compression, the technique targets the computational bottleneck that slows inference on long documents. As enterprises increasingly deploy reasoning models for complex tasks, this kind of targeted optimization becomes essential for delivering fast, cost-effective service at scale.