MIT Researchers Crack the Code on Shrinking AI Models While They Train
Researchers at MIT have developed a new method that compresses artificial intelligence models while they're still learning, rather than waiting until training is complete. The technique, called CompreSSM, sidesteps the traditional trade-off between model size and performance by identifying and removing unnecessary components early in the training process. This approach could significantly reduce the computational cost and time required to develop powerful AI systems.
Why Does Model Compression During Training Matter?
Training large AI models is expensive in multiple ways. It demands enormous amounts of computing power, electricity, and time. Traditionally, engineers faced an uncomfortable choice: train a massive model and then shrink it down afterward, or train a small model from scratch and accept weaker performance. CompreSSM eliminates this dilemma by making compression decisions while the model is still learning.
The research team, which included scientists from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), the Max Planck Institute for Intelligent Systems, the European Laboratory for Learning and Intelligent Systems, ETH Zurich, and Liquid AI, focused on a family of AI architectures called state-space models. These models power applications ranging from language processing to audio generation and robotics.
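To ground what a state-space model is, here is a minimal sketch of the classical discrete linear state-space recurrence in NumPy. This is an illustrative toy, not code from the paper; the matrices and dimensions are random stand-ins.

```python
import numpy as np

# Minimal discrete linear state-space layer: the hidden state evolves as
# x_{t+1} = A x_t + B u_t and the output is y_t = C x_t. The state
# dimension n (here 8) is the quantity CompreSSM-style methods shrink.
rng = np.random.default_rng(0)
n, m, p, T = 8, 2, 2, 50                     # state/input/output dims, length

A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]  # stable dynamics
B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n))

def ssm_forward(u):
    """Run the recurrence over an input sequence u of shape (T, m)."""
    x = np.zeros(n)
    ys = []
    for u_t in u:
        ys.append(C @ x)                     # emit y_t = C x_t
        x = A @ x + B @ u_t                  # update the hidden state
    return np.array(ys)

u = rng.standard_normal((T, m))
y = ssm_forward(u)
print(y.shape)  # (50, 2)
```

Deep state-space architectures stack many such layers with learned, often much larger, state dimensions, which is where training-time compression pays off.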
"It's essentially a technique to make models grow smaller and faster as they are training. During learning, they're also getting rid of parts that are not useful to their development," said Makram Chahine, a PhD student in electrical engineering and computer science at CSAIL and lead author of the paper.
How Does CompreSSM Identify Which Parts of a Model to Remove?
The key insight behind CompreSSM is that the importance of different components within AI models stabilizes surprisingly early during training. The researchers borrowed mathematical tools from control theory to measure how much each internal part contributes to the model's overall behavior. Using a mathematical quantity called Hankel singular values, they can reliably rank which dimensions matter and which don't after only about 10 percent of the training process.
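Hankel singular values come from classical control theory. As an illustrative sketch (not the paper's implementation), here is how they can be computed for a small stable discrete linear system using SciPy's Lyapunov solver; the system matrices are random stand-ins.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# For a stable discrete LTI system (A, B, C), the controllability Gramian P
# solves A P A^T - P + B B^T = 0 and the observability Gramian Q solves
# A^T Q A - Q + C^T C = 0. The Hankel singular values are the square roots
# of the eigenvalues of P Q; larger values mark states worth keeping.
rng = np.random.default_rng(1)
n = 6
A = 0.8 * np.linalg.qr(rng.standard_normal((n, n)))[0]  # stable dynamics
B = rng.standard_normal((n, 2))
C = rng.standard_normal((2, n))

P = solve_discrete_lyapunov(A, B @ B.T)      # controllability Gramian
Q = solve_discrete_lyapunov(A.T, C.T @ C)    # observability Gramian
eigs = np.linalg.eigvals(P @ Q).real         # non-negative in exact arithmetic
hsv = np.sqrt(np.clip(np.sort(eigs)[::-1], 0.0, None))
print(hsv)  # descending importance ranking of the n state dimensions
```

Each value measures how strongly one internal direction couples inputs to outputs, which is why a small value flags a state that can be dropped with little effect on behavior.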
Once those rankings are established, the less-important components can be safely discarded, and the remaining 90 percent of training proceeds at the speed of a much smaller model. The researchers proved mathematically that the importance of individual model states changes smoothly during training, giving practitioners confidence that dimensions identified as unnecessary early on won't suddenly become critical later.
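The discard step can be pictured with textbook balanced truncation, which projects a linear system onto its most important Hankel directions. This is a classical control-theory sketch under simplifying assumptions (one small, stable LTI system), not the paper's training-time procedure.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, cholesky, svd

def balanced_truncation(A, B, C, k):
    """Project a stable discrete LTI system onto its k most important
    Hankel directions (square-root balanced truncation)."""
    P = solve_discrete_lyapunov(A, B @ B.T)      # controllability Gramian
    Q = solve_discrete_lyapunov(A.T, C.T @ C)    # observability Gramian
    S = cholesky(P, lower=True)                  # P = S S^T
    R = cholesky(Q, lower=True).T                # Q = R^T R
    U, s, Vt = svd(R @ S)                        # s = Hankel singular values
    T = S @ Vt.T[:, :k] / np.sqrt(s[:k])         # balancing transform, top k
    Ti = (U[:, :k].T @ R) / np.sqrt(s[:k])[:, None]
    return Ti @ A @ T, Ti @ B, C @ T             # reduced (A_r, B_r, C_r)

rng = np.random.default_rng(2)
n = 8
A = 0.7 * np.linalg.qr(rng.standard_normal((n, n)))[0]
B = rng.standard_normal((n, 2))
C = rng.standard_normal((2, n))

# keep only the 3 most important of the 8 states
Ar, Br, Cr = balanced_truncation(A, B, C, k=3)
print(Ar.shape)  # (3, 3)
```

After a cut like this, every subsequent training step operates on the smaller matrices, which is where the reported speedups come from.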
What Are CompreSSM's Practical Advantages?
- Training Speed: On image classification benchmarks, compressed models trained up to 1.5 times faster than their full-sized counterparts while maintaining nearly the same accuracy.
- Dramatic Compression: On Mamba, one of the most widely used state-space architectures, the method achieved approximately 4x training speedups, compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.
- Performance Preservation: A compressed model reduced to roughly a quarter of its original state dimension achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared to just 81.8 percent for a model trained at that smaller size from scratch.
- Cost Avoidance: Unlike conventional pruning methods that require training a full model first, or knowledge distillation that requires training two models sequentially, CompreSSM makes compression decisions mid-stream and avoids these computational costs entirely.
The results demonstrate a fundamental advantage over existing approaches. Conventional pruning methods train a full model and then strip away parameters after the fact, meaning you still pay the full computational cost of training the big model. Knowledge distillation, another popular technique, requires training a large "teacher" model to completion and then training a second, smaller "student" model on top of it, essentially doubling the training effort.
"What's exciting about this work is that it turns compression from an afterthought into part of the learning process itself. Instead of training a large model and then figuring out how to make it smaller, CompreSSM lets the model discover its own efficient structure as it learns. That's a fundamentally different way to think about building AI systems," explained Daniela Rus, MIT professor and director of CSAIL.
How Does CompreSSM Compare to Other Compression Methods?
The team benchmarked CompreSSM head-to-head against competing compression approaches. Compared to Hankel nuclear norm regularization, a recently proposed spectral technique for encouraging compact state-space models, CompreSSM was more than 40 times faster while also achieving higher accuracy. The regularization approach slowed training by roughly 16 times because it required expensive eigenvalue computations at every single gradient step, and even then, the resulting models underperformed.
Against knowledge distillation on CIFAR-10, a standard image classification benchmark, CompreSSM held a clear advantage for heavily compressed models. At smaller state dimensions, distilled models saw significant accuracy drops, while CompreSSM-compressed models maintained near-full performance. Because distillation requires a forward pass through both the teacher and student at every training step, even its smaller student models trained slower than the full-sized baseline.
The method also comes with a practical safety net. If a compression step causes an unexpected performance drop, practitioners can revert to a previously saved checkpoint. This gives engineers control over how much they're willing to sacrifice in terms of performance, rather than having to define a less-intuitive energy threshold.
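That checkpoint-and-revert safety net can be sketched generically. The names `compress` and `evaluate` below are hypothetical stand-ins for a real training stack, and the tolerance is illustrative.

```python
import copy

# Snapshot the model before each compression step and revert if validation
# accuracy drops by more than a tolerance. `compress` mutates the model in
# place; `evaluate` returns a validation score.
def compress_with_rollback(model, compress, evaluate, max_drop=0.01):
    baseline = evaluate(model)
    checkpoint = copy.deepcopy(model)        # saved state before compressing
    compress(model)
    if baseline - evaluate(model) > max_drop:
        return checkpoint, False             # drop too large: revert
    return model, True                       # compression accepted

# toy demo: a "model" whose accuracy dips only slightly when compressed
model = {"dim": 128, "acc": 0.90}
def compress(m): m["dim"] //= 2; m["acc"] -= 0.002
def evaluate(m): return m["acc"]
m, kept = compress_with_rollback(model, compress, evaluate)
print(m["dim"], kept)  # 64 True
```

The appeal of this interface is that the practitioner reasons about an accuracy budget they understand, rather than an abstract spectral energy threshold.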
What Are the Current Limitations of This Approach?
CompreSSM works best on models that exhibit a strong correlation between the internal state dimension and overall performance, a property that varies across tasks and architectures. The method is particularly effective on multi-input, multi-output (MIMO) models, where the relationship between state size and expressivity is strongest. For per-channel, single-input, single-output architectures, the gains are more modest, since those models are less sensitive to state dimension changes in the first place.
The theory applies most cleanly to linear time-invariant systems, although the team has developed extensions for the increasingly popular input-dependent, time-varying architectures. Because the family of state-space models extends to architectures like linear attention, a growing area of interest as an alternative to traditional transformers, the potential scope of application is broad.
"You get the performance of the larger model, because you capture most of the complex dynamics during the warm-up phase, then only keep the most-useful states. The model is still able to perform at a higher level than training a small model from the start," noted Chahine.
What's Next for This Research?
Chahine and his collaborators see the work as a stepping stone toward broader applications. The team has already demonstrated an extension to linear time-varying systems like Mamba, and future directions include pushing CompreSSM further into matrix-valued dynamical systems used in linear attention mechanisms, which would bring the technique closer to the transformer architectures that underpin most of today's largest AI systems.
The research represents a meaningful shift in how engineers think about building efficient AI systems. Rather than treating compression as a final polish applied after the heavy lifting is done, CompreSSM integrates efficiency into the learning process itself. As AI models continue to grow larger and more resource-intensive, techniques like this could play an important role in making advanced AI more accessible and sustainable to develop.