MIT's CompreSSM: How Compressing AI Models During Training Changes Everything
MIT CSAIL researchers have developed CompreSSM, a technique that compresses state-space models during training rather than after, achieving up to 4x speedups while maintaining near-full accuracy.
Training AI Just Got Dramatically Cheaper — And the Secret Was Hiding in Control Theory
Training a large AI model is one of the most resource-intensive endeavors in modern computing. Between the GPU hours, the energy bills, and the weeks (sometimes months) of waiting, the cost of building capable AI systems has become a defining challenge for the entire industry. But what if you could make a model smaller and faster while it's still learning — without sacrificing performance?
That's exactly what researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), in collaboration with the Max Planck Institute, ETH Zurich, and Liquid AI, have achieved with a new technique called CompreSSM. Published as a conference paper at ICLR 2026, the method fundamentally rethinks how we approach AI model compression — and the results are turning heads.
The Core Innovation: Compression Mid-Training
Traditionally, if you want a smaller, faster AI model, you have two options, and neither is ideal. Option one: train a massive model first, then compress it afterward through pruning or quantization — but you've already paid the full cost of training the big model. Option two: train a small model from scratch — but it typically underperforms its larger counterpart.
CompreSSM bypasses this trade-off entirely by compressing during the training process. The technique targets state-space models (SSMs), a family of AI architectures that power applications ranging from language processing to audio generation and robotics. Rather than waiting until training is complete, CompreSSM identifies which parts of the model are actually contributing to performance and surgically removes the dead weight early on.
"It's essentially a technique to make models grow smaller and faster as they are training," explains Makram Chahine, a PhD student at MIT CSAIL and lead author of the paper. "During learning, they're also getting rid of parts that are not useful to their development."
The Math Behind the Magic: Hankel Singular Values
The key insight is surprisingly elegant. The researchers discovered that the relative importance of different components within state-space models stabilizes remarkably early — after just about 10% of the training process. To measure this, they borrowed a mathematical tool from control theory called Hankel singular values, which quantify how much each internal state contributes to the model's overall behavior.
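For readers curious what this looks like concretely, here is a minimal sketch of computing Hankel singular values for a small discrete-time linear state-space model, using SciPy's Lyapunov solver. The toy matrices are purely illustrative (not from the paper): a state whose dynamics decay almost instantly contributes little to input-output behavior, and its Hankel singular value comes out correspondingly small.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hankel_singular_values(A, B, C):
    """Hankel singular values of a stable discrete-time LTI system
    x[k+1] = A x[k] + B u[k],  y[k] = C x[k]."""
    # Controllability Gramian P solves  P = A P A^T + B B^T
    P = solve_discrete_lyapunov(A, B @ B.T)
    # Observability Gramian Q solves   Q = A^T Q A + C^T C
    Q = solve_discrete_lyapunov(A.T, C.T @ C)
    # HSVs are the square roots of the eigenvalues of P @ Q
    eig = np.linalg.eigvals(P @ Q)
    return np.sort(np.sqrt(np.abs(eig)))[::-1]  # descending order

# Toy 3-state system; the third state (pole at 0.01) decays so fast
# that it barely affects the input-output map.
A = np.diag([0.95, 0.5, 0.01])
B = np.ones((3, 1))
C = np.ones((1, 3))
hsv = hankel_singular_values(A, B, C)
```

Ranking states by these values is exactly how classical balanced truncation decides what to discard; CompreSSM's contribution is showing the ranking is already trustworthy a tenth of the way into training.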
Using these values, the team can reliably rank which state dimensions matter and which are essentially noise. Once those rankings are established, the less-important components are safely discarded, and the remaining 90% of training proceeds at the speed of a much smaller model. The team backed this up mathematically using Weyl's theorem on eigenvalue perturbations, proving that the importance of individual model states changes smoothly as the weights are updated, so dimensions identified as negligible early on won't suddenly become critical later.
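The rank-then-truncate step can be sketched most simply for an SSM with a diagonal state matrix (the structure used by S4-style architectures). In that special case each state, considered in isolation, has Hankel singular value |b_i·c_i| / (1 − a_i²), so ranking and truncation reduce to a sort and a slice. The function and schedule below are an illustrative assumption, not the authors' released implementation.

```python
import numpy as np

def truncate_diagonal_ssm(A, B, C, keep):
    """Drop the least important states of a single-input single-output
    SSM with diagonal state matrix A.

    For a diagonal system, state i in isolation has Hankel singular
    value |b_i * c_i| / (1 - a_i**2), so truncation is a sort.
    (Illustrative sketch, not the paper's exact procedure.)
    """
    a = np.diag(A)
    hsv = np.abs(B.ravel() * C.ravel()) / (1.0 - a**2)
    idx = np.sort(np.argsort(hsv)[::-1][:keep])  # indices of top-`keep` states
    return A[np.ix_(idx, idx)], B[idx], C[:, idx]

# Hypothetical schedule: after ~10% of training steps at full size,
# truncate once, then train the smaller model for the remaining ~90%.
A = np.diag([0.90, 0.50, 0.001])   # third state decays almost instantly
B = np.ones((3, 1))
C = np.ones((1, 3))
A_r, B_r, C_r = truncate_diagonal_ssm(A, B, C, keep=2)
```

Because the surviving parameters are a subset of the originals, training simply continues on the reduced system with no retraining from scratch, which is where the speedup comes from.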
The Numbers Speak for Themselves
The benchmarks are compelling. On image classification tasks, compressed models maintained nearly the same accuracy as their full-sized counterparts while training up to 1.5 times faster. A model compressed to roughly a quarter of its original state dimension achieved 85.7% accuracy on CIFAR-10, compared to just 81.8% for a model trained at that smaller size from scratch.
But the most impressive result came on Mamba, one of the most widely used state-space architectures. CompreSSM achieved approximately 4x training speedups, shrinking the model's state dimension from 128 down to around 12 while maintaining competitive performance. Compared to Hankel nuclear norm regularization (a recently proposed spectral technique), CompreSSM was more than 40 times faster while also achieving higher accuracy.
The method also outperformed knowledge distillation on CIFAR-10 for heavily compressed models. Since distillation requires running both a teacher and student model at every training step, even the smaller student models trained slower than the full-sized baseline — a cost CompreSSM entirely avoids.
What This Means for the AI Industry
The implications extend well beyond academic benchmarks. Training costs are one of the biggest barriers to AI accessibility, particularly for smaller organizations and researchers. A 4x reduction in training time doesn't just save money — it democratizes who can build and experiment with powerful AI models.
"What's exciting about this work is that it turns compression from an afterthought into part of the learning process itself," says Daniela Rus, MIT professor and director of CSAIL. "Instead of training a large model and then figuring out how to make it smaller, CompreSSM lets the model discover its own efficient structure as it learns."
The team has already demonstrated an extension to Mamba's input-dependent, time-varying architectures, and future work aims to push CompreSSM into linear attention mechanisms — bringing it closer to the transformer architectures that underpin most of today's largest AI systems. If successful, this could eventually make training large language models significantly cheaper and faster.
For now, CompreSSM represents a proof that the best time to compress an AI model isn't after it's trained — it's while it's still learning. And that insight alone could reshape how we think about building the next generation of AI systems.
Sources: MIT CSAIL, MIT EECS, arXiv Paper