State-space language models such as Mamba match Transformer quality while offering linear-complexity inference, yet they still comprise billions of parameters that hinder deployment. Existing one-shot pruning methods are tailored to attention blocks and fail to account for the time-shared and discretized state-transition matrix at the heart of the selective state-space module (SSM). In this paper, we introduce SparseSSM, the first training-free pruning framework that extends classic optimal brain surgeon (OBS) pruning to state-space architectures. Our layer-wise algorithm (i) derives an approximate second-order saliency score that aggregates Hessian-trace information across time steps, (ii) incorporates a component sensitivity analysis to guide feed-forward network (FFN) pruning, which also sheds light on where redundancy resides in the Mamba architecture, and (iii) extends easily to semi-structured and structured sparsity. Empirically, we prune 50% of SSM weights without fine-tuning and observe no zero-shot accuracy loss, setting the current state of the art among pruning algorithms for Mamba-based LLMs.
SparseSSM: Efficient Selective Structured State Space Models Can Be Pruned in One-Shot [arXiv]
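For orientation, here is a minimal sketch of the kind of OBS-style, time-aggregated saliency scoring described in the abstract: per-time-step diagonal Hessian estimates are summed before scoring the time-shared parameters, and the lowest-saliency entries are masked. Variable names, shapes, and the exact aggregation are illustrative assumptions, not this repository's implementation.

```python
# Illustrative sketch only: OBS-style saliency for a time-shared SSM
# parameter matrix, with per-step Hessian diagonals aggregated over time.
import torch

def obs_saliency(weight, hessian_diag, eps=1e-8):
    """Classic OBS-style saliency under a diagonal Hessian approximation:
    w_i^2 / [H]_ii."""
    return weight.pow(2) / (hessian_diag + eps)

def ssm_saliency_across_time(A_log, per_step_hessian_diags):
    """Sum diagonal Hessian information over time steps before scoring
    the shared state-transition parameters (an assumed aggregation)."""
    agg = torch.zeros_like(A_log)
    for h_t in per_step_hessian_diags:   # one diagonal estimate per time step
        agg += h_t
    return obs_saliency(A_log, agg)

def prune_by_saliency(weight, saliency, sparsity=0.5):
    """Zero out the lowest-saliency fraction of entries (unstructured)."""
    k = int(weight.numel() * sparsity)
    thresh = saliency.flatten().kthvalue(k).values
    mask = (saliency > thresh).to(weight.dtype)
    return weight * mask
```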
```bash
git clone https://github.com/CFinTech/SparseSSM.git
cd SparseSSM
pip install -r requirements.txt
```
The calibration data can be downloaded here.
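As a rough idea of what the calibration step consumes, below is a minimal sketch of drawing WikiText-2 calibration sequences in the style of SparseGPT-derived code; the dataset config, sequence length, and checkpoint name are assumptions, so defer to the repository's own data utilities for the exact procedure.

```python
# Illustrative only: drawing nsamples random calibration sequences from
# WikiText-2.  Config name, seqlen, and checkpoint are assumptions.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_wikitext2_calibration(nsamples=64, seqlen=2048, model="state-spaces/mamba-130m-hf"):
    tokenizer = AutoTokenizer.from_pretrained(model)
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    enc = tokenizer("\n\n".join(data["text"]), return_tensors="pt")
    samples = []
    for _ in range(nsamples):
        i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
        samples.append(enc.input_ids[:, i : i + seqlen])
    return samples
```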
To prune the SSM module:
```bash
CUDA_VISIBLE_DEVICES=${your_gpu_id} python main.py \
    path/to/your/model wikitext2 \
    --experiment_name your_experiment_name \
    --method "sparsessm" \
    --save path/to/pruned_model \
    --sparsity 0.5 \
    --nsamples 64 \
    --minlayer 0 \
    --maxlayer 100 \
    --prune_A True \
    --log_wandb
```
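The --prune_A flag targets the time-shared state-transition parameter of the selective SSM. For orientation, here is a minimal mamba-minimal-style selective scan showing that the same A_log enters the discretization at every time step, which is why the saliency score aggregates Hessian information across steps; shapes and names are illustrative, and the skip connection and gating are omitted.

```python
# Minimal selective-scan sketch (in the spirit of mamba-minimal), for
# illustration only: the same A_log parameter is reused ("time-shared")
# in the discretization at every step.  All shapes/names are assumptions.
import torch

def selective_scan(x, delta, A_log, B, C):
    """x, delta: (batch, seqlen, d_inner);
    A_log: (d_inner, d_state), shared across all time steps;
    B, C: (batch, seqlen, d_state)."""
    A = -torch.exp(A_log)                                   # (d_inner, d_state)
    batch, seqlen, d_inner = x.shape
    d_state = A_log.shape[1]
    h = torch.zeros(batch, d_inner, d_state, device=x.device, dtype=x.dtype)
    ys = []
    for t in range(seqlen):
        # zero-order-hold discretization: the shared A enters every step
        dA = torch.exp(delta[:, t, :, None] * A)            # (batch, d_inner, d_state)
        dB = delta[:, t, :, None] * B[:, t, None, :]        # (batch, d_inner, d_state)
        h = dA * h + dB * x[:, t, :, None]                  # recurrent state update
        ys.append((h * C[:, t, None, :]).sum(-1))           # (batch, d_inner)
    return torch.stack(ys, dim=1)                           # (batch, seqlen, d_inner)
```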
To prune the FFN components:
```bash
CUDA_VISIBLE_DEVICES=${your_gpu_id} python main.py \
    path/to/your/model wikitext2 \
    --experiment_name your_experiment_name \
    --method "sparsessm" \
    --save path/to/pruned_model \
    --sparsity 0.5 \
    --nsamples 64 \
    --minlayer 0 \
    --maxlayer 100 \
    --blocksize 128 \
    --target_modules "nn.Conv1d" "nn.Linear" \
    --prune_layer True \
    --alpha 0.04 \
    --log_wandb
```
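After either run, a quick sanity check is to count exact zeros in the saved weights. The snippet below is a minimal sketch that assumes the pruned model was written as a plain PyTorch state dict under the --save path; adapt the file name and loading to the checkpoint format this repository actually produces.

```python
# Illustrative sanity check: report the fraction of exactly-zero entries
# per floating-point tensor.  The checkpoint file name is an assumption.
import torch

state_dict = torch.load("path/to/pruned_model/pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    if torch.is_tensor(tensor) and tensor.is_floating_point():
        zeros = (tensor == 0).float().mean().item()
        print(f"{name}: {zeros:.2%} zeros")
```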
Figure: Illustration of SparseSSM. The first row depicts the evolution of the diagonal parameter matrix …

Figure: Performance analysis for one-shot unstructured pruning of SSM modules in Mamba models (130M …).

Figure: Performance analysis for one-shot unstructured pruning of the whole Mamba models (130M …).
- This source code is derived from the PyTorch implementation of SparseGPT and from mamba-minimal.
- We use the released Mamba checkpoints to test our method.
- The README file is inspired by LLM-Pruner.
If you find this work useful for your research, please consider citing our paper:
```bibtex
@article{tuo2025sparsessm,
  title={SparseSSM: Efficient Selective Structured State Space Models Can Be Pruned in One-Shot},
  author={Kaiwen Tuo and Huan Wang},
  journal={arXiv preprint arXiv:2506.09613},
  year={2025}
}
```