DiSPo: Diffusion-SSM based Policy Learning for Coarse-to-Fine Action Discretization

Korea Advanced Institute of Science and Technology
ICRA 2026

A diffusion-SSM based policy that learns from multi-granularity demonstrations and generates multi-granularity actions.

Abstract

We aim to solve the problem of learning user-intended granular skills from multi-granularity demonstrations. Traditional learning-from-demonstration methods typically rely on extensive fine-grained data, interpolation techniques, or dynamics models, which are ineffective at encoding or decoding the diverse granularities inherent in skills. To overcome this limitation, we introduce a novel diffusion-SSM based policy (DiSPo) that leverages a state-space model, Mamba, to learn from diverse coarse demonstrations and generate multi-scale actions. Our proposed step-scaling mechanism in Mamba is a key innovation, enabling memory-efficient learning, flexible granularity adjustment, and robust representation of multi-granularity data. DiSPo outperforms state-of-the-art baselines on coarse-to-fine benchmarks, achieving up to an 81% improvement in success rates while improving inference efficiency by generating inexpensive coarse motions where applicable. We validate DiSPo's scalability and effectiveness in real-world manipulation scenarios. Code and videos are available at https://robo-dispo.github.io.

Overview

dispo overview visualization
Overview of DiSPo: a diffusion-SSM based policy for coarse-to-fine imitation learning. Our proposed DiSPo framework generates fine-grained actions from coarse demonstrations, whereas the baseline method generates only rough actions. Leveraging the representational power of diffusion policies and the flexible discretization capabilities of SSMs, DiSPo learns from multi-granularity demonstrations and generates actions at user-intended granularities.

Method

Overall Architecture

dispo overall architecture visualization
Illustration of the DiSPo architecture. DiSPo takes a diffusion step $k$, step-scale factors $\mathbf{r}_t$, encoded observations $\mathbf{o}_{t-T_o+1:t}$, and noisy actions $\mathbf{a}^{(k)}_{t-T_o+1:t+T_a}$ as input. The model predicts the noise $\hat{\varepsilon}^{(k)}_{t-T_o+1:t+T_a}$ in the noisy actions through stacked DiSPo blocks and uses the predicted noise to generate the less noisy actions $\mathbf{a}^{(k-1)}_{t-T_o+1:t+T_a}$.
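For intuition, the sketch below shows how such a predicted noise could drive a standard DDPM-style reverse step to produce the less noisy actions. The function names, shapes, and schedule tensors (alphas, alphas_bar, betas) are our assumptions for illustration, not the released implementation:

import torch

def ddpm_step(a_k, eps_hat, k, alphas, alphas_bar, betas):
    """One DDPM reverse step a^{(k)} -> a^{(k-1)} from the predicted noise.

    Hypothetical sketch: alphas, alphas_bar, betas are the usual DDPM
    noise-schedule tensors indexed by diffusion step k.
    """
    mean = (a_k - betas[k] / torch.sqrt(1.0 - alphas_bar[k]) * eps_hat) \
           / torch.sqrt(alphas[k])
    noise = torch.randn_like(a_k) if k > 0 else torch.zeros_like(a_k)
    return mean + torch.sqrt(betas[k]) * noise

def generate_actions(model, obs, r, schedule, K, T_o, T_a, action_dim):
    """Sample actions a_{t-T_o+1:t+T_a} conditioned on encoded observations
    `obs` and step-scale factors `r` (assumed model signature)."""
    a = torch.randn(T_o + T_a, action_dim)       # a^{(K)} ~ N(0, I)
    for k in reversed(range(K)):                 # k = K-1, ..., 0
        eps_hat = model(a, k, obs, r)            # predicted noise
        a = ddpm_step(a, eps_hat, k, *schedule)  # less noisy actions a^{(k-1)}
    return a[-T_a:]                              # executable actions a_{t+1:t+T_a}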

Step-scaled SSM

step-scaled SSM visualization
A step-scaled SSM takes an input sequence $^{\ddagger}\mathbf{u}_t^{(i)}$ and step-scale factors $\mathbf{r}_t$ that scale $\Delta_t^{(i)}$, then discretizes the learned SSM parameters using the scaled step sizes.
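To make the step-scaling concrete, here is a minimal sequential-scan sketch of how a factor $\mathbf{r}_t$ can rescale the learned step sizes before zero-order-hold discretization, in the style of a Mamba-like selective SSM. All shapes and the simplified Euler rule for $\bar{B}$ are our assumptions; this is not the authors' implementation:

import torch

def step_scaled_ssm(u, delta, A, B, C, r):
    """Minimal sketch of a step-scaled selective SSM scan (assumed shapes).

    u:     (L, D)  input sequence
    delta: (L, D)  learned per-step sizes Delta_t
    A:     (D, N)  continuous state matrix (diagonal, typically negative)
    B:     (L, N)  input projection
    C:     (L, N)  output projection
    r:     (L, 1)  step-scale factors r_t, the coarse-to-fine knob
    """
    L, D = u.shape
    N = A.shape[1]
    # Scale learned step sizes by r_t, then discretize with zero-order
    # hold: A_bar = exp(r * Delta * A); B_bar uses a simplified Euler rule.
    delta_s = r * delta                              # (L, D)
    A_bar = torch.exp(delta_s.unsqueeze(-1) * A)     # (L, D, N)
    B_bar = delta_s.unsqueeze(-1) * B.unsqueeze(1)   # (L, D, N)

    h = torch.zeros(D, N)
    ys = []
    for t in range(L):                               # sequential scan for clarity
        h = A_bar[t] * h + B_bar[t] * u[t].unsqueeze(-1)
        ys.append((h * C[t]).sum(-1))                # y_t = C_t h_t
    return torch.stack(ys)                           # (L, D)

Larger step-scale factors coarsen the discretization (fewer, bigger steps), while smaller factors refine it, which is how a single learned SSM can serve multiple action granularities.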

Results

Qualitative results

Quantitative results

DiSPo result 0

Fig. 1. Comparison of task success rates [%] across four demonstration frequencies per simulated benchmark. We train each method on demonstrations at a source frequency (x-axis) and test with a 20 Hz target action frequency in new environments. Note that Tracker is a complexity indicator.

DiSPo result 1

Fig. 2. Task success rates [%] when training on a mixed-frequency (2.5 Hz and 5 Hz) dataset in the button touch task.

DiSPo result 2

Fig. 3. Comparison of DiSPo using fixed versus data-driven step-scaling factors from the predictor $\phi_r$. We normalize the number of action steps taken in successful cases by the maximum step limit allowed in the task.

Video Presentation

BibTeX

@inproceedings{oh2026dispo,
  title={DiSPo: Diffusion-SSM based Policy Learning for Coarse-to-Fine Action Discretization},
  author={Oh, Nayoung and Jang, Jaehyeong and Jung, Moonkyeong and Park, Daehyung},
  booktitle={Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026},
}