Project Page / Preprint 2026

MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction

1Institute for AI Industry Research, Tsinghua University 2Zhejiang University

* Equal contribution. † Corresponding author.

TL;DR: Training-free selective memory updates for long-horizon recurrent streaming 3D reconstruction.

Training-free · O(1) inference memory · Bottom-k routing · 15.3% average completeness-error reduction on 7-Scenes

Abstract

Reconstruction is a fundamental task in 3D vision and a core capability for spatial intelligence. In particular, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state into a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned patches while exactly preserving the others. This selective update mitigates catastrophic forgetting while retaining O(1) inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) on 300–500-frame streams from 7-Scenes.

Overview of MeMix

Overview figure of MeMix with a ViT encoder, dual-stream decoder, and Bottom-k state updates.
A ViT encoder encodes each frame into tokens $\mathbf{X}_t$, which interact with state tokens $\mathbf{S}_{t-1}$ through a dual-stream cross-attention decoder to produce predictions $\mathbf{Y}_t$ and a candidate state $\widehat{\mathbf{S}}_t$. MeMix computes dot-product scores between $\widehat{\mathbf{S}}_t$ and $\mathbf{X}_t$, selects the Bottom-k patches to construct a binary mask $\mathbf{M}_t$, and updates only those Bottom-k patches while preserving the rest. The decoded image tokens $\mathbf{Y}_t$ are fed to the prediction head for output.
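The selective update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the per-patch alignment score here (dot product of each candidate state patch with the mean frame token) is one plausible aggregation of the "dot scores between $\widehat{\mathbf{S}}_t$ and $\mathbf{X}_t$"; the function name and shapes are assumptions.

```python
import numpy as np

def memix_update(S_hat, S_prev, X, k):
    """MeMix-style training-free selective state update (illustrative sketch).

    S_hat  : (P, D) candidate state patches from the decoder
    S_prev : (P, D) previous recurrent state
    X      : (N, D) current-frame image tokens
    k      : number of least-aligned state patches to overwrite
    """
    # Alignment score per state patch: dot product with the mean frame token
    # (assumed aggregation; the paper specifies dot scores with X_t).
    scores = S_hat @ X.mean(axis=0)               # (P,)
    # Bottom-k routing: pick the k least-aligned patches.
    bottom = np.argsort(scores)[:k]
    M = np.zeros(S_hat.shape[0], dtype=bool)
    M[bottom] = True
    # Gated write: S_t = M ⊙ Ŝ_t + (1 − M) ⊙ S_{t−1};
    # unselected patches are preserved exactly.
    return np.where(M[:, None], S_hat, S_prev)
```

Because the mask is binary and only k patches are rewritten, the state size is fixed and the remaining patches pass through bit-exact, which is what keeps inference memory O(1) while limiting forgetting.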

Interactive Examples

Current Scene

Office Seq-07

CUT3R / CUT3R w/ MeMix · 300 views · 1200 pts/frame

w/o MeMix

w/ MeMix

Results are downsampled for efficient online rendering (each frame is capped at 1200 points), and camera motion is synced with roughly 100 ms of delay.

Experiments

All tables transcribed from the paper and supplementary material.

Table 1. Unified Memory Update Rules

Shared gate formulation for CUT3R, TTT3R, TTSA3R, and their MeMix variants.

Model | Memory Update Rule
Unified form | $S_t = G_t \odot \widehat{S}_t + (1 - G_t) \odot S_{t-1}$
CUT3R | $G_t = 1$
TTT3R / TTSA3R | $G_t = \beta_t$
CUT3R + MeMix | $G_t = M_t$
TTT3R / TTSA3R + MeMix | $G_t = M_t \odot \beta_t$
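The four rows of Table 1 are instances of the same gated write with different gates $G_t$. A small NumPy sketch, assuming illustrative shapes and a hand-built bottom-k mask (variable names are not from the paper):

```python
import numpy as np

def unified_update(S_hat, S_prev, G):
    """Unified gate form from Table 1: S_t = G ⊙ Ŝ_t + (1 − G) ⊙ S_{t−1}."""
    return G * S_hat + (1.0 - G) * S_prev

P, D = 4, 8
S_hat = np.random.rand(P, D)              # candidate state Ŝ_t
S_prev = np.random.rand(P, D)             # previous state S_{t−1}
beta = np.random.rand(P, 1)               # learned write rate (TTT3R / TTSA3R)
M = (np.arange(P) < 2).astype(float)[:, None]  # MeMix binary bottom-k mask M_t

S_cut3r = unified_update(S_hat, S_prev, 1.0)       # CUT3R: full overwrite
S_ttt3r = unified_update(S_hat, S_prev, beta)      # TTT3R / TTSA3R: soft gate
S_memix = unified_update(S_hat, S_prev, M)         # CUT3R + MeMix: hard mask
S_both  = unified_update(S_hat, S_prev, M * beta)  # TTT3R / TTSA3R + MeMix
```

The MeMix variants differ only in the gate: the binary mask $M_t$ forces unselected patches to be copied from $S_{t-1}$ unchanged, whereas $\beta_t$ alone blends every patch.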

BibTeX

@misc{dong2026memix,
  title  = {MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction},
  author = {Jiacheng Dong and Huan Li and Sicheng Zhou and Wenhao Hu and Weili Xu and Yan Wang},
  year   = {2026},
  note   = {Preprint}
}