MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction

1Institute for AI Industry Research, Tsinghua University 2Zhejiang University

* Equal contribution. † Corresponding author.

TL;DR: Training-free selective memory updates for long-horizon recurrent streaming 3D reconstruction.

Abstract

Reconstruction is a fundamental task in 3D vision and a core capability for spatial intelligence. In particular, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state as a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned patches while preserving the others exactly. This selective update mitigates catastrophic forgetting while retaining O(1) inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) on 300-500 frame streams on 7-Scenes.

Overview of MeMix

Overview figure of MeMix with a ViT encoder, dual-stream decoder, and Bottom-k state updates.
A ViT encoder encodes each frame into tokens $\mathbf{X}_t$, which interact with state tokens $\mathbf{S}_{t-1}$ through a dual-stream cross-attention decoder to produce predictions $\mathbf{Y}_t$ and a candidate state $\widehat{\mathbf{S}}_t$. MeMix computes dot scores between $\widehat{\mathbf{S}}_t$ and $\mathbf{X}_t$, selects the $\texttt{Bottom-k}$ patches to construct a binary mask $\mathbf{M}_t$, and updates only those patches while preserving the rest. The decoded image tokens $\mathbf{Y}_t$ are fed to the prediction head for the final output.
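The selective update above can be sketched in a few lines. This is an illustrative reading of the mechanism, not the authors' implementation: the function name `memix_update`, the patch layout `(P, D)`, and the aggregation of per-token dot scores by a mean are assumptions; only the Bottom-k selection, the binary mask, and the exact preservation of unselected patches come from the description.

```python
import numpy as np

def memix_update(S_prev, S_hat, X, k):
    """Bottom-k selective state update (illustrative sketch).

    S_prev: (P, D) previous state, one row per memory patch
    S_hat:  (P, D) candidate state produced by the decoder
    X:      (N, D) image tokens of the current frame
    k:      number of least-aligned patches to overwrite
    """
    # Dot scores between candidate state patches and image tokens,
    # aggregated to one alignment score per patch (mean is an assumption).
    patch_scores = (S_hat @ X.T).mean(axis=1)        # (P,)

    # Bottom-k: indices of the k least-aligned patches.
    bottom_k = np.argsort(patch_scores)[:k]

    # Binary mask M_t: True for patches to update, False to preserve exactly.
    M = np.zeros(S_prev.shape[0], dtype=bool)
    M[bottom_k] = True

    # Write only the selected patches; all other rows stay bit-exact,
    # which is what keeps inference memory O(1) and limits forgetting.
    S_new = np.where(M[:, None], S_hat, S_prev)
    return S_new, M
```

Because unselected rows are copied from `S_prev` unchanged, the state never grows and well-aligned memories are untouched by the write.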

Interactive Examples

Current Scene

Office Seq-07

CUT3R w/o MeMix / CUT3R w/ MeMix · 300 views

w/o MeMix


w/ MeMix


Results are downsampled for efficient online rendering (each frame is capped at 1200 points), and camera motion is synced with ~100 ms delay.

Experiments

MeMix delivers consistent gains across multiple recurrent streaming 3D reconstruction backbones.

Green cells indicate that w/ MeMix matches or outperforms the corresponding backbone. Please refer to the paper for the complete tables.

Table 2. Sparse 3D Reconstruction Results

Representative results on 7-Scenes-S, evaluated with 300 views.

Method      MeMix   Acc Mean ↓   Comp Mean ↓   NC Mean ↑
CUT3R [3]     ×       0.141        0.076         0.543
CUT3R [3]     ✓       0.106        0.053         0.550
TTT3R [11]    ×       0.040        0.024         0.567
TTT3R [11]    ✓       0.034        0.023         0.567
TTSA3R [15]   ×       0.036        0.035         0.566
TTSA3R [15]   ✓       0.026        0.021         0.568
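As a quick sanity check on the abstract's "up to 40.0%" figure, the relative reduction in Comp Mean implied by Table 2 can be computed directly from the numbers above (this reproduces the per-backbone reductions on 7-Scenes-S only, not the 15.3% average quoted across benchmarks):

```python
# Comp Mean (w/o MeMix, w/ MeMix) from Table 2, 7-Scenes-S at 300 views.
comp = {
    "CUT3R":  (0.076, 0.053),
    "TTT3R":  (0.024, 0.023),
    "TTSA3R": (0.035, 0.021),
}

# Relative completeness-error reduction, in percent.
reduction = {m: (base - memix) / base * 100 for m, (base, memix) in comp.items()}
for m, r in reduction.items():
    print(f"{m}: {r:.1f}% lower Comp Mean with MeMix")
# TTSA3R reaches 40.0%, matching the "up to 40.0%" claim in the abstract.
```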

BibTeX

@misc{dong2026memix,
  title  = {MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction},
  author = {Jiacheng Dong and Huan Li and Sicheng Zhou and Wenhao Hu and Weili Xu and Yan Wang},
  year   = {2026},
  note   = {Preprint}
}