Project Page / Preprint 2026

MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction

1Institute for AI Industry Research, Tsinghua University 2Zhejiang University

* Equal contribution. † Corresponding author.

TL;DR: Training-free selective memory updates for long-horizon recurrent streaming 3D reconstruction.

Training-free · O(1) inference memory · Bottom-k routing · 15.3% average completeness-error reduction on 7-Scenes

Abstract

Reconstruction is a fundamental task in 3D vision and a core capability for spatial intelligence. In particular, streaming 3D reconstruction is central to real-time spatial perception, yet existing recurrent online models often suffer progressive degradation on long sequences due to state drift and forgetting, motivating inference-time remedies. We present MeMix, a training-free, plug-and-play module that improves streaming reconstruction by recasting the recurrent state into a Memory Mixture. MeMix partitions the state into multiple independent memory patches and updates only the least-aligned patches while exactly preserving the others. This selective update mitigates catastrophic forgetting while retaining O(1) inference memory, and requires no fine-tuning or additional learnable parameters, making it directly applicable to existing recurrent reconstruction models. Across standard benchmarks (ScanNet, 7-Scenes, KITTI, etc.), under identical backbones and inference settings, MeMix reduces reconstruction completeness error by 15.3% on average (up to 40.0%) on 300–500-frame streams from 7-Scenes.

Overview of MeMix

Overview figure of MeMix with a ViT encoder, dual-stream decoder, and Bottom-k state updates.
A ViT encoder encodes each frame into tokens $\mathbf{X}_t$, which interact with state tokens $\mathbf{S}_{t-1}$ through a dual-stream cross-attention decoder to produce predictions $\mathbf{Y}_t$ and a candidate state $\widehat{\mathbf{S}}_t$. MeMix computes dot-product scores between $\widehat{\mathbf{S}}_t$ and $\mathbf{X}_t$, selects the Bottom-k patches to construct a binary mask $\mathbf{M}_t$, and updates only those Bottom-k patches while preserving the rest. The decoded image tokens $\mathbf{Y}_t$ are fed to the prediction head for output.
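The selective update above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the per-patch alignment score here (dot product of each candidate state patch with the mean frame token) is one plausible aggregation of the "dot scores between $\widehat{\mathbf{S}}_t$ and $\mathbf{X}_t$"; the function name and shapes are assumptions.

```python
import numpy as np

def memix_update(S_hat, S_prev, X, k):
    """MeMix-style training-free selective state update (illustrative sketch).

    S_hat  : (P, D) candidate state patches from the decoder
    S_prev : (P, D) previous recurrent state
    X      : (N, D) current-frame image tokens
    k      : number of least-aligned state patches to overwrite
    """
    # Alignment score per state patch: dot product with the mean frame token
    # (assumed aggregation; the paper specifies dot scores with X_t).
    scores = S_hat @ X.mean(axis=0)               # (P,)
    # Bottom-k routing: pick the k least-aligned patches.
    bottom = np.argsort(scores)[:k]
    M = np.zeros(S_hat.shape[0], dtype=bool)
    M[bottom] = True
    # Gated write: S_t = M ⊙ Ŝ_t + (1 − M) ⊙ S_{t−1};
    # unselected patches are preserved exactly.
    return np.where(M[:, None], S_hat, S_prev)
```

Because the mask is binary and only k patches are rewritten, the state size is fixed and the remaining patches pass through bit-exact, which is what keeps inference memory O(1) while limiting forgetting.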

Interactive Examples

Current Scene

Office Seq-07

CUT3R / CUT3R w/ MeMix · 300 views · 1200 pts/frame

w/o MeMix

w/ MeMix

Results are downsampled for efficient online rendering (each frame is capped at 1200 points), and camera motion is synced with roughly 100 ms of delay.

Experiments

All tables transcribed from the paper and supplementary material.

Table 1. Unified Memory Update Rules

Shared gate formulation for CUT3R, TTT3R, TTSA3R, and their MeMix variants.

Model | Memory Update Rule
Unified form | $S_t = G_t \odot \widehat{S}_t + (1 - G_t) \odot S_{t-1}$
CUT3R | $G_t = 1$
TTT3R / TTSA3R | $G_t = \beta_t$
CUT3R + MeMix | $G_t = M_t$
TTT3R / TTSA3R + MeMix | $G_t = M_t \odot \beta_t$
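The four rows of Table 1 are instances of the same gated write with different gates $G_t$. A small NumPy sketch, assuming illustrative shapes and a hand-built bottom-k mask (variable names are not from the paper):

```python
import numpy as np

def unified_update(S_hat, S_prev, G):
    """Unified gate form from Table 1: S_t = G ⊙ Ŝ_t + (1 − G) ⊙ S_{t−1}."""
    return G * S_hat + (1.0 - G) * S_prev

P, D = 4, 8
S_hat = np.random.rand(P, D)              # candidate state Ŝ_t
S_prev = np.random.rand(P, D)             # previous state S_{t−1}
beta = np.random.rand(P, 1)               # learned write rate (TTT3R / TTSA3R)
M = (np.arange(P) < 2).astype(float)[:, None]  # MeMix binary bottom-k mask M_t

S_cut3r = unified_update(S_hat, S_prev, 1.0)       # CUT3R: full overwrite
S_ttt3r = unified_update(S_hat, S_prev, beta)      # TTT3R / TTSA3R: soft gate
S_memix = unified_update(S_hat, S_prev, M)         # CUT3R + MeMix: hard mask
S_both  = unified_update(S_hat, S_prev, M * beta)  # TTT3R / TTSA3R + MeMix
```

The MeMix variants differ only in the gate: the binary mask $M_t$ forces unselected patches to be copied from $S_{t-1}$ unchanged, whereas $\beta_t$ alone blends every patch.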

BibTeX

@misc{dong2026memix,
  title  = {MeMix: Writing Less, Remembering More for Streaming 3D Reconstruction},
  author = {Jiacheng Dong and Huan Li and Sicheng Zhou and Wenhao Hu and Weili Xu and Yan Wang},
  year   = {2026},
  note   = {Preprint}
}