Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

Xingyu Chen¹, Yue Chen¹, Yuliang Xiu¹˒², Andreas Geiger³, Anpei Chen¹˒³
¹Westlake University; ²Max Planck Institute for Intelligent Systems; ³University of Tübingen, Tübingen AI Center

Comparison

Here, we visualize static scenes aligned across frames, together with the dynamic point cloud at a selected timestamp.
Instead of ground-truth (GT) dynamic masks, we use the estimated dynamic masks to filter out dynamic points from the other timestamps.
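
The filtering described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `compose_scene`, the array layout, and the parameter names `points`, `dynamic_masks`, and `t_sel` are all assumptions made for the example. It assumes the per-frame point maps are already aligned to a common world frame and that a boolean dynamic mask has been estimated for every frame.

```python
import numpy as np

def compose_scene(points, dynamic_masks, t_sel):
    """Merge per-frame point maps into one cloud for display at frame t_sel.

    Hypothetical helper for illustration only.
    points:        (T, H, W, 3) point maps, assumed already aligned across frames
    dynamic_masks: (T, H, W) booleans, True where a pixel is estimated as dynamic
    t_sel:         index of the timestamp whose dynamic content is displayed
    """
    points = np.asarray(points, dtype=float)
    dynamic_masks = np.asarray(dynamic_masks, dtype=bool)
    keep = ~dynamic_masks      # static pixels are kept in every frame
    keep[t_sel] = True         # the selected frame keeps all of its pixels
    return points[keep]        # (N, 3) merged point cloud
```

With inaccurate masks, dynamic points from other timestamps pass through this filter, which is one way to read the ghosting seen in some of the baseline results.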
Click tabs below to explore the results for each baseline.


Compared with MonST3R, our method achieves better structure alignment and fewer artifacts, owing to more robust dynamic segmentation.

MonST3R

Ours

Scenes: dog-gooses, schoolgirls, sheep, drift-chicane

Results are downsampled by a factor of 10 for efficient online rendering.


Our method provides clean reconstructions, while DAS3R suffers from structure misalignment and ghosting artifacts due to inaccuracies in dynamic segmentation estimation. For example, it under-segments the dog and goose.

DAS3R

Ours


CUT3R does not estimate dynamic masks, so points from different frames blend together when ground-truth masks are not used. Our approach also recovers camera poses more reliably.

CUT3R

Ours
