DiMeR: Disentangled Mesh Reconstruction Model

¹HKUST(GZ), ²HKUST, ³Noah's Ark Lab
*Equal Contribution, Corresponding Author
ljiang553@connect.hkust-gz.edu.cn, yingcongchen@ust.hk
DiMeR teaser figure

Abstract

We propose DiMeR, a novel geometry-texture disentangled feed-forward model with 3D supervision for sparse-view mesh reconstruction. Existing methods face two persistent obstacles: (i) textures can conceal geometric errors, i.e., visually plausible images can be rendered even from incorrect geometry, so similar objects admit multiple ambiguous optimization targets in the mixed geometry-texture solution space; and (ii) prevailing mesh extraction methods are redundant, unstable, and lack 3D supervision.

To address these challenges, we rethink the inductive bias for mesh reconstruction. First, we disentangle the unified geometry-texture solution space, in which a single input admits multiple feasible solutions, into separate geometry and texture spaces. Specifically, since normal maps are strictly consistent with geometry and accurately capture surface variations, they serve as the sole input for geometry prediction in DiMeR, while texture is estimated from RGB images.
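To make the disentanglement concrete, below is a minimal sketch of the two-branch interface, in which the geometry branch never sees RGB and the texture branch never sees normals. The module and argument names are illustrative assumptions, not DiMeR's actual architecture.

import torch
import torch.nn as nn

class DisentangledReconstructor(nn.Module):
    """Illustrative two-branch interface: geometry from normals, texture from RGB."""

    def __init__(self, geometry_branch: nn.Module, texture_branch: nn.Module):
        super().__init__()
        self.geometry_branch = geometry_branch  # consumes normal maps only
        self.texture_branch = texture_branch    # consumes RGB images only

    def forward(self, normal_maps: torch.Tensor, rgb_images: torch.Tensor):
        # normal_maps, rgb_images: (B, V, 3, H, W) sparse-view inputs
        geometry = self.geometry_branch(normal_maps)  # e.g., mesh or SDF parameters
        appearance = self.texture_branch(rgb_images)  # e.g., a texture field
        return geometry, appearance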

Second, we streamline the mesh extraction algorithm by eliminating modules with low performance-to-cost ratios and redesigning the regularization losses with 3D supervision. Notably, DiMeR still accepts raw RGB images as input by leveraging foundation models for normal prediction. Extensive experiments demonstrate that DiMeR generalizes across sparse-view, single-image, and text-to-3D tasks, consistently outperforming baselines. On the GSO and OmniObject3D datasets, DiMeR reduces Chamfer Distance by more than 30%.
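For reference, Chamfer Distance, the metric behind the reported improvement, measures the average nearest-neighbor distance between two point clouds sampled from the predicted and ground-truth surfaces. Below is a standard reference implementation in PyTorch; the paper's exact evaluation protocol (sample counts, normalization) may differ.

import torch

def chamfer_distance(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance between point clouds p1 (N, 3) and p2 (M, 3)."""
    dists = torch.cdist(p1, p2)  # (N, M) pairwise Euclidean distances
    # Average nearest-neighbor distance in both directions.
    return dists.min(dim=1).values.mean() + dists.min(dim=0).values.mean()

# Identical clouds give (near-)zero distance, up to floating-point error.
pts = torch.rand(1024, 3)
print(chamfer_distance(pts, pts).item())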

Method

DiMeR Framework

The core idea of DiMeR is to identify the inductive biases needed at each stage of 3D mesh reconstruction. Specifically, geometry reconstruction does not require texture information, as RGB textures often introduce ambiguity into important geometric cues. Leveraging the inductive bias that normal maps are inherently consistent with the underlying geometry, we learn geometry reconstruction solely from normal maps.
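As a rough sketch of training under this inductive bias, the predicted geometry can be rendered to normal and depth maps and supervised against ground truth. The loss terms and equal weighting below are illustrative assumptions; DiMeR's actual losses and regularizers are specified in the paper.

import torch
import torch.nn.functional as F

def geometry_supervision(pred_normals, gt_normals, pred_depth, gt_depth, mask):
    """Illustrative 3D-supervised geometry loss on rendered maps.

    pred_normals, gt_normals: (B, 3, H, W) unit-length normal maps
    pred_depth, gt_depth:     (B, H, W) depth maps
    mask:                     (B, H, W) boolean foreground mask
    """
    # Angular agreement between rendered and ground-truth normals.
    cos = F.cosine_similarity(pred_normals, gt_normals, dim=1)  # (B, H, W)
    normal_loss = (1.0 - cos)[mask].mean()
    # Direct depth supervision on foreground pixels.
    depth_loss = F.l1_loss(pred_depth[mask], gt_depth[mask])
    return normal_loss + depth_loss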

Application

Interactive 3D model viewer: drag to rotate, scroll to zoom, and hold Shift to pan.

Text-to-3D

A battle mech in a mix of red, blue, and black colors, with a cannon on the head.

Detailed facial sculpt, horned head, tapered horns, deep set eyes, prominent cheekbones, furrowed brow.

Pink teapot model, symmetrical, curved spout, rounded body, flat base, circular lid, elongated handle, tapered top.

Charlie Brown, a cartoon character in a yellow and black outfit, upright posture.

A person wearing a virtual reality headset, sitting position, bent legs, clasped hands.

A pink frog wearing a green hat and bow tie, humanoid shape, bulbous hat.

Image-to-3D

Four example input images, each shown alongside its 3D reconstruction.

BibTeX


@article{jiang2025dimer,
  title={DiMeR: Disentangled Mesh Reconstruction Model},
  author={Jiang, Lutao and Lin, Jiantao and Chen, Kanghao and Ge, Wenhang and Yang, Xin and Jiang, Yifan and Lyu, Yuanhuiyi and Zheng, Xu and Chen, Yingcong},
  journal={arXiv preprint arXiv:2504.17670},
  year={2025}
}