Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling

ECCV 2024

Jaehyeok Kim, Dongyoon Wee, Dan Xu✉️
Department of CSE, HKUST · NAVER Cloud Corp.
✉️ Corresponding author

Abstract

We introduce Motion-oriented Compositional Neural Radiance Fields (MoCo-NeRF), a framework for free-viewpoint rendering of monocular human videos via a novel non-rigid motion modeling approach. For dynamic clothed humans, complex cloth dynamics create non-rigid motions that are intrinsically distinct from skeletal articulations and critically important for rendering quality. Conventionally, these motions are modeled as spatial (3D) deviations on top of skeletal transformations; however, this formulation is either time-consuming to optimize or struggles to reach optimal quality due to its high learning complexity.

To address this problem, we propose modeling non-rigid motions as radiance residual fields, which benefit from more direct color supervision during rendering and use the rigid radiance fields as a prior to reduce the complexity of the learning process. Our approach employs a single multiresolution hash encoding (MHE) to concurrently learn the canonical T-pose representation from rigid skeletal motions and the radiance residual field for non-rigid motions.
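
To make the radiance-compositional idea concrete, below is a minimal PyTorch sketch of a shared encoding feeding a rigid (canonical) radiance head and a pose-conditioned residual head whose colors are composed before volume rendering. The `HashEncoding` stand-in, module names, and dimensions are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HashEncoding(nn.Module):
    """Stand-in for a multiresolution hash encoding (MHE); a real
    Instant-NGP-style MHE would replace this linear projection."""
    def __init__(self, out_dim=32):
        super().__init__()
        self.proj = nn.Linear(3, out_dim)

    def forward(self, x):
        return self.proj(x)

class MoCoNeRFSketch(nn.Module):
    def __init__(self, feat_dim=32, pose_dim=69):
        super().__init__()
        self.mhe = HashEncoding(feat_dim)  # single MHE shared by both branches
        # Rigid branch: canonical T-pose radiance (RGB + density).
        self.rigid_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 4))
        # Non-rigid branch: pose-conditioned radiance residual (RGB only).
        self.residual_head = nn.Sequential(
            nn.Linear(feat_dim + pose_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x_canonical, pose):
        # x_canonical: (N, 3) samples warped to canonical space by skeletal motion.
        # pose: (pose_dim,) body-pose vector of the current frame.
        f = self.mhe(x_canonical)
        rigid = self.rigid_head(f)
        rgb_rigid, sigma = rigid[..., :3], rigid[..., 3]
        pose_in = pose.expand(f.shape[0], -1)         # broadcast pose to all samples
        delta_rgb = self.residual_head(torch.cat([f, pose_in], dim=-1))
        rgb = torch.sigmoid(rgb_rigid + delta_rgb)    # compose rigid + residual radiance
        return rgb, sigma                             # consumed by volume rendering
```

Because the residual branch only predicts a color correction on top of the rigid radiance, its learning problem is simpler than regressing a 3D offset for every canonical point.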

Additionally, to further improve both training efficiency and usability, we extend MoCo-NeRF to support simultaneous training of multiple subjects within a single framework, enabled by our effective design for modeling non-rigid motions. This scalability is achieved by integrating a global MHE and learnable identity codes alongside multiple local MHEs. We present extensive results on ZJU-MoCap and MonoCap, clearly demonstrating state-of-the-art performance in both single- and multi-subject settings. The code and models will be made publicly available.

Single-subject Comparisons (Per-subject Training)

Interactive side-by-side comparisons of our renderings against GauHuman [1], Instant-NVR [2], and HumanNeRF [3].

Multi-subject Comparisons (Unified Training)

Interactive side-by-side comparisons of our unified multi-subject model against GauHuman [1], Instant-NVR [2], and HumanNeRF [3].

Pipeline

Overview of the proposed MoCo-NeRF framework for free-viewpoint human rendering from a monocular video. Rather than estimating a geometric offset for every continuous canonical point under each body pose, our framework handles all deformations through its radiance-compositional approach with a single MHE and achieves state-of-the-art performance. The pose-embedded implicit feature further enhances the learning of non-rigid radiance residuals by enabling pose-distinctive representation learning.


Pose-embedded implicit feature generation: we employ cross-attention to modulate a single learnable base code into pose-adaptive features.
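
A hedged sketch of this cross-attention modulation follows; the per-joint tokenization, dimensions, and module names are assumptions for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PoseEmbeddedFeature(nn.Module):
    def __init__(self, dim=64, n_joints=23, n_heads=4):
        super().__init__()
        self.base_code = nn.Parameter(torch.randn(1, 1, dim))  # single learnable base code
        self.pose_proj = nn.Linear(3, dim)                     # per-joint rotation -> token
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, pose):
        # pose: (B, n_joints, 3) per-joint rotations, e.g. SMPL axis-angle.
        tokens = self.pose_proj(pose)                          # (B, n_joints, dim) keys/values
        query = self.base_code.expand(pose.shape[0], -1, -1)   # base code acts as the query
        feat, _ = self.attn(query, tokens, tokens)             # cross-attention modulation
        return feat.squeeze(1)                                 # pose-adaptive feature (B, dim)
```

Using the base code as the query lets every pose produce a distinct modulation of the same shared representation, which is what makes the resulting feature pose-distinctive.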

Extensions of MoCo-NeRF for multi-subject learning: a global MHE, a set of $N$ local MHEs, and a dictionary of learnable base codes serving as ID codes.
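
The sketch below illustrates one plausible reading of this extension, reusing the `HashEncoding` stand-in from the first sketch: a shared global MHE, one local MHE per subject, and an embedding table of identity codes. Names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class HashEncoding(nn.Module):
    """Stand-in for a multiresolution hash encoding, as in the sketch above."""
    def __init__(self, out_dim=32):
        super().__init__()
        self.proj = nn.Linear(3, out_dim)

    def forward(self, x):
        return self.proj(x)

class MultiSubjectEncoder(nn.Module):
    def __init__(self, n_subjects, feat_dim=32, id_dim=64):
        super().__init__()
        self.global_mhe = HashEncoding(feat_dim)                  # shared across all subjects
        self.local_mhes = nn.ModuleList(
            [HashEncoding(feat_dim) for _ in range(n_subjects)])  # per-subject detail
        self.id_codes = nn.Embedding(n_subjects, id_dim)          # dictionary of ID codes

    def forward(self, x, subject_id):
        # x: (N, 3) canonical-space samples; subject_id: integer subject index.
        f = self.global_mhe(x) + self.local_mhes[subject_id](x)   # fuse shared + local features
        sid = torch.tensor([subject_id])
        return f, self.id_codes(sid).squeeze(0)                   # per-subject ID code
```

Each ID code would then play the role of the single base code in the cross-attention module above, so one network can serve all $N$ subjects.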

References

[1] Hu, S., Liu, Z.: GauHuman: Articulated Gaussian splatting from monocular human videos. In: CVPR (2024)
[2] Geng, C., Peng, S., Xu, Z., Bao, H., Zhou, X.: Learning neural volumetric representations of dynamic humans in minutes. In: CVPR (2023)
[3] Weng, C.Y., Curless, B., Srinivasan, P.P., Barron, J.T., Kemelmacher-Shlizerman, I.: HumanNeRF: Free-viewpoint rendering of moving people from monocular video. In: CVPR (2022)

BibTeX

@inproceedings{kim2024moconerf,
    title={Motion-Oriented Compositional Neural Radiance Fields for Monocular Dynamic Human Modeling},
    author={Kim, Jaehyeok and Wee, Dongyoon and Xu, Dan},
    booktitle={ECCV},
    year={2024}
}