Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model

KAIST
[Main figure]

Text2Control3D can generate a 3D avatar given a text description and a casually captured monocular video for conditional control of facial expression and shape.

Abstract

Recent advances in diffusion models such as ControlNet have enabled geometrically controllable, high-fidelity text-to-image generation. However, none of them addresses the question of adding such controllability to text-to-3D generation.

In response, we propose Text2Control3D, a controllable text-to-3D avatar generation method whose facial expression is controllable given a monocular video casually captured with a hand-held camera. Our main strategy is to construct the 3D avatar in Neural Radiance Fields (NeRF) optimized with a set of controlled viewpoint-aware images that we generate with ControlNet, whose condition input is the depth map extracted from the input video.
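
As a concrete illustration, the snippet below sketches depth-conditioned image generation with the publicly available diffusers ControlNet pipeline; the checkpoints, prompt, and file names are illustrative placeholders rather than the exact setup used in the paper.

# Minimal sketch: generating one viewpoint-aware avatar image from a depth map
# with a depth-conditioned ControlNet. Checkpoints, prompt, and file names are
# illustrative placeholders, not necessarily the paper's exact configuration.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Depth map extracted from one frame (and viewpoint) of the input video.
depth_map = load_image("depth_view_000.png")

avatar_view = pipe(
    "a handsome white man with a short-fade haircut",
    image=depth_map,
    num_inference_steps=30,
).images[0]
avatar_view.save("avatar_view_000.png")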

When generating the viewpoint-aware images, we utilize cross-reference attention to inject well-controlled, referential facial expression and appearance via cross attention. We also conduct low-pass filtering of the Gaussian latent of the diffusion model in order to ameliorate the viewpoint-agnostic texture problem we observed in our empirical analysis, where the viewpoint-aware images contain identical textures at identical pixel positions, which is incomprehensible in 3D. Finally, to train NeRF with images that are viewpoint-aware yet not strictly consistent in geometry, our approach considers per-image geometric variation as a view of deformation from a shared 3D canonical space. Consequently, we construct the 3D avatar in the canonical space of a deformable NeRF by learning a set of per-image deformations via a deformation field table. We present empirical results and discuss the effectiveness of our method.
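
For intuition, the following is a minimal sketch of how low-pass filtering of the initial Gaussian latent could be implemented; the FFT-based mask and cutoff value are illustrative assumptions, not the exact formulation in the paper.

# Hedged sketch: low-pass filtering a shared Gaussian latent before denoising,
# to suppress the high-frequency components associated with view-agnostic textures.
# The FFT-based mask and cutoff are illustrative assumptions, not the paper's exact filter.
import torch

def low_pass_latent(latent: torch.Tensor, cutoff: float = 0.25) -> torch.Tensor:
    """Zero out high spatial frequencies of a (B, C, H, W) Gaussian latent."""
    freq = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    _, _, h, w = latent.shape
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    radius = (xx ** 2 + yy ** 2).sqrt()
    mask = (radius <= cutoff).float().to(latent.device)
    freq = freq * mask
    return torch.fft.ifft2(torch.fft.ifftshift(freq, dim=(-2, -1))).real

latent = torch.randn(1, 4, 64, 64)       # initial Gaussian latent of the diffusion model
smooth_latent = low_pass_latent(latent)  # passed to the sampler in place of the raw latent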

Method

[Method figure]

Text2Control3D takes a text description (e.g., "Elon Musk", "A handsome white man with a short-fade haircut") and the frames of a monocular video for control of facial expression and shape.

We first extract viewpoint-augmented depth maps from the monocular video. Then, we use these depth maps as conditions for generating viewpoint-aware avatar images with ControlNet, enhanced with two of our techniques: (1) cross-reference attention, to achieve controllability over facial expression and appearance across the viewpoint-aware generations, and (2) low-pass filtering of the Gaussian latent, to remove view-agnostic textures that break 3D consistency. Finally, since these images are viewpoint-aware yet not strictly consistent in geometry, we regard each image as a view of a per-image deformation of a shared 3D canonical space, and learn a deformation field table to construct the 3D avatar in the canonical space of a deformable NeRF.
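
As an illustration of the last step, the sketch below shows one plausible form of a deformation field table: a learnable per-image code conditions a small MLP that warps observation-space points into the shared canonical space of the NeRF. The network sizes and conditioning scheme are assumptions for exposition, not the exact architecture in the paper.

# Hedged sketch: a per-image deformation field table for a deformable NeRF.
# One learnable code per generated image conditions an MLP that maps observation-space
# points to the shared canonical space. Sizes and conditioning are illustrative assumptions.
import torch
import torch.nn as nn

class DeformationFieldTable(nn.Module):
    def __init__(self, num_images: int, code_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.codes = nn.Embedding(num_images, code_dim)  # one deformation code per image
        self.deform_mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # 3D offset into the shared canonical space
        )

    def forward(self, points: torch.Tensor, image_idx: torch.Tensor) -> torch.Tensor:
        """Warp observation-space points (N, 3) of one training image into canonical space."""
        code = self.codes(image_idx).expand(points.shape[0], -1)
        offset = self.deform_mlp(torch.cat([points, code], dim=-1))
        return points + offset  # canonical-space points queried by the shared NeRF

deform = DeformationFieldTable(num_images=100)
pts = torch.rand(4096, 3)                      # sampled ray points for one training image
canonical_pts = deform(pts, torch.tensor([7]))  # warped points fed to the canonical NeRF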

BibTeX

@misc{hwang2023text2control3d,
      title={Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model}, 
      author={Sungwon Hwang and Junha Hyung and Jaegul Choo},
      year={2023},
      eprint={2309.03550},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}