OpenDance: Multimodal Controllable 3D Dance Generation Using Large-scale Internet Data

Anonymous Submission
Project Teaser Image

We collect OpenDance5D, a large-scale multimodal human dance dataset, and develop OpenDanceNet for controllable and flexible multimodal generation conditioned on any "Music+X" combination (X: 2D keypoints, global position, and fine-grained text prompts).


Abstract

Music-driven dance generation offers significant creative potential yet faces considerable challenges. The absence of fine-grained multimodal data and the difficulty of flexible multi-conditional generation have limited the controllability and diversity of previous works in practice. In this paper, we build OpenDance5D, an extensive human dance dataset comprising over 101 hours of dance across 14 distinct genres. Each sample has five modalities to facilitate robust cross-modal learning: RGB video, audio, 2D keypoints, 3D motion, and fine-grained textual descriptions from human arts. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation conditioned on music and arbitrary combinations of text prompts, keypoints, or character positioning. Comprehensive experiments demonstrate that OpenDanceNet achieves high-fidelity generation and flexible controllability.



OpenDance5D Dataset


Data Distribution of OpenDance5D in terms of (a) dancers and (b) genres. The violin plot shows the number of samples per dancer in raw video data, while the sunburst chart illustrates the distribution of samples across 14 dance genres.



OpenDanceNet


OpenDanceNet, the masked-modeling-based dance generation framework. During training, multiple user-customized conditions are fed to the model to generate controllable dance results. Transformer-based diffusion networks are employed as the body diffuser and the hand diffuser.
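The page does not spell out how arbitrary "Music+X" combinations are handled during training. One common way to support them is to randomly drop the optional conditions per training sample, so the model sees every subset. The sketch below illustrates that idea only; it is an assumption, not a description of the authors' implementation, and the names `OPTIONAL_CONDITIONS`, `sample_condition_mask`, `apply_condition_mask`, and `keep_prob` are hypothetical.

```python
import random
from typing import Dict, List, Optional

# Hypothetical names for the optional "X" conditions; music is always present.
OPTIONAL_CONDITIONS = ("keypoints_2d", "global_position", "text")


def sample_condition_mask(rng: random.Random, keep_prob: float = 0.5) -> Dict[str, bool]:
    """Keep music unconditionally; keep each optional condition with keep_prob."""
    mask = {"music": True}
    for name in OPTIONAL_CONDITIONS:
        mask[name] = rng.random() < keep_prob
    return mask


def apply_condition_mask(
    conditions: Dict[str, List[float]], mask: Dict[str, bool]
) -> Dict[str, Optional[List[float]]]:
    """Replace every dropped condition with None (a stand-in for a mask token)."""
    return {
        name: (value if mask.get(name, False) else None)
        for name, value in conditions.items()
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy feature vectors; real conditions would be per-frame tensors.
    sample = {
        "music": [0.1, 0.2],
        "keypoints_2d": [0.3, 0.4],
        "global_position": [0.0, 1.0],
        "text": [0.5, 0.6],
    }
    print(apply_condition_mask(sample, sample_condition_mask(rng)))
```

At inference time, the same masking function could be driven by whichever conditions the user actually supplies, rather than by random sampling.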



Dataset Demos

Visualization case #1


Visualization case #2


Visualization case #3


Visualization case #4


Multimodal Condition Controllability

🎵 Videos with music (sound on) 🎵

Music only


Music + All Conditions


Music + 2D Keypoints


Music + Global Position


Music + User-defined Signals

(More demos coming soon...)


Straight Line (Input: start + end positions)


Circle Line (Input: start position, radius, and center point)
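The two user-defined signals listed above can be turned into per-frame global positions in a few lines. The following is a minimal sketch, assuming positions are 2D ground-plane coordinates and frames are evenly spaced; the function names, the `num_frames` and `turns` parameters, and the coordinate convention are illustrative assumptions rather than the project's actual interface.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # assumed ground-plane (x, z) coordinates


def straight_line(start: Point, end: Point, num_frames: int) -> List[Point]:
    """Linearly interpolate from start to end, one position per frame."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    steps = num_frames - 1
    return [
        (
            start[0] + (end[0] - start[0]) * i / steps,
            start[1] + (end[1] - start[1]) * i / steps,
        )
        for i in range(num_frames)
    ]


def circle_line(
    start: Point, center: Point, radius: float, num_frames: int, turns: float = 1.0
) -> List[Point]:
    """Walk along a circle around center, beginning at the angle of start.

    If start does not lie exactly at distance radius from center, the first
    position is the projection of start onto the circle.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    theta0 = math.atan2(start[1] - center[1], start[0] - center[0])
    steps = num_frames - 1
    return [
        (
            center[0] + radius * math.cos(theta0 + 2.0 * math.pi * turns * i / steps),
            center[1] + radius * math.sin(theta0 + 2.0 * math.pi * turns * i / steps),
        )
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    print(straight_line((0.0, 0.0), (2.0, 4.0), 3))
    print(circle_line((1.0, 0.0), (0.0, 0.0), 1.0, 5))
```

The resulting list of positions could then serve as the global position condition alongside the music input.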


Different Characters Dance Now



OpenDanceNet Generation Demos



(trained on OpenDance5D dataset, with multimodal condition)


(trained on AIST++ dataset, without multimodal condition)