Diffusion models have revolutionized the field of talking head generation, yet they still face challenges in expressiveness, controllability, and stability during long-duration generation. In this work, we propose the EmotiveTalk framework to address these issues. EmotiveTalk generates expressive talking head videos with controllable emotion and stable long-duration synthesis, achieving state-of-the-art performance.
We propose a Vision-guided Audio Information Decoupling (V-AID) approach that separates the lip-related and expression-related information contained in the audio signal and aligns the audio representations with video representations under the guidance of visual facial-motion information. Specifically, to achieve better alignment between the speech and expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module, which generates temporal expression-related representations from audio under utterance-level emotional conditions drawn from multiple optional driving sources. To drive these decoupled representations effectively, we further propose an efficient video diffusion framework for expressive talking head generation, which improves both quality and stability. The backbone incorporates an Expression Decoupling Injector (EDI) module that automatically decouples expression information from the reference portrait while injecting the expression-driving information.
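Below is a minimal, hypothetical sketch of the data flow described above, written in PyTorch. The module names mirror the paper (V-AID, Di-CTE, EDI), but every layer choice, dimension, and interface is an assumption for illustration only; the vision-guided alignment losses, the full diffusion training loop, and the video backbone itself are omitted.

```python
# Hypothetical sketch of the EmotiveTalk data flow; not the authors' implementation.
import torch
import torch.nn as nn

class VAID(nn.Module):
    """V-AID (assumed form): split audio features into lip-related and
    expression-related streams (vision guidance omitted here)."""
    def __init__(self, d_audio=768, d_lat=256):
        super().__init__()
        self.lip_proj = nn.Sequential(nn.Linear(d_audio, d_lat), nn.GELU())
        self.exp_proj = nn.Sequential(nn.Linear(d_audio, d_lat), nn.GELU())

    def forward(self, audio_feat):                        # (B, T, d_audio)
        return self.lip_proj(audio_feat), self.exp_proj(audio_feat)

class DiCTE(nn.Module):
    """Di-CTE (assumed form): denoise a temporal expression representation
    conditioned on audio and an utterance-level emotion embedding
    (a single denoising step is shown for brevity)."""
    def __init__(self, d_lat=256, d_emo=128):
        super().__init__()
        self.denoiser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_lat, nhead=4, batch_first=True),
            num_layers=2)
        self.cond = nn.Linear(d_lat + d_emo, d_lat)

    def forward(self, noisy_exp, audio_exp, emotion):     # (B,T,d), (B,T,d), (B,d_emo)
        emo = emotion.unsqueeze(1).expand(-1, audio_exp.size(1), -1)
        cond = self.cond(torch.cat([audio_exp, emo], dim=-1))
        return self.denoiser(noisy_exp + cond)

class EDI(nn.Module):
    """EDI (assumed form): inject expression-driving tokens into backbone
    features via cross-attention."""
    def __init__(self, d_lat=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_lat, num_heads=4, batch_first=True)

    def forward(self, backbone_feat, exp_tokens):         # (B,T,d), (B,T,d)
        out, _ = self.attn(backbone_feat, exp_tokens, exp_tokens)
        return backbone_feat + out

# Toy forward pass with random tensors (all shapes are illustrative).
B, T = 2, 25
audio = torch.randn(B, T, 768)
emotion = torch.randn(B, 128)
lip, exp = VAID()(audio)
exp_seq = DiCTE()(torch.randn(B, T, 256), exp, emotion)
video_feat = EDI()(torch.randn(B, T, 256), exp_seq)
print(lip.shape, exp_seq.shape, video_feat.shape)
```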
Quantitative comparison on the HDTF and MEAD test sets (each cell reports HDTF / MEAD). Driven: A = audio-driven, A+V = audio- and video-driven.

Methods | Driven | FID (↓) | FVD (↓) | Sync-C (↑) | Sync-D (↓) | E-FID (↓)
---|---|---|---|---|---|---
SadTalker | A | 22.34 / 36.88 | 589.63 / 132.27 | 7.75 / 6.46 | 7.36 / 8.07 | 0.66 / 1.14 |
AniTalker | A | 51.66 / 68.01 | 583.70 / 941.49 | 7.73 / 6.76 | 7.43 / 7.64 | 1.11 / 1.11 |
AniPortrait | A | 17.71 / 42.43 | 676.30 / 379.08 | 3.75 / 2.30 | 10.63 / 12.38 | 1.21 / 2.69 |
Hallo | A | 17.15 / 52.07 | 276.31 / 210.56 | 7.99 / 7.45 | 7.50 / 7.47 | 0.65 / 0.60 |
Ours | A | 16.64 / 53.21 | 140.96 / 207.67 | 8.24 / 6.82 | 7.09 / 7.43 | 0.54 / 0.57 |
PD-FGC | A+V | 67.97 / 121.46 | 464.90 / 353.75 | 7.30 / 5.15 | 7.72 / 8.77 | 0.74 / 1.92 |
StyleTalk | A+V | 29.65 / 118.48 | 184.60 / 197.18 | 4.34 / 3.86 | 10.35 / 10.74 | 0.42 / 0.56 |
DreamTalk | A+V | 29.37 / 105.92 | 263.78 / 204.48 | 6.80 / 5.64 | 8.03 / 8.69 | 0.55 / 0.87 |
Ours | A+V | 16.09 / 50.84 | 120.70 / 153.71 | 8.41 / 6.79 | 7.11 / 7.58 | 0.34 / 0.40 |
Ground Truth | A+V | - | - | 8.63 / 7.30 | 6.75 / 8.31 | - |
@misc{wang2024emotivetalkexpressivetalkinghead,
title={EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion},
author={Haotian Wang and Yuzhe Weng and Yueyan Li and Zilu Guo and Jun Du and Shutong Niu and Jiefeng Ma and Shan He and Xiaoyan Wu and Qiming Hu and Bing Yin and Cong Liu and Qingfeng Liu},
year={2024},
eprint={2411.16726},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.16726},
}