EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Haotian Wang¹, Yuzhe Weng¹, Yueyan Li², Zilu Guo¹, Jun Du¹, Shutong Niu¹, Jiefeng Ma¹, Shan He³, Xiaoyan Wu³, Qiming Hu³, Bing Yin³, Cong Liu³, Qingfeng Liu³
¹NERCSLIP, University of Science and Technology of China
²Imperial College London
³iFLYTEK CO., LTD.

Abstract

Figure 1. We propose EmotiveTalk, an expressive talking head generation framework. Given a single portrait and driving audio as input, our method generates expressive portrait videos that are synchronized with the audio and allow the speaking style to be customized through additional emotion control.

Diffusion models have revolutionized talking head generation, yet they still face challenges in expressiveness, controllability, and stability during long-duration generation. In this work, we propose the EmotiveTalk framework to address these issues. EmotiveTalk generates expressive talking head videos with reliable emotion controllability and stable long-duration generation, achieving state-of-the-art performance.

Method

Figure 2. The framework of EmotiveTalk.

We propose a Vision-guided Audio Information Decoupling (V-AID) approach that decouples the lip-related and expression-related information contained in audio signals and aligns the audio representations with video representations under the guidance of visual facial motion. Specifically, to better align the speech and expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module that generates temporal expression-related representations from audio under utterance-level emotion conditions drawn from multiple optional driving sources. To render videos from the decoupled representations, we then propose an efficient video diffusion framework for expressive talking head generation that improves both quality and stability. The backbone incorporates an Expression Decoupling Injector (EDI) module, which automatically decouples expression information from the reference portrait while injecting the expression-driven information. A minimal code sketch of this pipeline is given below.
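To make the data flow concrete, the sketch below mirrors the description above in PyTorch-style pseudocode. Only the module names V-AID, Di-CTE, and EDI come from the paper; every interface, tensor shape, and inner layer (the toy audio encoder, GRU denoiser, and placeholder backbone) is an illustrative assumption, not the authors' implementation.

# A minimal sketch of the EmotiveTalk inference flow described above (PyTorch).
# All interfaces, dimensions, and inner layers are illustrative assumptions.
import torch
import torch.nn as nn

class DiCTE(nn.Module):
    """Diffusion-based Co-speech Temporal Expansion (sketch): expands an
    utterance-level emotion condition into a temporal expression sequence
    aligned with the audio features via iterative refinement."""
    def __init__(self, dim):
        super().__init__()
        self.denoiser = nn.GRU(dim * 2, dim, batch_first=True)

    def forward(self, audio_feat, emotion_cond, steps=4):
        x = torch.randn_like(audio_feat)                      # start from noise
        cond = emotion_cond.unsqueeze(1).expand_as(audio_feat)
        for _ in range(steps):                                # toy refinement loop
            x, _ = self.denoiser(torch.cat([x, cond], dim=-1))
        return x                                              # (B, T, dim) expression sequence

class VAID(nn.Module):
    """Vision-guided Audio Information Decoupling (sketch): splits audio
    features into lip-related and expression-related driving signals."""
    def __init__(self, dim=512, n_mels=80):
        super().__init__()
        self.audio_encoder = nn.Linear(n_mels, dim)           # stand-in for a speech encoder
        self.lip_head = nn.Linear(dim, dim)                   # lip-related representation
        self.di_cte = DiCTE(dim)                              # expression-related representation

    def forward(self, mel, emotion_cond):
        a = self.audio_encoder(mel)                           # (B, T, dim) audio features
        lip = self.lip_head(a)                                # aligned with visual lip motion
        expr = self.di_cte(a, emotion_cond)                   # co-speech expression sequence
        return lip, expr

class EmotiveTalkPipeline(nn.Module):
    """End-to-end sketch: V-AID produces driving signals; a video diffusion
    backbone with an EDI module would render frames from the reference portrait."""
    def __init__(self, dim=512):
        super().__init__()
        self.vaid = VAID(dim)
        self.backbone = nn.Identity()                         # placeholder for the diffusion backbone + EDI

    def forward(self, portrait, mel, emotion_cond):
        lip, expr = self.vaid(mel, emotion_cond)
        # In the full model, EDI decouples expression information from the
        # reference portrait and injects `expr`, while `lip` drives mouth motion;
        # this placeholder backbone simply echoes the portrait.
        frames = self.backbone(portrait)
        return frames, lip, expr

# Example usage with dummy tensors:
pipe = EmotiveTalkPipeline()
frames, lip, expr = pipe(torch.randn(1, 3, 512, 512),         # reference portrait
                         torch.randn(1, 100, 80),             # mel-spectrogram of the driving audio
                         torch.randn(1, 512))                 # utterance-level emotion embedding

In this sketch the emotion embedding stands in for the utterance-level condition, which, as the demos below show, can be derived from the audio itself, an emotion reference image, or a text prompt.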

Audio-Driven Talking Head Generation

Expressive Talking Head Generation

Vocal Source: TED Speech
Vocal Source: Christiana Figueres
Vocal Source: TED Speech

Multi-Style Generation

Song: Love the Way You Lie, Rihanna
Speech: Female Education, Emily Blunt
Talk: TED Talk

Cross-Actor Generation

Generation at Different Resolutions (512 and 1024)

Performance in Different Languages (e.g., Chinese, French, and Cantonese)

Vocal Source: Chinese, Talk
Vocal Source: French, Talk
Vocal Source: Cantonese, Broadcast

Emotion-Controllable Generation


Generation with Different Emotions (e.g., Neutral, Angry, Happy, Surprised)

Image-source/Text-source Emotion Control

Image Prompt: An Angry Image
Text Prompt: Be cheerful and say that sentence

Comparison with SOTA Methods


Overall Comparison

All metrics are reported as HDTF / MEAD. Driven: A = audio-driven; A+V = audio- and video-driven.

Method       | Driven | FID (↓)        | FVD (↓)         | Sync-C (↑)  | Sync-D (↓)    | E-FID (↓)
SadTalker    | A      | 22.34 / 36.88  | 589.63 / 132.27 | 7.75 / 6.46 | 7.36 / 8.07   | 0.66 / 1.14
AniTalker    | A      | 51.66 / 68.01  | 583.70 / 941.49 | 7.73 / 6.76 | 7.43 / 7.64   | 1.11 / 1.11
AniPortrait  | A      | 17.71 / 42.43  | 676.30 / 379.08 | 3.75 / 2.30 | 10.63 / 12.38 | 1.21 / 2.69
Hallo        | A      | 17.15 / 52.07  | 276.31 / 210.56 | 7.99 / 7.45 | 7.50 / 7.47   | 0.65 / 0.60
Ours         | A      | 16.64 / 53.21  | 140.96 / 207.67 | 8.24 / 6.82 | 7.09 / 7.43   | 0.54 / 0.57
PD-FGC       | A+V    | 67.97 / 121.46 | 464.90 / 353.75 | 7.30 / 5.15 | 7.72 / 8.77   | 0.74 / 1.92
StyleTalk    | A+V    | 29.65 / 118.48 | 184.60 / 197.18 | 4.34 / 3.86 | 10.35 / 10.74 | 0.42 / 0.56
DreamTalk    | A+V    | 29.37 / 105.92 | 263.78 / 204.48 | 6.80 / 5.64 | 8.03 / 8.69   | 0.55 / 0.87
Ours         | A+V    | 16.09 / 50.84  | 120.70 / 153.71 | 8.41 / 6.79 | 7.11 / 7.58   | 0.34 / 0.40
Ground Truth | A+V    | -              | -               | 8.63 / 7.30 | 6.75 / 8.31   | -

Case Study on Audio-Driven Generation

Case Study on Emotion Control (e.g. Neutral to Happy)

Vocal Source: HDTF Dataset

BibTeX

@misc{wang2024emotivetalkexpressivetalkinghead,
  title={EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion}, 
  author={Haotian Wang and Yuzhe Weng and Yueyan Li and Zilu Guo and Jun Du and Shutong Niu and Jiefeng Ma and Shan He and Xiaoyan Wu and Qiming Hu and Bing Yin and Cong Liu and Qingfeng Liu},
  year={2024},
  eprint={2411.16726},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.16726}, 
}