Audio-Driven Talking Face Generation with Multi-Scale Visual Enhancement


Keywords: audio-driven; talking face generation; visual enhancement; visual quality

CLC number: TP391    Document code: A

DOI: 10.7652/xjtuxb202506017    Article ID: 0253-987X(2025)06-0167-10

Audio-Driven Talking Face Generation with Multi-Scale Visual Enhancement

YANG Xiangyan1, LIANG Huihui2, CHEN Xi, LI Fan2

(1. School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China; 2. Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China)

Abstract: To address the limited video clarity and realism of existing audio-driven talking face generation methods, this paper proposes VisClearTalk, an end-to-end talking face generation method that incorporates multi-scale visual enhancement through a face decoder equipped with a visual enhancement module. First, the face encoder processes a random reference frame and a prior frame whose lower half of the face is occluded to extract facial features. Simultaneously, the audio encoder extracts features from the audio to guide facial content generation. Subsequently, the face decoder integrates these features and performs an initial reconstruction of the facial image through convolutional modules. Finally, the visual enhancement module applies multi-scale convolution and residual fusion to further enhance the details and edge information of the lower face region, improving the visual quality of the generated talking face videos. VisClearTalk was validated on public lip-reading datasets; both quantitative and qualitative results show that the visual enhancement module improves the fineness and realism of facial visual content, enabling the generation of clear and natural talking face videos. In terms of performance metrics, the peak signal-to-noise ratio reaches 34.349 dB, the structural similarity reaches 0.933, and the learned perceptual image patch similarity is reduced to 0.040. VisClearTalk offers a viable solution for current talking face video generation needs.

Keywords： audio-driven; talking face generation; visual enhancement; visual quality
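
For readers unfamiliar with the first metric quoted in the abstract, the following is a minimal sketch of how a peak signal-to-noise ratio (PSNR) in dB is commonly computed between a reference frame and a generated frame. It is not the evaluation code used for VisClearTalk: the function name `psnr`, the list-of-rows grayscale image representation, and the 8-bit peak value of 255 are all assumptions made for illustration, and the excerpt does not state the exact protocol (color handling, cropping, averaging over frames) behind the reported 34.349 dB.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images.

    Both images are lists of rows of pixel intensities in [0, max_value].
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    same_shape = len(reference) == len(generated) and all(
        len(r) == len(g) for r, g in zip(reference, generated)
    )
    if not same_shape:
        raise ValueError("images must have the same shape")

    pixel_count = sum(len(row) for row in reference)
    if pixel_count == 0:
        raise ValueError("images must not be empty")

    # Mean squared error over all pixels.
    mse = sum(
        (p - q) ** 2
        for r, g in zip(reference, generated)
        for p, q in zip(r, g)
    ) / pixel_count

    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded

    # PSNR = 10 * log10(MAX^2 / MSE)
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 3x3 example with made-up pixel values.
reference = [[52, 55, 61], [59, 79, 61], [76, 62, 59]]
generated = [[54, 55, 60], [59, 77, 63], [75, 62, 58]]
value = psnr(reference, generated)
print(f"PSNR: {value:.2f} dB")
```

Higher PSNR means the generated frame is closer to the reference in a pixel-wise sense. Structural similarity and learned perceptual image patch similarity, the other two metrics in the abstract, capture structural and perceptual differences that PSNR alone does not.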

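Audio-driven talking face generation is one of the important research topics in the audio-visual field []. It organically integrates visual and auditory information, enhancing human understanding and perception of information.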


Journal of Xi'an Jiaotong University

2025, Issue 06
