Audio-Driven Talking Face Generation with Multi-Scale Visual Enhancement
CLC number: TP391; Document code: A
DOI: 10.7652/xjtuxb202506017; Article ID: 0253-987X(2025)06-0167-10
YANG Xiangyan1, LIANG Huihui2, CHEN Xi, LI Fan2
(1. School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China; 2. Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China)
Abstract: To address the limited clarity and realism of videos produced by existing audio-driven talking face generation methods, this paper proposes VisClearTalk, an end-to-end talking face generation method that incorporates multi-scale visual enhancement through a face decoder equipped with a visual enhancement module. First, the face encoder processes a random reference frame and a prior frame with the lower half of the face occluded to extract facial features. Simultaneously, the audio encoder extracts features from the audio to guide the generation of facial content. Subsequently, the face decoder integrates these features and performs an initial reconstruction of the facial image through convolutional modules. Finally, the visual enhancement module applies multi-scale convolution and residual fusion to further enhance the details and edge information of the lower face region, improving the visual quality of the generated talking face videos. VisClearTalk was validated experimentally on public lip-reading datasets; both quantitative and qualitative results show that introducing the visual enhancement module improves the fineness and realism of the facial visual content, enabling the generation of clear and natural talking face videos. In terms of performance metrics, the peak signal-to-noise ratio (PSNR) reached 34.349 dB, the structural similarity (SSIM) reached 0.933, and the learned perceptual image patch similarity (LPIPS) was reduced to 0.040. VisClearTalk offers a viable solution for current talking face video generation needs.
Keywords: audio-driven; talking face generation; visual enhancement; visual quality
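The abstract reports image quality partly as a peak signal-to-noise ratio in dB. As a reminder of what that number measures, the following is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE), for two equally sized grayscale images stored as nested lists. The function name `psnr`, the 8-bit peak value of 255, and the list-based image layout are assumptions made for illustration; the paper's actual evaluation protocol (color handling, cropping, averaging over frames) is not described in this excerpt.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Return the PSNR in dB between two equally sized grayscale images.

    Both images are nested lists of pixel intensities in [0, max_value].
    This follows the textbook definition only; it is not the paper's code.
    """
    if len(reference) != len(generated):
        raise ValueError("images must have the same height")
    squared_error = 0.0
    count = 0
    for ref_row, gen_row in zip(reference, generated):
        if len(ref_row) != len(gen_row):
            raise ValueError("images must have the same width")
        for r, g in zip(ref_row, gen_row):
            squared_error += (r - g) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    mse = squared_error / count  # mean squared error over all pixels
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel differs by exactly 1, so MSE = 1 and PSNR = 10*log10(255^2).
reference = [[100, 120], [140, 160]]
generated = [[101, 119], [141, 159]]
print(round(psnr(reference, generated), 2))  # 48.13
```

Higher PSNR means a smaller pixel-wise error relative to the peak intensity. SSIM and LPIPS, the other two metrics in the abstract, capture structural and perceptual similarity and are not covered by this sketch.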
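Audio-driven talking face generation is an important research topic in the audio-visual field []. It integrates visual and auditory information, enhancing people's understanding and perception of information.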
Journal of Xi'an Jiaotong University, 2025, Issue 6