Visually Guided Sound Source Separation With Audio-Visual Predictive Coding
Song, Zengjie (1); Zhang, Zhaoxiang (2, 3)
Source Publication: IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
ISSN: 2162-237X
2023-07-12
Pages: 15
Corresponding Author: Zhang, Zhaoxiang (zhaoxiang.zhang@ia.ac.cn)
Abstract: The framework of visually guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor involved visual feature extractors for informative visual guidance and to separately devise a module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, because jointly optimizing and harmonizing the various model components is challenging. By contrast, this article presents a novel approach, dubbed audio-visual predictive coding (AVPC), to tackle this task in a parameter-efficient and more effective manner. The AVPC network features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding (PC)-based sound separation network that extracts audio features, fuses multimodal information, and predicts sound separation masks in the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop a valid self-supervised learning strategy for AVPC via copredicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while reducing the model size significantly. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding.
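Note: The phrase "iteratively minimizing the prediction error between features" can be illustrated with a generic predictive-coding loop. The sketch below is a toy linear example in plain Python, not the AVPC network itself: a latent vector `r` is refined step by step so that a fixed linear map `W` applied to `r` predicts a target feature vector. The function name `predictive_coding`, the matrix `W`, the step size `lr`, and the example numbers are all assumptions made for illustration; the authors' actual architecture and update rule are in the linked repository.

```python
# Toy predictive-coding loop (illustrative assumption, not the AVPC model).
# A latent vector r is refined so that the prediction W @ r matches a target
# feature vector, by gradient descent on 0.5 * ||target - W @ r||^2.

def matvec(W, r):
    """Multiply matrix W (list of rows) by vector r."""
    return [sum(w * x for w, x in zip(row, r)) for row in W]


def mat_t_vec(W, e):
    """Multiply the transpose of W by vector e."""
    n_cols = len(W[0])
    return [sum(W[i][j] * e[i] for i in range(len(W))) for j in range(n_cols)]


def predictive_coding(target, W, steps=50, lr=0.1):
    """Return (r, errors).

    r is the refined latent vector; errors[t] is the squared prediction
    error measured just before the t-th update.
    """
    r = [0.0] * len(W[0])              # start from an uninformative latent
    errors = []
    for _ in range(steps):
        prediction = matvec(W, r)      # top-down prediction of the target
        err = [t - p for t, p in zip(target, prediction)]
        errors.append(sum(e * e for e in err))
        correction = mat_t_vec(W, err)  # feed the error back to the latent
        r = [ri + lr * c for ri, c in zip(r, correction)]
    return r, errors


if __name__ == "__main__":
    # Hypothetical numbers: the target is generated from latent [2, -1].
    W = [[1.0, 0.5], [0.0, 1.0], [0.5, 0.0]]
    target = [1.5, -1.0, 1.0]
    r, errors = predictive_coding(target, W, steps=200, lr=0.1)
    print("refined latent:", r)
    print("first / last squared error:", errors[0], errors[-1])
```

Each pass computes a prediction, measures the mismatch, and feeds that mismatch back to correct the representation, which is the recursive refinement the abstract describes at a high level. In AVPC the two sides of the comparison are audio and visual features and the mappings are learned network layers, which this sketch does not attempt to reproduce.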
Keyword: Feature fusion; multimodal learning; predictive coding (PC); self-supervised learning; sound source separation
DOI: 10.1109/TNNLS.2023.3288022
WOS Keyword: INTEGRATION
Indexed By: SCI
Language: English
Funding Project: Major Project for New Generation of AI [2018AAA0100400]; National Natural Science Foundation of China [61836014]; National Natural Science Foundation of China [U21B2042]; National Natural Science Foundation of China [62072457]; National Natural Science Foundation of China [62006231]; National Natural Science Foundation of China [61976174]; China Postdoctoral Science Foundation [2021M703489]
Funding Organization: Major Project for New Generation of AI; National Natural Science Foundation of China; China Postdoctoral Science Foundation
WOS Research Area: Computer Science; Engineering
WOS Subject: Computer Science, Artificial Intelligence; Computer Science, Hardware & Architecture; Computer Science, Theory & Methods; Engineering, Electrical & Electronic
WOS ID: WOS:001030674000001
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Document Type: Journal article
Identifier: http://ir.ia.ac.cn/handle/173211/53784
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems (多模态人工智能系统全国重点实验室)
Affiliation:
1. Xi An Jiao Tong Univ, Sch Math & Stat, Xian 710049, Peoples R China
2. Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
3. Chinese Acad Sci, Hong Kong Inst Sci & Innovat, Ctr Artificial Intelligence & Robot, Hong Kong, Peoples R China
Corresponding Author Affiliation: Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Recommended Citation
GB/T 7714: Song, Zengjie, Zhang, Zhaoxiang. Visually Guided Sound Source Separation With Audio-Visual Predictive Coding[J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023: 15.
APA: Song, Zengjie, & Zhang, Zhaoxiang. (2023). Visually Guided Sound Source Separation With Audio-Visual Predictive Coding. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 15.
MLA: Song, Zengjie, et al. "Visually Guided Sound Source Separation With Audio-Visual Predictive Coding." IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2023): 15.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.