Learning to Represent Audio: From Understanding to Guiding
By Changhong Wang

Changhong Wang will give a talk on "Learning to Represent Audio: From Understanding to Guiding".

Abstract

Humans can recognize complex sounds almost instantly, distinguishing between musical instruments or identifying voices in challenging conditions. These abilities are grounded in the architecture of the auditory system. But how can this capability be exploited to enable richer integration of audio with other modalities in the AI-immersive era? This talk presents collaborative work on audio representation learning, spanning perceptually inspired representations and deep learning models. We introduce computational surrogates that serve both as versatile frontends for audio classification and as differentiable models for time-frequency analysis. These models align with auditory perception and with the physical attributes of sound production, an alignment validated using explainable AI (XAI) techniques. With the emergence of foundation models, we propose methods to quantify the sensitivity of pre-trained audio embeddings and introduce strategies to improve their robustness and generalization. Building on these insights, we outline pathways toward knowledge-driven representation learning and multimodal fusion, with applications in music generation and lyrics retrieval.

Biography

Changhong Wang is a postdoctoral researcher in the Audio Data Analysis and Signal Processing (ADASP) group at Télécom Paris. She received her PhD in 2021 from the Centre for Digital Music at Queen Mary University of London. Her research focuses on audio-centred multimodal representation learning, particularly from the perspectives of interpretability and knowledge-driven deep learning. Since 2020, Changhong has served as a regular reviewer for conferences such as ICASSP and ISMIR, and occasionally reviews for journals such as TASLP. She also actively supports the research community through a range of academic service activities.