LLM-Powered Noise-Robust Sound Event Detection: From Wild Datasets to Query-Driven Separation
By Rohan Kumar Das
Rohan Kumar Das will give a talk on "LLM-Powered Noise-Robust Sound Event Detection: From Wild Datasets to Query-Driven Separation".
Abstract
Sound event detection (SED) is challenging in noisy environments, where overlapping sounds obscure target events. The talk will first introduce a large language model (LLM)-powered dataset, namely the wild domestic environment sound event detection (WildDESED) dataset. It is crafted as an extension of the original DESED dataset to reflect the diverse acoustic variability and complex noises of home settings. We leveraged LLMs to generate eight different domestic scenarios based on the target sound categories of the DESED dataset. We then enriched these scenarios with a carefully tailored mixture of noises selected from AudioSet, ensuring no overlap with the target sounds, to create the WildDESED dataset. Language-queried audio source separation (LASS) aims to isolate target sound events from a noisy clip. However, this approach can fail when the exact target sound is unknown, particularly on noisy test sets, leading to reduced performance. To address this issue, we leverage the capabilities of LLMs to analyze and summarize acoustic data. By using LLMs to identify and select specific noise types, we implement a noise augmentation method for noise-robust fine-tuning. The fine-tuned model is then used to produce clip-wise event predictions, which serve as text queries for the LASS model. Our studies demonstrate that the proposed method improves SED performance in noisy environments. This work represents an early application of LLMs to noise-robust SED and suggests a promising direction for handling overlapping events in SED.
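As a rough illustration of the query-driven step described above (not the speaker's actual implementation), clip-wise event predictions can be mapped to a natural-language query for a LASS model. The sketch below assumes DESED's ten domestic event classes; the function name, threshold, and query template are illustrative assumptions:

```python
# Hypothetical sketch: turn clip-level SED probabilities into a text query
# for a language-queried source separation (LASS) model. The class list is
# DESED's ten domestic event classes; everything else is an assumption.

DESED_CLASSES = [
    "Alarm_bell_ringing", "Blender", "Cat", "Dish", "Dog",
    "Electric_shaver_toothbrush", "Frying", "Running_water",
    "Speech", "Vacuum_cleaner",
]

def build_text_query(clip_probs, threshold=0.5):
    """Select event classes whose clip-level probability exceeds the
    threshold and join them into a natural-language query string."""
    selected = [c for c, p in zip(DESED_CLASSES, clip_probs) if p >= threshold]
    if not selected:
        return None  # nothing confident to separate; fall back to the raw clip
    readable = [c.replace("_", " ").lower() for c in selected]
    return "the sound of " + " and ".join(readable)

# Example: a clip where the tagger is confident about dog and speech
probs = [0.1, 0.05, 0.2, 0.1, 0.8, 0.0, 0.1, 0.3, 0.9, 0.05]
print(build_text_query(probs))  # -> "the sound of dog and speech"
```

The query string would then condition the LASS model to extract only the predicted events before the final SED pass.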
Biography
Rohan Kumar Das received his Ph.D. degree from the Indian Institute of Technology (IIT) Guwahati in 2017. After completing his doctoral studies, he worked as a Data Scientist at Kovid Research Labs in 2017, where he was involved in speech-analytics-based application services. Later that year, he joined the Human Language Technology Laboratory, National University of Singapore, as a Research Fellow and led the speaker verification group's research until March 2021. He currently works as a Research and Development (R&D) Manager at Fortemedia Singapore. He was one of the organizers of the special sessions "The Attacker's Perspective on Automatic Speaker Verification" and "Far-Field Speaker Verification Challenge 2020" at Interspeech 2020, of the Voice Conversion Challenge 2020, and of the Face-voice Association in Multilingual Environments (FAME) Challenge 2024 at ACM Multimedia 2024. He served as Publication Chair of the IEEE ASRU Workshop 2019, as one of the chairs of the Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, and as one of the Special Session Chairs of ISCSLP 2022 and APSIPA ASC 2025. He has also been honored as an APSIPA Distinguished Lecturer for the period 2025-2026. He is a Senior Member of IEEE and a member of ISCA and APSIPA. His research interests include speech and audio signal processing, speaker verification, anti-spoofing, voice conversion, pathological speech processing, emotion recognition, paralinguistics, sound event detection, machine learning, and deep learning.