Research Outputs by Year: 2025

Wu D, Wang Y, Wu X, Qu T. Cross-attention Inspired Selective State Space Models for Target Sound Extraction, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). Hyderabad, India; 2025:1-5.
The Transformer model, particularly its cross-attention module, is widely used for feature fusion in target sound extraction, which extracts the signal of interest based on given clues. Despite its effectiveness, this approach suffers from low computational efficiency. Recent advancements in state space models, notably the latest work Mamba, have shown performance comparable to Transformer-based methods while significantly reducing computational complexity in various tasks. However, Mamba's applicability in target sound extraction is limited because it cannot capture dependencies between different sequences as cross-attention does. In this paper, we propose CrossMamba for target sound extraction, which leverages the hidden attention mechanism of Mamba to compute dependencies between the given clues and the audio mixture. The computation of Mamba can be divided into query, key, and value terms. We utilize the clue to generate the query and the audio mixture to derive the key and value, adhering to the principle of the cross-attention mechanism in Transformers. Experimental results from two representative target sound extraction methods validate the efficacy of the proposed CrossMamba.
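To make the query/key/value correspondence concrete, below is a minimal sketch of a cross-attention-style selective scan, assuming a simplified diagonal SSM in PyTorch. The class name, shapes, and projections are illustrative stand-ins, not the paper's implementation: the clue drives the readout C (query role), while the mixture drives the input gate B (key role) and supplies the value stream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossSSMSketch(nn.Module):
    """Toy cross-attention-style selective SSM (illustrative only):
    the clue plays the query role (C); the mixture supplies the
    key (B) and the value stream (x)."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.to_C = nn.Linear(d_model, d_state)       # query, from the clue
        self.to_B = nn.Linear(d_model, d_state)       # key, from the mixture
        self.to_dt = nn.Linear(d_model, 1)            # input-dependent step size
        self.A = nn.Parameter(-torch.rand(d_state))   # negative diagonal -> stable decay

    def forward(self, mix: torch.Tensor, clue: torch.Tensor) -> torch.Tensor:
        # mix, clue: (batch, time, d_model)
        B, C = self.to_B(mix), self.to_C(clue)        # (b, t, n)
        dt = F.softplus(self.to_dt(mix))              # (b, t, 1)
        Abar = torch.exp(dt * self.A)                 # discretized decay, (b, t, n)
        h = mix.new_zeros(mix.size(0), mix.size(2), self.A.numel())
        ys = []
        for t in range(mix.size(1)):                  # sequential selective scan
            h = Abar[:, t, None] * h + (dt[:, t] * B[:, t])[:, None] * mix[:, t, :, None]
            ys.append((h * C[:, t, None]).sum(-1))    # readout "attends" to the clue
        return torch.stack(ys, dim=1)                 # (b, t, d_model)
```

For example, `CrossSSMSketch(64)(torch.randn(2, 100, 64), torch.randn(2, 100, 64))` returns a (2, 100, 64) tensor; a production version would replace the Python loop with a parallel scan.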
You Y, Wu X, Qu T. TA-V2A: Textually Assisted Video-to-Audio Generation, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). Hyderabad, India; 2025:1-5.
As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as current models often lose sequential context by relying solely on frame-based features. To address this, we present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion model-based system utilizes automated text modulation to enhance inference quality and efficiency, providing personalized control through text-guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video-to-audio generation.
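As a rough illustration of how text can assist the conditioning in such a system, the following hypothetical sketch fuses per-frame video features with LLM caption embeddings and applies standard classifier-free guidance; the names (`TextVideoCondition`, `guided_eps`) and the fusion-by-concatenation choice are assumptions, not TA-V2A's actual design.

```python
import torch
import torch.nn as nn

class TextVideoCondition(nn.Module):
    """Hypothetical fusion of frame features and LLM caption embeddings
    into one conditioning sequence for a latent audio diffusion model."""
    def __init__(self, d_video: int, d_text: int, d_cond: int):
        super().__init__()
        self.proj_v = nn.Linear(d_video, d_cond)
        self.proj_t = nn.Linear(d_text, d_cond)

    def forward(self, video_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # video_feats: (b, frames, d_video); text_emb: (b, tokens, d_text)
        return torch.cat([self.proj_v(video_feats), self.proj_t(text_emb)], dim=1)

def guided_eps(model, x_t, t, cond, null_cond, scale: float = 3.0):
    """Classifier-free guidance: push the denoiser toward the
    text/video-conditioned prediction (generic recipe, not TA-V2A's exact one)."""
    eps_c = model(x_t, t, cond)        # conditioned noise estimate
    eps_u = model(x_t, t, null_cond)   # unconditioned estimate
    return eps_u + scale * (eps_c - eps_u)
```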
曲天书, 吴玺宏. Application of higher-order sound field recording and reproduction based on spherical microphone arrays in film audio production. 现代电影技术. 2025;(2):4-11.
As cinema pursues ever more immersive audiovisual experiences, immersive sound field recording and reproduction technologies are becoming increasingly important. Focusing on the sound field recording and reproduction problems in film audio production, this paper introduces Higher Order Ambisonics (HOA) analysis techniques based on spherical microphone arrays, and presents solutions to the low-frequency noise and high-frequency aliasing problems in the spherical harmonic decomposition of spherical microphone arrays, as well as the order-limitation problem in binaural reproduction. The study shows that the proposed solutions can provide audiences with a more realistic and immersive sound field reproduction, improving the viewing experience, and have broad application prospects in film audio production.
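As a toy illustration of the spherical-harmonic analysis involved, the sketch below performs regularized least-squares HOA encoding of spherical-array signals with numpy/scipy; the plain Tikhonov term is a stand-in for the radial-filter regularization the paper discusses against low-frequency noise, and the function name and sizes are illustrative.

```python
import numpy as np
from scipy.special import sph_harm

def hoa_encode(signals, azi, pol, order, reg=1e-2):
    """Least-squares HOA encoding of M microphone signals captured at
    directions (azi, pol) on a sphere. `reg` is a simple Tikhonov
    stand-in for regularized radial equalization. Sketch only."""
    # Spherical harmonic matrix Y: (M, (order+1)^2)
    Y = np.column_stack([sph_harm(m, n, azi, pol)
                         for n in range(order + 1)
                         for m in range(-n, n + 1)])
    # Regularized pseudo-inverse: b = (Y^H Y + reg I)^-1 Y^H p
    G = Y.conj().T @ Y + reg * np.eye(Y.shape[1])
    return np.linalg.solve(G, Y.conj().T @ signals)  # ((order+1)^2, time)

# e.g. 32 mics, 1 s of audio at 16 kHz, 3rd-order output
rng = np.random.default_rng(0)
coeffs = hoa_encode(rng.standard_normal((32, 16000)),
                    rng.uniform(0, 2 * np.pi, 32),
                    rng.uniform(0, np.pi, 32), order=3)
```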
Wu D, Du J, Qu T, Huang Q, Zhang D. Moving Sound Source Localization and Tracking based on Envelope Estimation for Unknown Number of Sources, in the AES 158th Convention. Warsaw, Poland; 2025:10216.
Existing methods for moving sound source localization and tracking face significant challenges when dealing with an unknown number of sound sources, which substantially limits their practical applications. This paper proposes a moving sound source tracking method based on source signal envelopes that does not require prior knowledge of the number of sources. First, an encoder-decoder attractor (EDA) method is used to estimate the number of sources and obtain an attractor for each source, based on which the signal envelope of each source is estimated. This signal envelope is then used as a clue for tracking the target source. The proposed method has been validated through simulation experiments. Experimental results demonstrate that the proposed method can accurately estimate the number of sources and precisely track each source.
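For readers unfamiliar with encoder-decoder attractors, here is a toy EDA module in the spirit of EEND-EDA-style counting (module names and sizes are illustrative, not the paper's architecture): an encoder summarizes the mixture embeddings, a decoder emits one attractor per step, and an existence head decides how many sources are present.

```python
import torch
import torch.nn as nn

class EDASketch(nn.Module):
    """Toy encoder-decoder attractor module for source counting."""
    def __init__(self, d: int):
        super().__init__()
        self.enc = nn.LSTM(d, d, batch_first=True)
        self.dec = nn.LSTM(d, d, batch_first=True)
        self.exist = nn.Linear(d, 1)

    def forward(self, emb: torch.Tensor, max_src: int = 6):
        # emb: (b, time, d) frame embeddings of the mixture
        _, state = self.enc(emb)                        # summarize the mixture
        zeros = emb.new_zeros(emb.size(0), max_src, emb.size(2))
        attractors, _ = self.dec(zeros, state)          # (b, max_src, d)
        p_exist = torch.sigmoid(self.exist(attractors)) # (b, max_src, 1)
        n_src = (p_exist.squeeze(-1) > 0.5).sum(-1)     # estimated source count
        return attractors, p_exist, n_src
```

Each retained attractor would then condition the envelope estimator that produces the tracking clue.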
Wu D, Wu X, Qu T. Room Geometry Inference Using Localization of the Sound Source and Its Early Reflections, in the AES 158th Convention. Warsaw, Poland; 2025:10215.
Traditional methods for inferring room geometry from sound signals are predominantly based on the Room Impulse Response (RIR) or prior knowledge of the sound source location, which significantly restricts the applicability of these approaches. This paper presents a method for estimating room geometry based on the localization of the direct sound source and its early reflections from First-Order Ambisonics (FOA) signals, without prior knowledge of the environment. First, the method simultaneously estimates the Direction of Arrival (DOA) of the direct source and the detected first-order reflected sources. Then, a cross-attention-based network that implicitly extracts features related to the Time Difference of Arrival (TDOA) between the direct source and the first-order reflected sources is proposed to estimate the distances of the direct and first-order reflected sources. Finally, the room geometry is inferred from the localization results of the direct and first-order reflected sources. The effectiveness of the proposed method was validated through simulation experiments. The experimental results demonstrate that the proposed method achieves accurate localization and performs well in inferring room geometry.
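The geometric step at the end has a closed form: under the image-source model, a first-order reflection behaves like a mirrored source, so the wall is the perpendicular bisector plane between the direct source and its image. A small numpy sketch (function name illustrative):

```python
import numpy as np

def wall_from_image_source(src: np.ndarray, img: np.ndarray):
    """Recover a wall plane from a direct source position and the
    position of its first-order image source. By the image-source
    model the wall is the perpendicular bisector of the two points.
    Returns (unit normal n, offset d) with the plane n.x = d."""
    n = img - src
    n = n / np.linalg.norm(n)   # wall normal points from source to image
    mid = 0.5 * (src + img)     # the midpoint lies on the wall
    return n, float(n @ mid)

# positions reconstructed from DOA + estimated distance, e.g. p = r * u(doa)
n, d = wall_from_image_source(np.array([1.0, 2.0, 1.5]),
                              np.array([1.0, -2.0, 1.5]))
print(n, d)   # -> [0. -1. 0.] 0.0  (wall plane y = 0)
```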
You Y, Qian Y, Qu T, Wang B, Lv X. Spherical harmonic beamforming based Ambisonics encoding and upscaling method for smartphone microphone array, in the AES 158th Convention. Warsaw, Poland; 2025:10230.
With the rapid development of virtual reality (VR) and augmented reality (AR), spatial audio recording and reproduction have gained increasing research interest. Higher Order Ambisonics (HOA) stands out for its adaptability to various playback devices and its ability to integrate head orientation. However, current HOA recordings often rely on bulky spherical microphone arrays (SMA), and portable devices like smartphones are limited by array configuration and number of microphones. We propose SHB-AE, a spherical harmonic beamforming based method for Ambisonics encoding using a smartphone microphone array (SPMA). By designing beamformers for each order of spherical harmonic functions based on the array manifold, the method enables Ambisonics encoding and up-scaling. Validation on a real SPMA and its simulated free-field counterpart in noisy and reverberant conditions showed that the method successfully encodes and up-scales Ambisonics up to the fourth order with just four irregularly arranged microphones.
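The per-order beamformer design can be read as a least-squares fit of the array's directional response to each spherical harmonic over a direction grid. Below is an illustrative numpy/scipy sketch under that reading; `shb_weights`, the regularizer, and the toy 4-mic free-field manifold are assumptions, not the SHB-AE implementation.

```python
import numpy as np
from scipy.special import sph_harm

def shb_weights(A: np.ndarray, azi: np.ndarray, pol: np.ndarray,
                order: int, reg: float = 1e-3) -> np.ndarray:
    """Per-frequency beamformer weights W such that W^H a(theta)
    approximates each spherical harmonic Y_nm(theta) over a grid.
    A: array manifold, (mics, directions) at one frequency bin."""
    Y = np.column_stack([sph_harm(m, n, azi, pol)
                         for n in range(order + 1)
                         for m in range(-n, n + 1)])   # (dirs, (order+1)^2)
    # Normal equations for min ||A^H W - Y*||_F^2  ->  W: (mics, channels)
    G = A @ A.conj().T + reg * np.eye(A.shape[0])
    return np.linalg.solve(G, A @ Y.conj())

# usage with a simulated free-field manifold (hypothetical 4-mic layout)
k = 2 * np.pi * 1000 / 343.0                       # wavenumber at 1 kHz
mics = np.random.default_rng(1).uniform(-0.07, 0.07, (4, 3))
azi = np.random.default_rng(2).uniform(0, 2 * np.pi, 200)
pol = np.random.default_rng(3).uniform(0, np.pi, 200)
u = np.stack([np.sin(pol) * np.cos(azi), np.sin(pol) * np.sin(azi), np.cos(pol)])
A = np.exp(-1j * k * mics @ u)                     # (4, 200) steering vectors
W = shb_weights(A, azi, pol, order=1)              # first-order encoding weights
```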
曲天书, 吴玺宏, 吴东航, 杜佳琪. 2025. A moving sound source localization and tracking method for an unknown number of sources based on envelope estimation. China patent CN 202510538463.1.
曲天书, 吴玺宏, 吴东航. 2025. A room geometry inference method based on localization of the direct sound source and first-order reflected sources. China patent CN 202510556943.0.
曲天书, 吴玺宏, 游宇寰. 2025. A text-assisted video-to-audio generation method. China patent CN 202510298019.7.
曲天书, 吴玺宏, 吴东航. 2025. A target sound extraction method based on a cross-attention state space model. China patent CN 202510290916.3.
Wu D, Wu X, Qu T. Leveraging Sound Source Trajectories for Universal Sound Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2025;33:2337-2348.
Existing methods utilizing spatial information for sound source separation require prior knowledge of the direction of arrival (DOA) of the source or rely on estimated but imprecise localization results, which impairs separation performance, especially when the sound sources are moving. In fact, sound source localization and separation are interconnected problems: sound source localization facilitates sound separation, while sound separation contributes to refined source localization. This paper proposes a method utilizing the mutual facilitation mechanism between sound source localization and separation for moving sources. The proposed method comprises three stages. The first stage is initial tracking, which tracks each sound source from the audio mixture based on source signal envelope estimation. These tracking results may lack sufficient accuracy. The second stage involves mutual facilitation: sound separation is conducted using the preliminary sound source tracking results; sound source tracking is then performed on the separated signals, refining the tracking precision; and the refined trajectories further improve separation performance. This mutual facilitation process can be iterated multiple times. In the third stage, a neural beamformer estimates precise single-channel separation results based on the refined tracking trajectories and multi-channel separation outputs. Simulation experiments conducted under reverberant conditions with moving sound sources demonstrate that the proposed method achieves more accurate separation based on the refined tracking results.
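The three-stage control flow reads roughly as the skeleton below; the three callables are hypothetical placeholders for the paper's learned tracker, separator, and neural beamformer, so only the iteration structure is shown.

```python
# Hypothetical stand-ins for the paper's learned components; each
# would be a trained network in practice.
def track(audio):                       # -> list of per-source trajectories
    raise NotImplementedError
def separate(mixture, trajectories):    # -> list of multi-channel source estimates
    raise NotImplementedError
def neural_beamform(mixture, trajectories, sep_estimates):
    raise NotImplementedError           # -> single-channel per-source outputs

def mutual_facilitation(mixture, n_rounds: int = 2):
    """Skeleton of the three-stage pipeline described in the abstract
    (control flow only; assumes n_rounds >= 1)."""
    trajs = track(mixture)                          # stage 1: coarse tracks
    for _ in range(n_rounds):                       # stage 2: mutual refinement
        seps = separate(mixture, trajs)             # separate with current tracks
        trajs = [track(s)[0] for s in seps]         # re-track on cleaner signals
    return neural_beamform(mixture, trajs, seps)    # stage 3: precise output
```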