Publications by Type: Conference Paper

2025
Wu D, Wang Y, Wu X, Qu T. Cross-attention Inspired Selective State Space Models for Target Sound Extraction, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). Hyderabad, India; 2025:1-5.
The Transformer model, particularly its cross-attention module, is widely used for feature fusion in target sound extraction, which extracts the signal of interest based on given clues. Despite its effectiveness, this approach suffers from low computational efficiency. Recent advancements in state space models, notably the latest work Mamba, have shown performance comparable to Transformer-based methods while significantly reducing computational complexity in various tasks. However, Mamba's applicability in target sound extraction is limited because it cannot capture dependencies between different sequences as cross-attention does. In this paper, we propose CrossMamba for target sound extraction, which leverages the hidden attention mechanism of Mamba to compute dependencies between the given clues and the audio mixture. The Mamba computation can be factored into a query, key, and value. We use the clue to generate the query and the audio mixture to derive the key and value, following the principle of the cross-attention mechanism in Transformers. Experimental results from two representative target sound extraction methods validate the efficacy of the proposed CrossMamba.
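A minimal sketch of the cross-attention reading of a selective SSM described in this abstract: the clue drives the readout matrix C (the "query") while the mixture supplies B and the input x (the "keys" and "values"). All projection names, shapes, and the scalar decay are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_state = 100, 64, 16

clue = rng.standard_normal((T, d_model))     # conditioning sequence (e.g., target-class embedding)
mixture = rng.standard_normal((T, d_model))  # audio mixture features

# Input-dependent SSM parameters, following the cross-attention analogy:
#   C (readout)   ~ query  -> generated from the clue
#   B (input map) ~ key    -> generated from the mixture
#   x (input)     ~ value  -> the mixture itself
W_C = rng.standard_normal((d_model, d_state)) / np.sqrt(d_model)
W_B = rng.standard_normal((d_model, d_state)) / np.sqrt(d_model)
W_dt = rng.standard_normal((d_model, 1)) / np.sqrt(d_model)

C = clue @ W_C                            # (T, d_state): "queries" from the clue
B = mixture @ W_B                         # (T, d_state): "keys" from the mixture
dt = np.logaddexp(0.0, mixture @ W_dt)    # softplus step sizes > 0
A_bar = np.exp(-dt)                       # (T, 1): stable per-step decay

# Linear-time recurrence: h_t = A_bar_t * h_{t-1} + dt_t * outer(B_t, x_t)
h = np.zeros((d_state, d_model))
y = np.empty_like(mixture)
for t in range(T):
    h = A_bar[t] * h + dt[t] * np.outer(B[t], mixture[t])
    y[t] = C[t] @ h                       # clue-driven readout of the mixture state

print(y.shape)  # (T, d_model) fused features for the extraction backbone
```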
You Y, Wu X, Qu T. TA-V2A: Textually Assisted Video-to-Audio Generation, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). Hyderabad, India; 2025:1-5.
As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While Transformer and Diffusion models have advanced audio generation, a significant challenge persists in extracting precise semantic information from videos, as current models often lose sequential context by relying solely on frame-based features. To address this, we present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space. By incorporating large language models for enhanced video comprehension, our approach leverages text guidance to enrich semantic expression. Our diffusion model-based system utilizes automated text modulation to enhance inference quality and efficiency, providing personalized control through text-guided interfaces. This integration enhances semantic expression while ensuring temporal alignment, leading to more accurate and coherent video-to-audio generation.
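A toy sketch of the fusion step this abstract describes: per-frame video features and an LLM-derived text embedding are projected into a shared latent and combined as a temporally aligned condition for the diffusion model. The module name and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class TextAssistedCondition(nn.Module):
    def __init__(self, d_video=512, d_text=768, d_latent=256):
        super().__init__()
        self.video_proj = nn.Linear(d_video, d_latent)
        self.text_proj = nn.Linear(d_text, d_latent)

    def forward(self, video_feats, text_emb):
        # video_feats: (B, T, d_video) frame features; text_emb: (B, d_text)
        v = self.video_proj(video_feats)              # (B, T, d_latent)
        t = self.text_proj(text_emb).unsqueeze(1)     # (B, 1, d_latent)
        # Broadcast the text token across time so the condition stays
        # temporally aligned while carrying global semantics.
        return torch.cat([v, t.expand(-1, v.size(1), -1)], dim=-1)

cond = TextAssistedCondition()(torch.randn(2, 32, 512), torch.randn(2, 768))
print(cond.shape)  # torch.Size([2, 32, 512]) condition fed to the diffusion model
```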
Wu D, Du J, Qu T, Huang Q, Zhang D. Moving Sound Source Localization and Tracking based on Envelope Estimation for Unknown Number of Sources, in the AES 158th Convention. Warsaw, Poland; 2025:10216.
Existing methods for moving sound source localization and tracking face significant challenges when dealing with an unknown number of sound sources, which substantially limits their practical applications. This paper proposes a moving sound source tracking method based on source signal envelopes that does not require prior knowledge of the number of sources. First, an encoder-decoder attractor (EDA) method is used to estimate the number of sources and obtain an attractor for each source, based on which the signal envelope of each source is estimated. This signal envelope is then used as a clue for tracking the target source. The proposed method has been validated through simulation experiments. Experimental results demonstrate that the proposed method can accurately estimate the number of sources and precisely track each source.
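A pipeline skeleton for the counting-then-tracking idea above. The EDA module is stubbed out (hardcoded existence probabilities, random attractors), and the envelope is computed as a frame-attractor similarity; every name here is a stand-in, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def eda_stub(mix_emb, max_sources=4):
    """Stand-in for an encoder-decoder attractor module: one attractor per
    candidate source plus an existence probability (hardcoded for the demo)."""
    attractors = rng.standard_normal((max_sources, mix_emb.shape[-1]))
    probs = np.linspace(0.9, 0.1, max_sources)
    return attractors, probs

mix_emb = rng.standard_normal((200, 64))      # (frames, embedding) mixture features
attractors, probs = eda_stub(mix_emb)
n_sources = int(np.sum(probs > 0.5))          # estimated number of sources

for k in range(n_sources):
    # Envelope estimate: similarity between frames and the k-th attractor,
    # used downstream as the clue that tells the tracker which localization
    # stream belongs to this source.
    envelope = np.maximum(mix_emb @ attractors[k], 0.0)
    envelope /= envelope.max() + 1e-8
    print(f"source {k}: envelope over {envelope.shape[0]} frames")
```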
Wu D, Wu X, Qu T. Room Geometry Inference Using Localization of the Sound Source and Its Early Reflections, in the AES 158th Convention. Warsaw, Poland; 2025:10215.
Traditional methods for inferring room geometry from sound signals are predominantly based on the Room Impulse Response (RIR) or prior knowledge of the sound source location, which significantly restricts their applicability. This paper presents a method for estimating room geometry based on the localization of the direct sound source and its early reflections from First-Order Ambisonics (FOA) signals, without prior knowledge of the environment. First, the method simultaneously estimates the Direction of Arrival (DOA) of the direct source and the detected first-order reflected sources. Then, a cross-attention-based network that implicitly extracts features related to the Time Difference of Arrival (TDOA) between the direct source and the first-order reflected sources is proposed to estimate the distances of the direct and first-order reflected sources. Finally, the room geometry is inferred from the localization results of the direct and first-order reflected sources. The effectiveness of the proposed method was validated through simulation experiments. The experimental results demonstrate that the proposed method achieves accurate localization and performs well in inferring room geometry.
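The final geometry step admits a short worked example: with the array at the origin, once the direct source and a first-order image source are localized (DOA plus distance), the reflecting wall is the perpendicular bisector plane of the segment joining them. The positions below are made up for illustration.

```python
import numpy as np

def sph_to_cart(azimuth, elevation, distance):
    return distance * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])

direct = sph_to_cart(np.deg2rad(30), 0.0, 2.0)    # localized direct source
image  = sph_to_cart(np.deg2rad(-50), 0.0, 4.5)   # localized first-order image source

normal = image - direct
normal /= np.linalg.norm(normal)                  # wall normal
midpoint = 0.5 * (direct + image)                 # a point on the wall
d = normal @ midpoint                             # plane: normal . x = d
# Repeating this for every detected reflection yields the room's walls.
print("wall normal:", normal.round(3), " offset d =", round(float(d), 3))
```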
You Y, Qian Y, Qu T, Wang B, Lv X. Spherical harmonic beamforming based Ambisonics encoding and upscaling method for smartphone microphone array, in the AES 158th Convention. Warsaw, Poland; 2025:10230.
With the rapid development of virtual reality (VR) and augmented reality (AR), spatial audio recording and reproduction have gained increasing research interest. Higher Order Ambisonics (HOA) stands out for its adaptability to various playback devices and its ability to integrate head orientation. However, current HOA recordings often rely on bulky spherical microphone arrays (SMA), and portable devices like smartphones are limited by array configuration and the number of microphones. We propose SHB-AE, a spherical harmonic beamforming based method for Ambisonics encoding using a smartphone microphone array (SPMA). By designing beamformers for each order of spherical harmonic functions based on the array manifold, the method enables Ambisonics encoding and up-scaling. Validation on a real SPMA and its simulated free-field counterpart in noisy and reverberant conditions showed that the method successfully encodes and up-scales Ambisonics up to the fourth order with just four irregularly arranged microphones.
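A minimal sketch of the beamformer-design step: per frequency, solve regularized least-squares weights so the 4-mic array's response over a direction grid matches each spherical-harmonic pattern. A free-field plane-wave manifold stands in for the measured SPMA manifold, and only the first-order channels are shown; everything here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n_mics, n_dirs = 4, 240
f, c = 2000.0, 343.0
k = 2 * np.pi * f / c

mic_pos = rng.uniform(-0.08, 0.08, size=(n_mics, 3))      # irregular layout (m)
az = rng.uniform(0, 2 * np.pi, n_dirs)
el = rng.uniform(-np.pi / 2, np.pi / 2, n_dirs)
u = np.stack([np.cos(el) * np.cos(az),
              np.cos(el) * np.sin(az),
              np.sin(el)], axis=1)                        # unit direction vectors

A = np.exp(1j * k * (u @ mic_pos.T))                      # (n_dirs, n_mics) manifold

# First-order real SH targets (ACN order: W, Y, Z, X); higher orders follow
# the same recipe with higher-degree spherical harmonics.
Y = np.stack([np.ones(n_dirs), u[:, 1], u[:, 2], u[:, 0]], axis=1)

# Regularized least squares: A @ W_enc ~= Y, one beamformer per SH channel.
W_enc = np.linalg.solve(A.conj().T @ A + 1e-3 * np.eye(n_mics), A.conj().T @ Y)
print("encoding matrix for this frequency bin:", W_enc.shape)  # (n_mics, 4)
```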
2024
Yuan Z, Gao S, Wu X, Qu T. Spatial Covariant Matrix based Learning for DOA Estimation in Spherical Harmonics Domain, in the AES 156th Convention. Madrid, Spain; 2024:10701.
Direction of arrival (DoA) estimation in complex environments is a challenging task. Traditional methods become invalid under low signal-to-noise ratio (SNR) and reverberation conditions, while data-driven methods lack generalization to unseen data types. In this paper, we propose a robust DoA estimation approach that combines the two. To focus on spatial information modeling, the proposed method directly uses the compressed covariance matrix of the first-order ambisonics (FOA) signal as input, while only white noise is used during training. To adapt to the different characteristics of FOA signals in different frequency bands, our method estimates the DoA in each frequency band with a dedicated model, and the subband results are finally integrated. Experiments are carried out on both simulated and measured datasets, and the results show the superiority of the proposed method over existing baselines under complex conditions, as well as its scalability to unseen data types.
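An input-feature sketch for the idea above: the per-subband spatial covariance of the 4-channel FOA STFT, compressed to its upper triangle (real and imaginary parts), one vector per per-band model. STFT sizes and band edges are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_ch, n_frames, n_bins = 4, 50, 257
X = rng.standard_normal((n_ch, n_frames, n_bins)) \
    + 1j * rng.standard_normal((n_ch, n_frames, n_bins))   # stand-in FOA STFT

bands = [(1, 64), (64, 128), (128, 257)]                    # subband bin ranges
features = []
for lo, hi in bands:
    Xb = X[:, :, lo:hi].reshape(n_ch, -1)                   # pool frames x bins
    R = (Xb @ Xb.conj().T) / Xb.shape[1]                    # (4, 4) spatial covariance
    iu = np.triu_indices(n_ch)                              # keep upper triangle only
    features.append(np.concatenate([R[iu].real, R[iu].imag]))

feat = np.stack(features)   # (n_bands, 20): one compressed input per subband model
print(feat.shape)
```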
Gao S, Wu X, Qu T. DOA-Informed Self-Supervised Learning Method for Sound Source Enhancement, in the AES 156th Convention. Madrid, Spain; 2024:10683.
Multi-channel sound source enhancement methods have made great progress in recent years, especially when combined with learning-based algorithms. However, the performance of these techniques is limited by the completeness of the training dataset and may degrade in mismatched environments. In this paper, we propose a Reconstruction Model based Self-supervised Learning (RMSL) method for sound source enhancement. A reconstruction module is used to integrate the estimated target signal and noise components to regenerate the multi-channel mixed signals, and it is connected with a separation model to form a closed loop. In this case, the separation model can be optimized by continuously iterating the separation-reconstruction process. We use the separation error, the reconstruction error, and the signal-noise independence error as loss functions in the self-supervised learning process. The method is applied to a state-of-the-art sound source separation model (ADL-MVDR) and evaluated under different scenarios. Experimental results demonstrate that the proposed method improves the performance of the ADL-MVDR algorithm for different numbers of sound sources, bringing about 0.5 dB to 1 dB of Si-SNR gain while maintaining good clarity and intelligibility in practical applications.
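A sketch of how the three losses in that closed loop might combine. The separator, reconstruction module, and DOA-informed reference are all stubbed with placeholder tensors, and the weighting factors are hypothetical; this only shows the shape of the self-supervised objective.

```python
import torch

B, C, T = 2, 4, 16000
mixture = torch.randn(B, C, T)
est_target = torch.randn(B, C, T, requires_grad=True)   # separator output (stub)
est_noise = torch.randn(B, C, T, requires_grad=True)    # noise estimate (stub)

recon = est_target + est_noise                          # reconstruction module (stub)
pseudo_target = 0.5 * mixture                           # stand-in for a DOA-informed reference

loss_sep = torch.mean((est_target - pseudo_target) ** 2)   # separation error
loss_recon = torch.mean((recon - mixture) ** 2)            # mixture consistency
# Independence: penalize correlation between target and noise estimates.
t = est_target - est_target.mean(dim=-1, keepdim=True)
n = est_noise - est_noise.mean(dim=-1, keepdim=True)
loss_indep = torch.mean((t * n).sum(dim=-1) ** 2) / (T * T)

loss = loss_sep + loss_recon + 0.1 * loss_indep         # hypothetical weighting
loss.backward()                                         # drives the iterative loop
print(float(loss))
```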
Ge Z, Li L, Qu T. A Hybrid Time and Time-frequency Domain Implicit Neural Representation for Acoustic Fields, in the AES 156th Convention. Madrid, Spain; 2024:Express paper 196.
Creating an immersive scene relies on detailed spatial sound. Traditional methods, which store impulse responses at probe points, require substantial storage, while geometry-based simulations struggle with complex sound effects. Neural-based methods are now improving accuracy while sharply reducing storage requirements. In this study, we propose a hybrid time and time-frequency domain strategy to model the time series of Ambisonic acoustic fields. The network excels at generating high-fidelity time-domain impulse responses at arbitrary source-receiver positions by learning a continuous representation of the acoustic field. Our experimental results demonstrate that the proposed model outperforms baseline methods in various aspects of sound representation and rendering for different source-receiver positions.
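A minimal sketch of such an implicit neural representation: an MLP maps a (source, receiver) position pair to an impulse-response window, and the hybrid objective is reduced to a time-domain term plus an STFT-magnitude term. Architecture, IR length, and loss weights are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class AcousticFieldINR(nn.Module):
    def __init__(self, ir_len=2048, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, ir_len),
        )

    def forward(self, src, rcv):                 # src, rcv: (B, 3) positions
        return self.net(torch.cat([src, rcv], dim=-1))

model = AcousticFieldINR()
src, rcv = torch.rand(8, 3), torch.rand(8, 3)
ir_pred = model(src, rcv)                        # (8, 2048) time-domain IRs
ir_true = torch.randn(8, 2048)                   # placeholder ground truth

# Hybrid loss: time-domain MSE plus a time-frequency magnitude term.
spec = lambda x: torch.stft(x, 256, return_complex=True).abs()
loss = nn.functional.mse_loss(ir_pred, ir_true) \
     + nn.functional.l1_loss(spec(ir_pred), spec(ir_true))
print(float(loss))
```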
Qian Y, Qu T, Tang W, Chen S, Shen W, Guo X, Chai H. Automotive acoustic channel equalization method using convex optimization in modal domain, in the AES 156th Convention. Madrid, Spain; 2024:11696.
Automotive audio systems often suffer from sub-optimal sound quality due to the intricate acoustic properties of car cabins. Acoustic channel equalization methods are generally employed to improve sound reproduction quality in such environments. In this paper, we propose an acoustic channel equalization method using convex optimization in the modal domain. The modal domain representation is used to model the whole sound field to be equalized. Besides integrating this representation into the convex formulation of the acoustic channel reshaping problem, we further control pre-ringing artifacts by using, during equalizer design, a temporal window function modified according to the backward masking effect of the human auditory system. Objective and subjective experiments in a real automotive cabin show that the proposed method enhances spatial robustness and avoids audible pre-ringing artifacts.
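A toy version of windowed equalizer design as a convex program (assuming cvxpy is available): minimize the weighted error between the equalized response and a delayed target, with heavy weight on energy before the direct sound to suppress pre-ringing. A single synthetic response stands in for the modal-domain sound field, and all constants are illustrative.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
L_h, L_g = 64, 32                          # room response / equalizer lengths
h = rng.standard_normal(L_h) * np.exp(-np.arange(L_h) / 10.0)  # decaying response

# Convolution matrix so that (H @ g) equals h convolved with g.
H = np.zeros((L_h + L_g - 1, L_g))
for i in range(L_g):
    H[i:i + L_h, i] = h

delay = 8
target = np.zeros(L_h + L_g - 1)
target[delay] = 1.0                        # desired: a clean delayed impulse
# Temporal weighting inspired by backward masking: heavily penalize energy
# arriving before the direct sound (pre-ringing), gently after it.
w = np.where(np.arange(L_h + L_g - 1) < delay, 10.0, 1.0)

g = cp.Variable(L_g)
problem = cp.Problem(cp.Minimize(cp.sum_squares(cp.multiply(w, H @ g - target))))
problem.solve()
print("weighted residual:", problem.value)
```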
Wu D, Wu X, Qu T. Exploiting Motion Information in Sound Source Localization and Tracking, in the AES 156th Convention. Madrid, Spain; 2024:10687.
Deep neural networks can be employed to estimate the direction of arrival (DOA) of individual sound sources from audio signals. Existing methods mostly estimate the DOA of each source on individual frames, without utilizing the motion information of the sources. This paper proposes a method for estimating source trajectories that leverages the differentials of trajectories across different time scales. Additionally, a neural network is employed to refine wrongly estimated trajectories, especially for low-energy sound sources. Experimental evaluations on a simulated dataset validate that the proposed method achieves more precise localization and tracking and suffers less interference when the sound source energy is low.
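A small sketch of the multi-scale motion cue: differentials of a DOA trajectory at several frame offsets give velocity-like features that a refinement network could use where a low-energy source was mislocalized. The trajectory and the choice of scales are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)
n_frames = 200
# Noisy azimuth track (degrees): linear motion plus per-frame estimation noise.
azimuth = 30 + 0.2 * np.arange(n_frames) + 3 * rng.standard_normal(n_frames)

scales = [1, 5, 25]                               # frame offsets (time scales)
diffs = np.stack([
    (azimuth[s:] - azimuth[:-s])[:n_frames - max(scales)] / s   # deg/frame at scale s
    for s in scales
])
print(diffs.shape)  # (3, 175): multi-scale motion features per frame
```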
Wu D, Wu X, Qu T. A Hybrid Deep-Online Learning Based Method for Active Noise Control in Wave Domain, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). COEX, Seoul, Korea; 2024:1-5.
Traditional feedback Active Noise Control (ANC) algorithms are built upon linear filters, which leads to reduced performance when dealing with real-world noise. Deep learning-based feedback ANC algorithms have been proposed to overcome this problem. However, methods relying on pre-trained neural networks exhibit performance degradation when encountering noise from scenes unseen in the training dataset. This paper proposes a hybrid deep-online learning based spatial ANC system that combines online learning with pre-trained deep neural networks. The proposed method preserves performance on noise from trained scenes while improving the cancellation of noise from new scenes. Additionally, by incorporating wave domain decomposition, this paper achieves noise cancellation over a controlled spatial region. Simulation experiments validate the effectiveness of combining online learning and deep learning in handling previously unseen noise. Furthermore, the efficiency of wave domain decomposition in spatial noise cancellation is also verified.
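A sketch of the hybrid idea on one wave-domain coefficient channel: a frozen pre-trained predictor supplies an anti-noise estimate, and a lightweight online LMS filter adapts on the residual so unseen noise still gets cancelled. The "DNN" is a stub gain, and the step size and filter length are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(6)
N, L, mu = 4000, 16, 0.01
# Correlated (low-pass filtered) noise coefficient in the wave domain.
noise = np.convolve(rng.standard_normal(N), np.ones(8) / 8, mode="same")

dnn_pred = 0.7 * noise                            # frozen pre-trained anti-noise (stub)
w = np.zeros(L)                                   # online filter on the residual
err = np.zeros(N)
for n in range(L - 1, N):
    x = noise[n - L + 1:n + 1][::-1]              # reference frame (newest first)
    anti = dnn_pred[n] + w @ x                    # hybrid anti-noise signal
    err[n] = noise[n] - anti                      # residual at the error sensor
    w += mu * err[n] * x                          # online LMS adaptation

print("residual power early vs. late:",
      float(np.mean(err[L:500] ** 2)), float(np.mean(err[-500:] ** 2)))
```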
2023
Yuan Z, Wu D, Wu X, Qu T. Sound event localization and detection based on iterative separation in embedding space, in 2023 6th International Conference on Information Communication and Signal Processing (ICICSP). Xi'an, China; 2023:455-459.
Wang Y, Lan Z, Wu X, Qu T. TT-Net: Dual-Path Transformer Based Sound Field Translation in the Spherical Harmonic Domain, in International Conference on Acoustics, Speech and Signal Processing (ICASSP). Rhodes Island, Greece; 2023:1-5.
Ge Z, Tian P, Li L, Qu T. Rendering Near-field Point Sound Sources Through an Iterative Weighted Crosstalk Cancellation Method, in Audio Engineering Society Convention 154. Helsinki, Finland; 2023:10649.
2022
Qu T, Xu J, Yuan Z, Wu X. Higher order ambisonics compression method based on autoencoder, in Audio Engineering Society Convention 153. Online; 2022:Express paper 9.
The compression of three-dimensional sound field signals has always been a very important issue. Recently, an Independent Component Analysis (ICA) based Higher Order Ambisonics (HOA) compression method introduced blind source separation to address the inter-frame discontinuity of existing Singular Value Decomposition (SVD) based methods. However, ICA is weak at modeling reverberant environments, and its objective is not to recover the original signal. In this work, we replace ICA with an autoencoder to further improve the above method's ability to cope with reverberant conditions, and a reconstruction loss ensures that separation and recovery are optimized jointly. We constructed a dataset of simulated and recorded signals, and verified the effectiveness of our method through objective and subjective experiments.
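A minimal sketch of the autoencoder replacement: HOA frames are encoded into a few transport channels and decoded back, with the reconstruction loss tying separation and recovery together. The channel counts (3rd-order HOA to 4 transport channels) and 1x1 convolutions are one illustrative choice, not the paper's architecture.

```python
import torch
import torch.nn as nn

hoa_ch, code_ch, T = 16, 4, 1024      # (3rd order + 1)^2 = 16 HOA channels

class HOAAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv1d(hoa_ch, code_ch, kernel_size=1)   # downmix / compress
        self.dec = nn.Conv1d(code_ch, hoa_ch, kernel_size=1)   # upmix / recover

    def forward(self, x):
        code = self.enc(x)            # (B, 4, T): channels sent to a core codec
        return self.dec(code), code

model = HOAAutoencoder()
hoa = torch.randn(2, hoa_ch, T)
recon, code = model(hoa)
loss = nn.functional.mse_loss(recon, hoa)   # reconstruction loss drives both stages
loss.backward()
print(code.shape, float(loss))
```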
Chao P, Wang Y, Wu X, Qu T. A Multi-channel Speech Separation System for Unknown Number of Multiple Speakers, in 2022 5th International Conference on Information Communication and Signal Processing (ICICSP). Shenzhen, China; 2022.
Gao S, Wu X, Qu T. Localization of Direct Source and Early Reflections Using HOA Processing and DNN Model, in Audio Engineering Society Convention 152.; 2022:10560.
Wang Y, Wu X, Qu T. UP-WGAN: Upscaling Ambisonic Sound Scenes Using Wasserstein Generative Adversarial Networks, in Audio Engineering Society Convention 152.; 2022:10577.
2021
Chen J, Wu X, Qu T. Early Reflections Based Speech Enhancement, in 2021 4th International Conference on Information Communication and Signal Processing (ICICSP). Shanghai, China; 2021:183-187.
Xu J, Niu Y, Wu X, Qu T. Higher order ambisonics compression method based on independent component analysis, in Audio Engineering Society Convention 150.; 2021:10456.
