[Informatics] [2018.02] Speech Dereverberation in Noisy Environments Using Time-Frequency Domain Signal Models

This post introduces the doctoral thesis of Sebastian Braun, University of Erlangen-Nuremberg, Germany. The thesis is 164 pages long.

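Reverberation is the superposition of all reflected sound waves and is present in any conventional room. Speech communication devices such as mobile phones in hands-free mode, tablets, smart TVs, teleconferencing systems, hearing aids, and voice-controlled systems use one or more microphones to capture the desired speech signal. When the microphones are not close to the desired source, strong reverberation and noise degrade the quality of the captured signal and impair both intelligibility and the performance of automatic speech recognizers. Processing the microphone signals to reduce reverberation and noise is therefore a task in high demand. The process of reducing or removing reverberation from recorded signals is called dereverberation.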

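Dereverberation is usually a completely blind problem in which the microphone signals are the only available information, and the acoustic scene can be non-stationary, so it is one of the most challenging tasks in speech enhancement. In theory, perfect dereverberation can be achieved by inverse filtering under certain conditions when the room impulse response (RIR) is known. In practice, blind identification of the RIR is not sufficiently accurate or robust in time-varying and noisy conditions. Successful dereverberation methods have therefore been developed in the time-frequency domain, and they often relax the problem to partial dereverberation that mainly reduces the late reverberation tail. Although several robust and efficient methods proposed in recent years reduce the late reverberation tail to some extent, it remains challenging to obtain a dereverberated signal of high audio quality, free of speech distortion and artifacts, with real-time processing at minimal delay. The thesis focuses on robust dereverberation methods for online processing in real-time speech communication systems.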

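Dereverberation can exploit both temporal and spatial information. First, reverberation introduces correlation over time and extends the duration of phonemes or sound events. By exploiting this temporal correlation, filters can be derived that extract the desired speech signal or reduce the reverberation. Second, with multiple microphones, spatial information can be used to distinguish the coherent direct sound from the reverberation, which is spatially diffuse. To extract the coherent sound, spatial filters (also known as beamformers) combine the microphone signals so that only sound from a certain direction is extracted, while sound from other directions and diffuse components are suppressed. The thesis uses several signal models that describe reverberation in terms of both time and space. All of them are defined in the short-time Fourier transform (STFT) domain, which is widely used in speech and audio processing and therefore allows simple integration with other existing techniques. In particular, the thesis uses a narrowband moving average model, a narrowband multichannel autoregressive model, and a spatial coherence based model. For each of these three signal models, a method for dereverberation and noise reduction is proposed.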
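To make the spatial filtering idea concrete, here is a minimal sketch of a delay-and-sum beamformer acting on the STFT coefficients of a single frequency bin. It is only an illustration of a generic beamformer, not a method from the thesis. The far-field plane-wave assumption, the linear array geometry, and the names `steering_vector` and `delay_and_sum` are all assumptions made for this example.

```python
import numpy as np

def steering_vector(freq_hz, mic_positions_m, doa_rad, speed_of_sound=343.0):
    """Far-field steering vector of a linear array at one frequency (assumed model)."""
    delays = mic_positions_m * np.cos(doa_rad) / speed_of_sound
    return np.exp(-2j * np.pi * freq_hz * delays)

def delay_and_sum(y, steer):
    """y: (frames, mics) STFT coefficients of one bin. Returns (frames,) output w^H y."""
    weights = steer / steer.size
    return y @ weights.conj()

# Hypothetical setup: 4 microphones spaced 5 cm apart, one bin at 2 kHz.
mics = np.array([0.0, 0.05, 0.10, 0.15])
look = steering_vector(2000.0, mics, np.pi / 2)   # look direction: broadside
other = steering_vector(2000.0, mics, 0.0)        # interferer: endfire

rng = np.random.default_rng(0)
s = rng.standard_normal(20) + 1j * rng.standard_normal(20)

out_look = delay_and_sum(s[:, None] * look[None, :], look)
out_other = delay_and_sum(s[:, None] * other[None, :], look)
gain_other = np.abs(out_other[0] / s[0])
```

A plane wave from the look direction passes with unit gain, while one from the other direction is attenuated. Diffuse reverberation arrives from all directions at once, which is why a beamformer alone reduces it only partially.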

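The first main contribution is a single-channel estimator of the late reverberation power spectral density (PSD), which is needed to compute a Wiener filter that reduces reverberation and noise. The proposed reverberation PSD estimator is based on a narrowband moving average model using relative convolutive transfer functions (RCTFs). In contrast to other single-channel reverberation PSD estimators, the proposed estimator explicitly models time-varying acoustic conditions and additive noise, and it requires no prior information on the room acoustics such as the reverberation time or the direct-to-reverberation ratio (DRR).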
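The role of the reverberation PSD can be sketched as follows: once PSD estimates for the late reverberation and the noise are available, a Wiener-type spectral gain follows per time-frequency bin. The snippet below does not implement the RCTF-based estimator from the thesis; it assumes the PSDs are already given. The function name `wiener_gain`, the gain floor, and the numbers are hypothetical.

```python
import numpy as np

def wiener_gain(psd_mic, psd_reverb, psd_noise, gain_floor=0.1):
    """Spectral gain per bin: 1 - (reverb + noise) / microphone PSD, limited to [floor, 1]."""
    psd_mic = np.maximum(psd_mic, 1e-12)  # guard against division by zero
    gain = 1.0 - (psd_reverb + psd_noise) / psd_mic
    # The floor limits the maximum attenuation, a common way to reduce artifacts.
    return np.clip(gain, gain_floor, 1.0)

# Hypothetical PSD values for three frequency bins of one frame.
psd_mic = np.array([4.0, 2.0, 1.0])
psd_reverb = np.array([1.0, 1.0, 0.8])
psd_noise = np.array([1.0, 0.5, 0.4])

gains = wiener_gain(psd_mic, psd_reverb, psd_noise)
```

The enhanced STFT coefficient is the microphone coefficient multiplied by this gain. In the third bin the interference estimate exceeds the microphone PSD, so the gain is held at the floor instead of going negative.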

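The second main contribution is a multichannel reverberation PSD estimator based on spatial coherence, in which the reverberation is modeled as an additive diffuse sound component with a time-invariant spatial coherence. In the multichannel case, the desired signal can be estimated by a multichannel Wiener filter (MWF), which requires the reverberation PSD. To mitigate speech distortion and artifacts, a generalized method is proposed that controls the attenuation of reverberation and noise at the MWF output independently. Because a wide variety of single-channel and multichannel reverberation PSD estimators exists, the thesis also provides an overview, comparison, and benchmark of state-of-the-art estimators. To address a weakness common to all reverberation PSD estimators, a bias compensation for high DRRs is proposed.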
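A time-invariant spatial coherence for diffuse sound can be illustrated with the standard spherically isotropic field model, where the coherence between two omnidirectional microphones is a sinc function of frequency and spacing. This summary does not state which coherence model the thesis uses, so treat the choice below as an assumption. The function name and parameter values are hypothetical.

```python
import numpy as np

def diffuse_coherence(freqs_hz, mic_distance_m, speed_of_sound=343.0):
    """Coherence of a spherically isotropic sound field between two omni microphones.

    np.sinc(x) computes sin(pi x) / (pi x), so the argument 2 f d / c gives
    sin(2 pi f d / c) / (2 pi f d / c).
    """
    return np.sinc(2.0 * freqs_hz * mic_distance_m / speed_of_sound)

d = 0.1                                  # assumed microphone spacing in metres
first_zero_hz = 343.0 / (2.0 * d)        # coherence first crosses zero here
freqs = np.array([0.0, 500.0, first_zero_hz])
coh = diffuse_coherence(freqs, d)
```

The coherence is 1 at 0 Hz and decays with frequency, reaching its first zero at c / (2 d). A coherence-based estimator relies on this contrast between the fully coherent direct sound and the weakly coherent diffuse component.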

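The third main contribution is an online solution for dereverberation and noise reduction based on a narrowband multichannel autoregressive (MAR) signal model for time-varying acoustic environments. With this model, the late reverberation is predicted from previous reverberant speech samples using the MAR coefficients and is then subtracted from the current reverberant signal. A main novelty of the approach is a parallel estimation structure that yields causal estimates of the time-varying MAR coefficients in noisy environments. In addition, a method is proposed to control the amount of reverberation reduction and noise reduction independently.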
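The predict-and-subtract step can be sketched for one frequency bin as follows. The coefficients are assumed to be known and time-invariant here, and the signal is noise-free; the online estimation of time-varying coefficients in noise, which is the actual contribution of the thesis, is not shown. The function name `mar_dereverb`, the array layout, and all parameter values are assumptions made for the example.

```python
import numpy as np

def mar_dereverb(x, coeffs, delay):
    """Subtract the MAR-predicted late reverberation in one frequency bin.

    x:      (frames, mics) complex reverberant STFT coefficients
    coeffs: (order, mics, mics) MAR matrices for lags delay, delay + 1, ...
    delay:  first lag used for prediction, must be >= 1
    """
    out = x.copy()
    for k, c in enumerate(coeffs):
        lag = delay + k
        # Row n of (x[:-lag] @ c.T) equals c @ x[n - lag], the prediction from that lag.
        out[lag:] -= x[:-lag] @ c.T
    return out

# Synthetic check: generate a signal that follows the MAR model exactly.
rng = np.random.default_rng(0)
frames, mics, delay, order = 50, 2, 2, 3
s = rng.standard_normal((frames, mics)) + 1j * rng.standard_normal((frames, mics))
coeffs = 0.1 * (rng.standard_normal((order, mics, mics))
                + 1j * rng.standard_normal((order, mics, mics)))

x = np.zeros_like(s)
for n in range(frames):
    x[n] = s[n]
    for k in range(order):
        if n - delay - k >= 0:
            x[n] += coeffs[k] @ x[n - delay - k]

s_hat = mar_dereverb(x, coeffs, delay)
```

With exact coefficients and no noise, the subtraction recovers the early signal `s`. The prediction delay keeps the most recent frames out of the predictor, so the direct sound and early reflections are not removed together with the late reverberation.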

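In the last part of the thesis, the three proposed dereverberation systems are compared using objective measures, a listening test, and an automatic speech recognition system. The results show that the proposed algorithms reduce reverberation and noise effectively and can be applied directly in speech communication devices. The theoretical overview and the evaluation show that each dereverberation method has its own strengths and limitations. By treating these algorithms as representatives of their dereverberation classes, the thesis offers insights and conclusions that help in choosing a dereverberation method for a specific application.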

Reverberation is the sum of reflected sound waves and is present in any conventional room. Speech communication devices such as mobile phones in hands-free mode, tablets, smart TVs, teleconferencing systems, hearing aids, voice-controlled systems, etc. use one or more microphones to pick up the desired speech signals. When the microphones are not in the proximity of the desired source, strong reverberation and noise can degrade the signal quality at the microphones and can impair the intelligibility and the performance of automatic speech recognizers. Therefore, it is a highly demanded task to process the microphone signals such that reverberation and noise are reduced. The process of reducing or removing reverberation from recorded signals is called dereverberation. As dereverberation is usually a completely blind problem, where the only available information is the microphone signals, and as the acoustic scenario can be non-stationary, dereverberation is one of the most challenging tasks in speech enhancement. While in theory perfect dereverberation can be achieved by inverse filtering under some conditions and with knowledge of the room impulse response (RIR), in practice the blind identification of the RIR is not sufficiently accurate and robust in time-varying and noisy acoustic conditions. Therefore, successful dereverberation methods have been developed in the time-frequency domain that often relax the problem to partial dereverberation, where mainly the late reverberation tail is reduced. Although in recent years some robust and efficient methods have been proposed that can reduce the late reverberation tail to some extent, it is still challenging to obtain a dereverberated signal with high audio quality, without speech distortion and artifacts, using real-time processing techniques with minimal delay. In this thesis, we focus on robust dereverberation methods for online processing as required in real-time speech communication systems.
To achieve dereverberation, two main aspects can be exploited: temporal and spatial information. Firstly, reverberation introduces correlation over time and extends the duration of phonemes or sound events. By exploiting temporal correlation, filters can be derived to extract the desired speech signal or to reduce the reverberation. Secondly, by using multiple microphones, spatial information can be exploited to distinguish between the coherent direct sound and the reverberation, which has a spatially diffuse property. To extract the coherent sound, spatial filters, also known as beamformers, can be used that combine the microphone signals such that only sound from a certain direction is extracted, whereas sound from other directions and diffuse sound components are suppressed. In this thesis, a variety of signal models is exploited to model reverberation using temporal and spatial aspects. All considered signal models are defined in the short-time Fourier transform (STFT) domain, which is widely used in many speech and audio processing techniques, therefore allowing simple integration with other existing techniques. In particular, we utilize a narrowband moving average model, a narrowband multichannel autoregressive model, and a spatial coherence based model. For each of these three signal models, a method for dereverberation and noise reduction is proposed. The first main contribution is a single-channel estimator of the late reverberation power spectral density (PSD), which is required to compute a Wiener filter reducing reverberation and noise. The proposed reverberation PSD estimator is based on a narrowband moving average model using relative convolutive transfer functions (RCTFs). In contrast to other single-channel reverberation PSD estimators, the proposed estimator explicitly models time-varying acoustic conditions and additive noise, and requires no prior information on the room acoustics like the reverberation time or the direct-to-reverberation ratio (DRR).
The second main contribution is a multichannel reverberation PSD estimator based on the spatial coherence, where the reverberation is modeled as an additive diffuse sound component with a time-invariant spatial coherence. In the multichannel case, the desired signal can be estimated by a multichannel Wiener filter (MWF) that requires the reverberation PSD. To mitigate speech distortion and artifacts, a generalized method to control the attenuation of reverberation and noise at the output of an MWF independently is proposed. As there exists a wide variety of such single- and multichannel reverberation PSD estimators, an extensive overview, comparison and benchmark of state-of-the-art estimators is provided. As a cure for a common weakness of all reverberation PSD estimators, a bias compensation for high DRRs is proposed. The third main contribution is an online solution for dereverberation and noise reduction based on a narrowband multichannel autoregressive (MAR) signal model for time-varying acoustic environments. Using this model, the late reverberation is predicted from previous reverberant speech samples using the MAR coefficients, and is then subtracted from the current reverberant signal. A main novelty of this approach is a parallel estimation structure that makes it possible to obtain causal estimates of time-varying MAR coefficients in noisy environments. In addition, a method to control the amount of reverberation and noise reduction independently is proposed. In the last part of this thesis, the three proposed dereverberation systems are compared using objective measures, a listening test, and an automatic speech recognition system. It is shown that the proposed algorithms efficiently reduce reverberation and noise, and can be directly applied in speech communication devices. The theoretical overview and the evaluation show that each dereverberation method has different strengths and limitations. By considering these algorithms as representatives of their dereverberation class, useful insights and conclusions are provided that can help in choosing a dereverberation method for a specific application.

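1 Introduction
2 STFT domain signal models for dereverberation
3 Spectral and spatial reverberation suppression
4 Single-channel late reverberation PSD estimation
5 Multichannel late reverberation PSD estimation
6 MIMO reverberation cancellation using a multichannel autoregressive model
7 Evaluation and comparison of the proposed dereverberation methods
8 Conclusion and outlook
Appendix A Signal energy ratio definitions for generating simulated signals
Appendix B Performance measures
Appendix C Computation of residual noise and reverberation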

Download the original English text:

http://page3.dfpan.com/fs/1lcj02215291069f5a5/

Reposted from blog.csdn.net/weixin_42825609/article/details/86486930