Spectral Subtraction

Spectral subtraction is a method for reducing noise in audio. The spectral information required to describe the noise spectrum is obtained from the signal measured during non-speech activity, so we need some non-speech frames to estimate the noise spectrum. The basic method is:

$$
D(w) = P_s(w) - P_n(w) \\
P_s'(w) = \begin{cases} D(w) & \text{if } D(w) > 0 \\ 0 & \text{otherwise} \end{cases}
$$

Here, $P_s(w)$ is the spectrum of the noisy speech, $P_n(w)$ is the noise spectrum estimated from the signal measured during non-speech activity, and $P_s'(w)$ is the modified signal spectrum. This basic method has a weakness: when the environmental noise changes, $P_n(w)$ no longer matches the noise spectrum of the new environment.
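The basic rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: plain lists stand in for per-bin power spectra, the function names are hypothetical, and averaging the non-speech frames to obtain $P_n(w)$ is an assumption (the text only says the noise spectrum comes from non-speech activity).

```python
def estimate_noise_spectrum(nonspeech_frames):
    """Assumed estimator: average the non-speech power spectra bin by bin to get P_n(w)."""
    n_frames = len(nonspeech_frames)
    n_bins = len(nonspeech_frames[0])
    return [sum(frame[k] for frame in nonspeech_frames) / n_frames
            for k in range(n_bins)]


def basic_spectral_subtraction(p_s, p_n):
    """Return P_s'(w): D(w) = P_s(w) - P_n(w), with non-positive values set to 0."""
    return [max(ps - pn, 0.0) for ps, pn in zip(p_s, p_n)]


# Toy example with 4 frequency bins (values are made up for illustration).
noise_frames = [[1.0, 2.0, 1.0, 0.5],
                [3.0, 2.0, 1.0, 1.5]]
p_n = estimate_noise_spectrum(noise_frames)    # [2.0, 2.0, 1.0, 1.0]
p_s = [10.0, 1.5, 4.0, 1.0]                    # one noisy speech frame
p_clean = basic_spectral_subtraction(p_s, p_n)
print(p_clean)  # [8.0, 0.0, 3.0, 0.0]
```

Bins where the noise estimate exceeds the noisy spectrum are simply zeroed, which is where the isolated peaks left in the other bins come from.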

A major problem with the above implementation of spectral subtraction is that a "new" noise appears in the processed speech signal. It comes from the narrow spectral peaks that remain after subtraction.

Our modification to the noise subtraction method consists of minimizing the perception of the narrow spectral peaks by decreasing the spectral excursions. This is done by changing the algorithm as follows:

$$
D(w) = P_s(w) - \alpha P_n(w) \\
P_s'(w) = \begin{cases} D(w), & \text{if } D(w) > \beta P_n(w) \\ \beta P_n(w), & \text{otherwise} \end{cases} \\
\text{with} \quad \alpha \geq 1 \quad \text{and} \quad 0 < \beta \ll 1
$$

where $\alpha$ is the subtraction factor and $\beta$ is the spectral floor parameter. The modified method is shown in the following figure.
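The modified rule can be sketched the same way. The function name and the default values of `alpha` and `beta` are illustrative assumptions, and the spectra are again plain lists of per-bin power values.

```python
def oversubtract_with_floor(p_s, p_n, alpha=4.0, beta=0.01):
    """Apply D(w) = P_s(w) - alpha * P_n(w), flooring the result at beta * P_n(w)."""
    if alpha < 1.0 or not (0.0 < beta < 1.0):
        raise ValueError("expected alpha >= 1 and 0 < beta << 1")
    out = []
    for ps, pn in zip(p_s, p_n):
        d = ps - alpha * pn          # over-subtracted spectrum D(w)
        floor = beta * pn            # spectral floor for this bin
        out.append(d if d > floor else floor)
    return out


# Toy example (made-up values): same spectra shape as a 4-bin power spectrum.
p_n = [2.0, 2.0, 1.0, 1.0]
p_s = [10.0, 1.5, 4.0, 1.0]
result = oversubtract_with_floor(p_s, p_n, alpha=4.0, beta=0.01)
print(result)  # [2.0, 0.02, 0.01, 0.01]
```

Unlike the basic rule, no bin is ever set to zero: bins that fall below the floor keep a small fraction of the noise estimate, which reduces the spectral excursions between neighboring bins.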


In practice, we have found that at $SNR = 0\,\text{dB}$, a value of $\alpha$ in the range 3 to 6 is adequate, with $\beta$ in the range 0.005 to 0.1. A large value of $\alpha$, such as 5, should not be alarming. This is equivalent to assuming that the noise power to be subtracted is about 7 dB higher than the smoothed estimate. This “inflation” factor represents the fact that, at each frame, the variance of the spectral components of the noise is equal to the noise power itself. Hence, one must subtract more than the expected value of the noise spectrum (the smoothed estimate) in order to make sure that most of the noise peaks have been removed.
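As a quick check of the 7 dB figure, the subtraction factor is a power ratio, so it converts to decibels as:

$$
10 \log_{10}(\alpha) = 10 \log_{10}(5) \approx 6.99\ \text{dB}
$$

By the same conversion, the suggested range of 3 to 6 corresponds to roughly 4.8 dB to 7.8 dB above the smoothed noise estimate.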

In order to reduce the speech distortion caused by large values of $\alpha$, we decided to let $\alpha$ vary from frame to frame within the same sentence. To understand the rationale behind doing so, consider the graph in the following figure.


The SNR is estimated at each frame from knowledge of the noise spectral estimate and the energy of the input speech. At each frame, the actual value of $\alpha$ used is given by:

$$
\alpha = \alpha_0 - (SNR)/s \\
\text{for} \quad -5 \leq SNR \leq 20
$$

where $\alpha_0$ is the desired value of $\alpha$ at $SNR = 0\,\text{dB}$, SNR is the estimated segmental signal-to-noise ratio in dB, and $1/s$ is the slope of the above line (for example, for $\alpha_0 = 4$, $s = 20/3$, which gives $\alpha = 1$ at $SNR = 20\,\text{dB}$). We found that using a variable subtraction factor reduces the speech distortion somewhat. If the slope ($1/s$) is too large, however, the temporal dynamic range of the speech becomes too large.
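A small sketch of the per-frame subtraction factor follows. Two parts are assumptions rather than something the text specifies: the SNR estimator (ratio of total noisy-frame power to total noise power, in dB) and the clamping of the SNR to the stated range, which holds $\alpha$ constant outside $-5 \leq SNR \leq 20$. The function names are hypothetical.

```python
import math


def estimate_snr_db(p_s, p_n):
    """Assumed frame SNR estimate: total noisy-frame power over total noise power, in dB."""
    return 10.0 * math.log10(sum(p_s) / sum(p_n))


def subtraction_factor(snr_db, alpha0=4.0, s=20.0 / 3.0):
    """alpha = alpha0 - SNR/s, with SNR clamped to [-5, 20] dB (clamping is an assumption)."""
    snr_clamped = min(max(snr_db, -5.0), 20.0)
    return alpha0 - snr_clamped / s


# With alpha0 = 4 and s = 20/3, alpha falls from 4.75 at -5 dB to 1.0 at 20 dB.
for snr in (-5.0, 0.0, 10.0, 20.0):
    print(snr, round(subtraction_factor(snr), 2))
```

A smaller `s` means a steeper slope, so $\alpha$ changes more between loud and quiet frames, which is the case the text warns about.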

