logo资料库

MELP标准PDF.pdf

第1页 / 共35页
第2页 / 共35页
第3页 / 共35页
第4页 / 共35页
第5页 / 共35页
第6页 / 共35页
第7页 / 共35页
第8页 / 共35页
资料共35页,剩余部分请下载后查看
(Draft -- May 28, 1998 -- Draft) Specifications for the Analog to Digital Conversion of Voice by 2,400 Bit/Second Mixed Excitation Linear Prediction 1. INTRODUCTION This standard describes the interoperability requirements relating to the conversion of analog voice to 2,400 bits/s digitized voice by a method known as Mixed Excitation Linear Prediction (MELP) and reconversion back to analog voice. An algorithm description is also included to aid implementa- tion as well as a performance verification process to verify an implementation. 2. CONVENTIONS AND DEFINITIONS 2.1 Frame Size A MELP frame interval is 22.5 ms 0.01 percent in duration and contains 180 voice samples (8,000 samples/s). 2.2 Analog Specification The recommended analog requirements for the MELP coder are for a nominal bandwidth ranging from 100 Hz to 3800 Hz. Although the MELP coder will operate with a more band limited signal, per- formance degradation will result. To ensure proper operation of the MELP coder, the A/D conversion process should produce peak values of (or near) -32768 and 32767. Additionally, the coder should have unity gain, which means that the output speech level should match that of the input speech. 3. ALGORITHM DESCRIPTION 3.1 Coder Overview The Mixed Excitation Linear Prediction coder is based on the traditional Linear Prediction Cod- ing (LPC) parametric model, but also includes five additional features [1][2]. These are: mixed excita- tion, aperiodic pulses, adaptive spectral enhancement, pulse dispersion, and Fourier magnitude modeling. These features are illustrated in the MELP decoder block diagram shown in Figure 1. The mixed excitation is implemented using a multi-band mixing model. This model can simulate frequency-dependent voicing strength using an adaptive filtering structure implemented with a fixed filter bank. The primary effect of this mixed excitation is to reduce the buzz usually associated with LPC vocoders, especially in broadband acoustic noise. When the input speech is voiced, the MELP coder can synthesize using either periodic or aperi- odic pulses. Aperiodic pulses are used most often during transition regions between voiced and unvoiced segments of the speech signal. This feature enables the decoder to reproduce erratic glottal pulses without introducing tonal sounds. The adaptive spectral enhancement filter is based on the poles of the linear prediction synthesis filter. Its use enhances the formant structure of the synthetic speech and improves the match between the synthetic and natural bandpass waveforms. It also gives the synthetic speech a more natural quality. Pulse dispersion is implemented using a fixed filter based on a spectrally-flattened triangle pulse. This filter spreads the excitation energy within a pitch period, reducing some of the harsh quality of the synthetic speech. -1- –
The first ten Fourier magnitudes are determined from the peaks of the Fourier transform of the prediction residual signal. The information in these coefficients improves the accuracy of the speech production model at the perceptually-important lower frequencies. This increases the quality of the synthetic speech, particularly for male speakers and when background noise is present. Pitch & Aperiodic Flag Inverse DFT Shaping Filter Fourier Magnitudes Bandpass Voicing Strengths Noise Generator Shaping Filter Adaptive Spectral Enhancement LPC Synthesis Filter scale Pulse Dispersion Filter Synthesized Speech LSF’s Gain Figure 1. MELP Decoder Block Diagram 3.2 Encoder Input speech is encoded by performing the following steps in the order given. 3.2.1 Low Frequency Removal. The first step in the encoding process is to remove any low fre- quency energy which may be present in the input signal. This is accomplished with a 4th order Cheby- chev type II highpass filter, having a cutoff frequency of 60 Hz and a stopband rejection of 30 dB. The filter output is referred to as the input speech signal throughout the following encoder description. A buffer containing the most recent samples of the input speech signal is maintained in the encoder. One of these samples is designated the last sample in the current frame. The buffer extends beyond this sample into the past and future to contain the samples needed for the encoding process. The last sample in the current frame serves as a reference point for many of the encoder calculations. 3.2.2 cessed with a 1 kHz, 6th order Butterworth lowpass filter. The integer pitch value, Integer Pitch Calculation. For this pitch calculation, the input speech signal is first pro- , is the value of , is maximized. This , for which the normalized autocorrelation function, r t( ) P1 , , = 40 41 … 160 function is defined by: , , -2- t t
where r t( ) = ct 0 t,( --------------------------------------- ct 0 0,( ) )ct ) t,( , ct m n, ( ) = – 2⁄ 79+ k = – 2⁄ 80– sk m+ sk n+ , (1) (2) 2⁄ s0 represents truncation to an integer value. The center of the pitch analysis window is at and in Eq. (2). For the integer pitch calculation, this window is centered on the last sample in sample when its input is the last sample in the cur- the current frame. The lowpass filter output is sample rent frame. The time index in the autocorrelation preserves the pitch analysis window alignment around its center point; the normalization compensates for changing signal amplitudes. The final pitch calculation (Section 3.2.9) extends the pitch range to a lag of 20 samples. s0 k 3.2.3 Bandpass Voicing Analysis. This portion of the encoder determines the five bandpass voic- ing strengths, . It also refines the integer pitch measurement and the correspond- ing normalized autocorrelation value. The bandpass voicing analysis begins by filtering the input speech signal into five frequency bands. These filters are 6th order Butterworth, with passbands of 0- 500, 500-1000, 1000-2000, 2000-3000, and 3000-4000 Hz. , 1 2 … 5 , Vbpi = , , i A refined pitch measurement is made using the 0-500 Hz filter output signal. This measurement is centered on the filter output produced when its input is the last sample in the current frame. Two pitch candidates are considered in this refinement, namely the integer pitch values from the cur- rent and previous frames. For each candidate, Eq. (1) is used to perform an integer pitch search over lags from 5 samples shorter to 5 samples longer than the candidate, and a fractional pitch refinement (Section 3.2.4) is performed around the optimum integer pitch lag. This produces two fractional pitch candidates and their corresponding normalized autocorrelation values. The candidate having the higher normalized autocorrelation is selected as the fractional pitch, . The corresponding normal- ized autocorrelation, is saved for use in Vbp1 P2 determining the voicing strength for the remaining frequency bands. It is also used in the final pitch calculation (Section 3.2.9) and gain calculation (Section 3.2.11). P2 , is saved as the lowest band voicing strength, r P2( P1 ( ) ) . r P2( ) For each remaining band, the bandpass voicing strength is the larger of as determined by the fractional pitch procedure for the bandpass signal and the time envelope of the bandpass signal, where for the time envelope is first decremented by 0.1 to compensate for an experimentally observed bias (due to the smoothness of the time envelope signals). The envelopes are calculated by full-wave rectification followed by a smoothing filter. This filter consists of a zero at DC in cascade with a complex pole pair at 150 Hz with a radius of 0.97. For each calculation of , the analysis window is centered on the last sample in the current frame, as was the case for the first band. r P2( r P2( ) ) 3.2.4 Fractional Pitch Refinement. This procedure, which is used at several places in the encod- ing process, utilizes an interpolation formula to increase the accuracy of an input pitch value. This value is first rounded to the nearest integer. Assume that this integer has a value of T samples. The interpolation formula presumes that has a maximum between lags of T and T+1. Hence, are computed and compared to determine if the maximum is more likely cT 0 T 1– to fall between T and T+1 or between T-1 and T. If , then the maximum proba- bly falls between T-1 and T and the pitch, T, is decremented by one prior to interpolation. The frac- tional offset, cT 0 T 1– cT 0 T cT 0 T and r t( ) ,( ,( ,( ,( 1+ 1+ > ) ) ) ) , is then computed by the interpolation equation: cT 0 T,( ,( – [ cT 0 T,( ) cT T ( ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ] cT 0 T ) )cT T T,( ) ] + ) )cT T T 1+ 1+ ,( cT T T ) cT T T,( [ cT T T cT 0 T 1+ , T ,( – ) ) 1+ ,( ,( 1+ 1+ 1+ – ) = , (3) -3- t t t t D D
cT m n, ( ) where of 0.0 to 1.0, so the offset is clamped between -1 and 2. The fractional pitch is between 20 and 160. is defined by Eq. (2). In some cases, this formula produces an offset outside the range and is clamped D+ T The normalized autocorrelation at the fractional pitch value is given by: D+( r T ) = --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ] cT 0 0,( ) D 2cT T ,+( D–( [ ) 1 1 T 1+ 1+ + ) ) D–( 1 )2cT T T,( + )cT 0 T,( ) 2D D–( + 1 ,( D cT 0 T )cT T T ,( 1+ ) . (4) The fractional pitch refinement procedure is based on work presented in [3]. Equations (3) and (4) produce the fractional offset and corresponding normalized autocorrelation which would be obtained if the input signal had been linearly interpolated to obtain values between the actual sampling times. 0.5< Vbp1 and set to 0 otherwise. The 3.2.5 Aperiodic Flag. The aperiodic flag is set to 1 if Vbp1 value determined by bandpass voicing analysis (Section 3.2.3) is used for this comparison. When set, this flag tells the decoder that the pulse component of the excitation should be aperiodic, rather than periodic. Section 3.3.1 describes the use of the aperiodic flag. 3.2.6 Linear Prediction Analysis. A 10th order linear prediction analysis is performed on the input speech signal using a 200 sample (25 ms) Hamming window centered on the last sample in the current frame. The traditional autocorrelation analysis procedure is implemented using the Levinson-Durbin recursion. In addition, a bandwidth expansion coefficient of 0.994 (15 Hz) is applied to the prediction coefficients, , where each coefficient is multiplied by 0.994i. 1 2 … 10 , = , , , i ai 3.2.7 Linear Prediction Residual Calculation. The linear prediction residual signal is calcu- lated by filtering the input speech signal with the prediction filter whose coefficients were determined by linear prediction analysis (Section 3.2.6). The residual window is centered on the last sample in the current frame, and is made wide enough for use by the final pitch calculation (Section 3.2.9). 3.2.8 Peakiness Calculation. The peakiness of the residual signal is calculated over a 160 sam- ple window centered on the last sample in the current frame. The peakiness value is the ratio of the L2 norm to the L1 norm of the residual signal, , in the window: rn peakiness = 160 1 --------- 160 2 rn ------------------------------ 1= n 1 --------- 160 160 n 1= rn . (5) If the peakiness exceeds 1.34, then the lowest band voicing strength, peakiness exceeds 1.6, then the lowest three band voicing strengths, 1.0. This is the only use of the peakiness measure. Vbpi , is forced to 1.0. If the Vbp1 , , are all forced to = , 1 2 3 , i 3.2.9 Final Pitch Calculation. The final pitch measurement uses the lowpass filtered residual signal, where the filter is a 6th order Butterworth, with a 1 kHz cutoff. Eq. (1) is used to perform an integer pitch search over lags from 5 samples shorter to 5 samples longer than , rounded to the nearest integer. This measurement is centered on the filter output produced when its input is the last residual sample in the current frame. A fractional pitch refinement (Section 3.2.4) is then made around the optimum integer pitch lag. This produces tentative values for the final pitch, , and for the corresponding normalized autocorrelation, P2 P3 . r P3( ) If 0.6 r P3( ) residual, using Dth otherwise. The doubling check procedure may produce new values for , the pitch doubling check procedure (Section 3.2.10) is performed on the filtered 0.5 P3 as the candidate pitch, and doubling threshold 0.75 and Dth , or 100 = . ) = P3 if P3 r P3( -4- ‡ £
The else action for the preceding if is as follows. A fractional pitch refinement around is per- formed using the input speech signal. This measurement is centered on the last sample in the current frame and produces new values for , the long-term average pitch (Section 3.2.12). Otherwise, the pitch doubling check procedure is performed on the input speech signal, using if 0.9 and P3 r P3( Dth otherwise. The doubling check procedure may produce new values for as the candidate pitch, and doubling threshold 100 . ) Finally, if The following pseudo code shows the final pitch algorithm: is replaced by is replaced by = P3 , then , then Pavg Pavg and r P3( r P3( r P3( Dth 0.55 0.55 , or ) . If = 0.7 < ) < ) P2 P3 P3 P3 P3 . inputs: the input speech signal; the residual signal; P2; Pavg outputs: P3, cor_P3 fresid buffer = filter the residual with a 1 kHz Butterworth P3 = best integer pitch on fresid over the range P2-5 to P2+5 P3, cor_P3 = frac_pitch(fresid, P3) if (cor_P3 >= 0.6) Dth = 0.5 if (P3 <= 100) Dth = 0.75 P3, cor_P3 = double_ck(fresid, P3, Dth) else P3, cor_P3 = frac_pitch(input, P2) if (cor_P3 < 0.55) P3 = Pavg else Dth = 0.7 if (P3 <= 100) Dth = 0.9 P3, cor_P3 = double_ck(input, P3, Dth) endif endif if (cor_P3 < 0.55) P3 = Pavg 3.2.10 Pitch Doubling Check. The pitch doubling check procedure looks for and corrects pitch val- ues which are multiples of the actual pitch. This procedure takes a signal, a candidate pitch , and a doubling threshold . r Pc( ) All fractional pitch calculations are made using the signal given to the doubling check procedure. P , and the corresponding correlation, , and returns the checked pitch Dth Pc This procedure begins with a fractional pitch refinement around . Next, the largest value of r Pc( ) 8 7 … 2 , , = k ; and 2) a double verification, if , producing . This produces tentative val- , where is calculated in two steps: 1) a fractional pitch refinement . If such a is found, then a frac- is found for which > ) Dthr Pc( ) and and r Pc k⁄ ( r Pc k⁄ ( ues for Pc k⁄ ( ) around tional pitch refinement around Pc 20 Pc k⁄ , Pk is performed, producing new values for k and 30< Pk P k ) . Pc r Pc( ) . Pk Finally, if The following pseudo code shows the pitch double check procedure: is less than 30 samples, then double verification is performed. Pc inputs: signal; P; Dth outputs: Pc, cor_Pc Pc, cor_Pc = frac_pitch(signal, P) for (k=8; k>=2; k--) Pk = Pc/k if (Pk >= 20) Pk, cor_Pk = frac_pitch(signal, Pk) if (Pk < 30) cor_Pk = double_ver(Pk, cor_Pk) if (cor_Pk > Dth * cor_Pc) Pc, cor_Pc = frac_pitch(signal, Pk) break endif endif endfor if (Pc < 30) cor_Pc = double_ver(Pc, cor_Pc) -5- £ ‡
For inputs P and r P( ) , the double verification procedure returns the smaller of r 2P( where in the double check procedure provides robustness against spurious short pitch values. is determined by the fractional pitch procedure around 2P ) , ) . The use of double verification and r 2P( r P( ) Vbp1 0.6> , the window length is the shortest multiple of 3.2.11 Gain Calculation. The input speech signal gain is measured twice per frame using a pitch- adaptive window length. This length is identical for both gain measurements and is determined as which is longer than 120 follows. When samples. If this length exceeds 320 samples, it is divided by 2. When , the window length is 0.6 and is centered 90 samples before 120 samples. The gain calculation for the first window produces the last sample in the current frame. The calculation for the second window produces and is cen- tered on the last sample in the current frame. The gain is the RMS value, measured in dB, of the sig- nal in the window, P2 Vbp1 G1 G2 : sn Gi = 10log10 0.01 + 1 --- L L n 1= 2 sn , (6) L is the window length. The 0.01 term prevents the log argument from going too close to zero. where If a gain measurement is less than 0.0, it is clamped to 0.0. The gain measurement assumes that the input signal range is -32768 to 32767 (Section 2.2). 3.2.12 Average Pitch Update. The long-term average pitch, smoothing procedure. If three most recent strong pitch values, are moved toward a default pitch, > 30 dB, then G2 , = i 50= , 1 2 3 samples, according to: , is updated with a simple is placed into a buffer containing the . Otherwise, all three pitch values in the buffer Pavg r P3( P3 ) , + 0.05Pdefault , i = , 1 2 3 , . (7) > 0.8 and pi Pdefault 0.95 pi = pi The average pitch is then updated as the median of the three values in the buffer. final pitch calculation (Section 3.2.9). Pavg is used in the , , = 1 2 … 10 , 3.2.13 Quantization of Prediction Coefficients. First, the linear prediction coefficients , ai , are converted into line spectrum frequencies (LSF’s). Details of the conversion algo- i rithm can be found in [4]. Next, a process which forces the LSF components to be in ascending order with a minimum separation of 50 Hz is performed. This process begins by checking all adjacent pairs of the LSF components and swapping any pair not in ascending order. This step is repeated as many as ten times, if necessary. The minimum separation criterion is then applied by correcting each pair, , as shown in the following pseudo code. f i The LSF components and frequency-related constants are in Hertz; scaling in other implementations may differ. The minimum separation process is repeated ten times. is less than 50 Hz, , for which and f i– min f i f i 1+ 1+ = d dmin = 50 for (i=1; i<10; i++) d = f[i+1] - f[i] if (d < dmin) s1 = s2 = (dmin-d)/2 if (i == 1 and f[i] < dmin) s1 = f[i]/2 else if (i > 1) tmp = f[i] - f[i-1] if (tmp < dmin) s1 = 0 else if (tmp < 2*dmin) s1 = (tmp-dmin)/2 endif if (i == 9 and f[i+1] > 4000-dmin) s2 = (4000-f[i+1])/2 else if (i < 9) tmp = f[i+2] - f[i+1] if (tmp < dmin) s2 = 0 else if (tmp < 2*dmin) s2 = (tmp-dmin)/2 endif -6- £ Ł ł D
f[i] = f[i] - s1 f[i+1] = f[i+1] + s2 endif endfor The resulting LSF vector, f , is then quantized using a multi-stage vector quantizer (MSVQ). The MSVQ codebook consists of four stages of 128, 64, 64, and 64 levels respectively. The quantized vector, fˆ , is the sum of the vectors selected by the search process, with one vector selected from each stage. The MSVQ search finds the codebook vector which minimizes the square of the weighted Euclidean distance, , between the unquantized and quantized LSF vectors: d2 d2 f fˆ,( ) 10= i 1= ( wi f i fˆ i– )2 , where )0.3 P f i( , 1 )0.3 i, )0.3 i, i 8 9= , 10= wi = 0.64P f i( 0.16P f i( (8) (9) is the ith component of the unquantized LSF vector, and is the inverse prediction filter power f i spectrum evaluated at frequency . The search procedure is an M-best approximation to a full search, in which the M=8 best code vectors from each stage are saved for use with the next stage; ref- erence [5] has additional details. The process to ensure ascending order and minimum separation (described in the first part of this section) is then applied to the quantized LSF vector. The resulting vector is used in the Fourier magnitude calculation (Section 3.2.17). P f i( ) f i 3.2.14 Pitch Quantization. The final pitch value, , is quantized on a logarithmic scale with a 99-level uniform quantizer ranging from 20 to 160 samples. These pitch values are then mapped to a 7-bit codeword using a look-up table, as shown in Section 4.1.1. The all-zero codeword represents the unvoiced state, and is sent if . All 28 codewords with Hamming weight of 1 or 2 are reserved for error protection. The uniform quantizer details are described in Section 4.1.7. Vbp1 P3 0.6 G2 G1 for the current frame is within 5 dB of 3.2.15 Gain Quantization. The two gain values are quantized as follows. is quantized with a 5- is quantized to 3 bits using the following adaptive bit uniform quantizer ranging from 10 to 77 dB. is within 3 algorithm. If values for the current and previous frames, then the frame is steady-state dB of the average of the and a special code (all zero) is sent to indicate that the decoder should set G2 values for the current and previous frames. Otherwise, the frame represents a transition and is values quantized with a 7-level uniform quantizer ranging from 6 dB below the minimum of the G2 values. The quantizer for the current and previous frames to 6 dB above the maximum of those range is clamped to 10 and 77 dB. The uniform quantizer details are described in Section 4.1.7. Pseudo code for the adaptive quantization of to the mean of the G1 for the previous frame, and is shown below. G2 G1 G1 G2 G2 G2 G1 if (|G2 - G2p| < 5.0 and |G1 - 0.5 *(G2 + G2p)| < 3.0) quantizer_index = 0 else gain_max = max(G2p, G2) + 6.0 gain_min = min(G2p, G2) - 6.0 if (gain_min < 10.0) gain_min = 10.0 if (gain_max > 77.0) gain_max = 77.0 quantizer_index values 1 to 7 are determined by quantizing G1 with a 7-level, uniform quantizer ranging from gain_min to gain_max endif -7- £ £ £
3.2.16 Bandpass Voicing Quantization. When (unvoiced), the remaining voicing strengths, , the remaining voicing strengths are quantized to 1 if their value exceeds 0.6, and quantized to 0 otherwise. There is one exception. If the quantized values of are 0001, respectively, then , are quantized to 0. When is quantized to 0. 0.6 0.6> , 2 3 4 5 , 2 3 4 5 Vbp1 Vbp1 Vb pi = = , i , i , , , , Vb pi Vbp5 3.2.17 Fourier Magnitude Calculation and Quantization. This analysis measures the Fourier magnitudes of the first 10 pitch harmonics of the prediction residual generated by the quantized pre- diction coefficients. It uses a 512-point Fast Fourier Transform (FFT) of a 200 sample window cen- tered at the end of the frame. First, a set of quantized predictor coefficients is calculated from the quantized LSF vector (Section 3.2.13). Then the residual window is generated using the quantized prediction coefficients. Next, a 200 sample Hamming window is applied, the signal is zero-padded to 512 points, and the complex FFT is performed. Finally, the complex FFT output is transformed into magnitudes, and the harmonics are found with a spectral peak-picking algorithm. The peak-picker finds the maximum within a width of ˆ P3 ˆ⁄ frequency samples centered 512 P3 is the quantized pitch. This width is around the initial estimate for each pitch harmonic, where ˆ⁄ truncated to an integer. The initial estimate for the location of the ith harmonic is . The num- 512i P3 4⁄ . These magnitudes ber of harmonic magnitudes searched for is limited to the smaller of 10 or are then normalized to have an RMS value of 1.0. If fewer than 10 harmonics are found, the remain- ing magnitudes are set to 1.0. ˆ P3 The 10 magnitudes are quantized with an 8-bit vector quantizer. The codebook is searched using a perceptually weighted Euclidean distance, with fixed weights that emphasize low frequencies over higher frequencies. The weights are given by: wi = 117 --------------------------------------------------------------------- 0.69 25 75 1 2 1.4 + + f i ------------ 1000 2 , i = 1 2 … 10 , , , , (10) = f i 8000i 60⁄ is the frequency in Hz corresponding to the ith harmonic for a default pitch where period of 60 samples. The weights are applied to the squared difference between the input Fourier magnitudes and the codebook values. 3.2.18 Error Protection and Bit Packing. The table in Section 4.3.2 shows the bit allocation for the MELP coder. To improve performance in channel errors, the unused coder parameters for the unvoiced mode are replaced with forward error correction. Three Hamming (7,4) codes and one Ham- ming (8,4) code are used. The (7,4) code corrects single bit-errors, while the (8,4) code in addition detects double bit-errors. The (8,4) code is applied to the 4 most significant bits (MSB’s) of the first MSVQ index, and the 4 parity bits are written over the bandpass voicing. The remaining 3 bits of the first MSVQ index along with a reserved bit (set to zero), are covered by a (7,4) code with the resulting 3 parity bits written to the MSB’s of the Fourier series VQ index. The 4 MSB’s of the codeword are protected with 3 parity bits which are written to the next 3 bits of the Fourier magnitudes. Finally, the LSB of the second gain index and the 3 bit codeword are protected with 3 parity bits written to the 2 LSBs of the Fourier magnitudes and the aperiodic flag. G1 G2 The bit transmission order is given in Section 4.3.3. -8- £ Ł ł Ł ł
分享到:
收藏