1 Introduction

Speech coding algorithms exploit the redundancy in the speech signal, together with components to which the ear is insensitive, to obtain high-quality reconstructed speech at low bit rates, and compression coding has always been a key technology in communications. Speech researchers have long sought ways to minimize the coding bit rate while avoiding significant degradation of speech quality. Low bit rate speech coding systems (bit rates below 4.8 kb/s) in particular have attracted attention because of their wide range of applications.
The performance of a speech coder is usually measured in terms of bit rate, delay, complexity, and quality, and all of these attributes should be considered when analyzing a coder. They are not independent but closely related: low bit rate coders generally have longer delay, higher algorithmic complexity, and lower speech quality than higher bit rate coders. When choosing a coding algorithm, a trade-off among these attributes must therefore be made according to the actual application environment.
The formant parameter coding algorithm is being applied more and more widely in low-rate audio coding. Compared with compression algorithms based on the time-domain waveform, it only needs to transmit the fundamental frequency and formant parameters used to reconstruct the signal, so the transmission rate can be greatly reduced and multimedia communication at low bit rates becomes feasible. Moreover, an algorithm based on formant parameters does not impose strict constraints on the structure of the signal and can describe the characteristics of the audio signal flexibly. This flexibility allows a formant-based algorithm to satisfy the need for convenient access to and manipulation of the audio signal.
2 Fundamental frequency and formant extraction
Accurate extraction of the fundamental frequency and formant parameters plays a crucial role in the quality of the formant coding algorithm. In this paper an improved double Fourier transform algorithm is used for speech parameter extraction. The speech spectrum required by the analysis algorithm is obtained with the SA-0505 speech spectrum analyzer, whose highest resolution is a frequency resolution of 5 Hz and a time resolution of 5 ms. The analysis yields the amplitude of each frequency component and contains no phase information. Since the phase information of a speech signal does not affect intelligibility, further work on this basis remains meaningful.
In the actual parameter extraction process, the speech signal is first analyzed with the speech spectrum analyzer to obtain its time-frequency spectrum, as shown in Figure 1.
A Fourier transform is then performed on the spectrum sequence at each moment; the Fourier transform of the spectrum sequence at one such moment is shown in Figure 2.
As can be seen from Fig. 2, because actual speech is only quasi-periodic and the frequency analysis is performed on a short-time signal, the spectrum sequence is not a sampling of a periodic impulse train but of an approximately triangular pulse train, so the amplitude spectrum of its Fourier transform exhibits high-frequency attenuation. As can be observed from Fig. 3, the amplitude spectrum of the spectrum sequence is the product of a periodic signal and a high-frequency-attenuated signal. In actual speech analysis, the attenuation of the spectrum sequence differs greatly from moment to moment, and in the low-frequency part the amplitude of a side pulse sometimes exceeds the amplitude of the main pulse of the next period, which interferes with estimating the period of the signal and prevents an accurate estimate of the fundamental frequency. Therefore, when determining the fundamental frequency, this paper exploits the fact that the attenuation differences are small in the high-frequency part, analyzes its periodic characteristics, and uses them to calculate the fundamental frequency of the speech.
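As an illustration only (the paper does not give its implementation of the double Fourier transform), the following Python sketch estimates the fundamental frequency from one short-time magnitude spectrum by taking a second Fourier transform of its high-frequency portion, where, as noted above, the attenuation varies little from peak to peak. The band limits, search range, and function name are assumptions.

import numpy as np

def estimate_f0_from_spectrum(mag_spectrum, freq_step_hz,
                              hi_band=(2000.0, 5000.0),
                              f0_range=(50.0, 500.0)):
    """Estimate F0 from one short-time magnitude spectrum by taking a second
    Fourier transform of its high-frequency band (an assumed band) and locating
    the dominant periodicity, as described in the text."""
    freqs = np.arange(len(mag_spectrum)) * freq_step_hz
    band = (freqs >= hi_band[0]) & (freqs <= hi_band[1])
    seq = mag_spectrum[band] - np.mean(mag_spectrum[band])   # remove the mean level

    # Second Fourier transform: the "quefrency" axis is in cycles per Hz, i.e. seconds.
    spec2 = np.abs(np.fft.rfft(seq))
    quef = np.fft.rfftfreq(len(seq), d=freq_step_hz)

    # Search only where the implied F0 = 1 / quefrency lies in the expected range.
    valid = (quef > 1.0 / f0_range[1]) & (quef < 1.0 / f0_range[0])
    if not np.any(valid):
        return None
    peak = np.argmax(np.where(valid, spec2, 0.0))
    return 1.0 / quef[peak]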
The formant parameters comprise the formant frequency, bandwidth, and amplitude, and the formant information is contained in the envelope of the speech spectrum. The key to formant extraction is therefore to estimate the spectral envelope and to regard the maxima of the envelope as formants. The envelope of the speech spectrum can be obtained by inverse-transforming the low-frequency portion of the Fourier transform of the speech spectrum. The first to fourth formants are then determined according to the peak energies of the spectral envelope, as shown in the accompanying figure.
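A minimal sketch of the envelope estimation and formant picking described above, assuming NumPy and SciPy are available; the number of retained low-frequency coefficients (cutoff) and the helper name are illustrative choices, not values from the paper.

import numpy as np
from scipy.signal import find_peaks

def spectral_envelope_and_formants(mag_spectrum, freq_step_hz,
                                   cutoff=20, n_formants=4):
    """Smooth the magnitude spectrum by keeping only the low-frequency part of its
    Fourier transform (discarding the fast harmonic ripple), then pick the largest
    envelope peaks as formant candidates."""
    spec2 = np.fft.rfft(mag_spectrum)
    spec2[cutoff:] = 0.0                                   # assumed smoothing cutoff
    envelope = np.fft.irfft(spec2, n=len(mag_spectrum))

    peaks, _ = find_peaks(envelope)                        # local maxima = candidates
    strongest = sorted(peaks, key=lambda k: envelope[k], reverse=True)[:n_formants]
    strongest.sort()                                       # order by frequency
    return envelope, [(k * freq_step_hz, envelope[k]) for k in strongest]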
To test the accuracy of the extracted parameters, the results could be compared with a manual analysis in both the time domain and the frequency domain. Such a method can quantify the accuracy of the extraction algorithm, but the workload makes it impractical. Since the fundamental frequency and the formants are the main features used to distinguish speech signals, the performance of the parameter extraction algorithm can instead be assessed by judging the quality of speech reconstructed from these two sets of parameters. The speech is reconstructed by harmonic synthesis: the envelope of the speech spectrum is first built from the formant information, then the amplitudes of the fundamental and its harmonics are read off the spectral envelope and the speech signal is synthesized. In this paper the extracted parameters are used to regenerate the speech, the quality of the synthesized speech is judged subjectively, and the accuracy of the parameter extraction algorithm is evaluated on that basis. Over a short interval the speech signal can be regarded as stationary, so the speech spectrum of each frame can be simplified to a set of discrete components spaced at the fundamental frequency. The speech signal is synthesized from this discrete spectrum using equations (1) and (2):
where V(t) is the synthesized speech signal and fp is the fundamental frequency; to avoid spikes in the synthesized waveform, the phase function φn(ω) in equation (2) is chosen accordingly.
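Since the exact form of equations (1) and (2) is not reproduced above, the following Python sketch only illustrates the harmonic synthesis as described in the text: harmonic amplitudes are read off the spectral envelope and summed as sinusoids, with a fixed random phase per harmonic standing in for the phase function φn(ω). The function name, frame length, and phase model are assumptions.

import numpy as np

def harmonic_synthesis(f0_track, envelope_track, freq_step_hz,
                       frame_len_s=0.025, fs=8000):
    """Per frame, sample the spectral envelope at the fundamental and its harmonics
    and sum sinusoids at those frequencies.  The random phases are an assumed
    stand-in for the phase function of equation (2)."""
    rng = np.random.default_rng(0)
    n = int(frame_len_s * fs)
    t = np.arange(n) / fs
    frames = []
    for f0, env in zip(f0_track, envelope_track):
        n_harm = int((fs / 2) // f0)                       # harmonics below Nyquist
        phases = rng.uniform(0, 2 * np.pi, n_harm)         # assumed phase model
        frame = np.zeros(n)
        for k in range(1, n_harm + 1):
            fk = k * f0
            amp = env[min(int(round(fk / freq_step_hz)), len(env) - 1)]
            frame += amp * np.cos(2 * np.pi * fk * t + phases[k - 1])
        frames.append(frame)
    return np.concatenate(frames)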
The speech signals synthesized from the discrete spectra determined by three methods are compared:
(1) discretizing the original speech spectrum directly;
(2) discretizing the extracted spectral envelope;
(3) discretizing the spectral envelope reconstructed from the formant parameters.
The specific scheme of method (3) is as follows: since, among the formant parameters, the human ear is sensitive to the center frequency but not to the amplitude and bandwidth, this paper synthesizes the speech using only two pieces of formant information, the center frequency and the maximum amplitude.
Since the formant bandwidth of adult speech is approximately 300 Hz, the bandwidth of each formant is uniformly set to 300 Hz. When the speech spectrum envelope is rebuilt, a gate (rectangular) signal 300 Hz wide is placed at each formant, with the formant center frequency as its midpoint and the maximum energy as its amplitude; the amplitudes of the harmonics of the fundamental are then determined from this newly generated envelope. The speech synthesized from the first spectrum shows only a slight change in sound quality: every syllable can be clearly distinguished, and the intonation, tone, and the speaker's voice quality are completely preserved. This shows that the algorithm extracts the fundamental frequency accurately and that this synthesis method can produce high-quality speech. In the speech synthesized from the second spectrum the speaker's voice quality becomes somewhat unclear, while the other aspects are the same as in the first. In the speech synthesized from the third spectrum the speaker's voice quality is filtered out completely and individual syllables are somewhat blurred, but the tone and intonation information is fully preserved.
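A minimal sketch of the gate-function envelope reconstruction described above, assuming each formant is given as a (center frequency, amplitude) pair; the function name and the handling of overlapping gates are assumptions.

import numpy as np

def rebuild_envelope_from_formants(formants, n_bins, freq_step_hz,
                                   bandwidth_hz=300.0):
    """Represent each formant as a rectangular gate 300 Hz wide centred on its
    frequency, with the formant's maximum energy as its height."""
    envelope = np.zeros(n_bins)
    half = bandwidth_hz / 2.0
    freqs = np.arange(n_bins) * freq_step_hz
    for fc, amp in formants:
        gate = (freqs >= fc - half) & (freqs <= fc + half)
        envelope[gate] = np.maximum(envelope[gate], amp)   # where gates overlap, keep the larger
    return envelope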
3 Formant speech coding
The formant coding algorithm requires two kinds of parameters, the fundamental frequency and the formants. Experiments show that with the fundamental frequency and formant information not only the vowel and voiced consonant parts of the speech can be reconstructed, but also the unvoiced consonant part. First, the fundamental frequency determined by the parameter extraction algorithm is unstable in the unvoiced segments, so the signal reconstructed from these unstable parameters jumps, and this hopping signal has a spectrum similar to that of an unvoiced consonant. A more important reason is that the human ear relies mainly on the formant transitions as the cue for unvoiced consonants, so the unvoiced part can be reconstructed as long as the formants are provided accurately. According to studies of speech synthesis, the most important representation of a voiced signal is its first three formants: a formant model using only the first three time-varying formant frequencies can produce intelligible synthesized voiced sound. Considering that a pseudo formant may occur under special circumstances, the algorithm retains four formants, selected by formant amplitude, when determining the coding parameters.
3.1 Parameter quantization
The two main indicators of a speech coding algorithm are bit rate and speech quality. A low-rate speech coding algorithm should reduce the bit rate as much as possible while keeping the speech intelligible. To determine the maximum degree to which each parameter can be quantized, we re-synthesize the speech after quantizing each parameter to different degrees and evaluate the resulting speech quality.
The fundamental frequency of normal speech varies from 50 to 500 Hz. Quantization experiments show that the reconstructed speech is still clear when the fundamental frequency is quantized with 20 Hz precision, so the fundamental frequency could be encoded with as few as 5 b; to improve error resistance it is encoded with 8 b. Formant quantization is divided into frequency quantization and amplitude quantization. According to the parameter extraction algorithm, the formant curve is the envelope of the fundamental and its harmonics; the speech spectrum can be regarded as the formant curve sampled at the fundamental and its harmonics, so the fundamental frequency can be used as the precision with which the formant curve is described. The center frequency of a formant can therefore be expressed as a harmonic number of the fundamental, giving a range of 1 to 32, which is encoded with 5 b. The human ear is not sensitive to formant amplitude: with speech recorded at 16-bit precision in the time domain and signal amplitudes ranging from 2^10 to 2^15, experiments show that the speech remains clear when the amplitude is encoded with 3 b. Each formant can therefore be quantized with 8 b, of which 5 b represent the center frequency and 3 b the amplitude.
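A hedged sketch of this bit allocation in Python, packing one frame into five bytes: 8 b for the fundamental in 20 Hz steps, and for each of the four formants 5 b for the harmonic number of the center frequency plus 3 b for the amplitude. The packing order and the amplitude scaling are assumptions.

def quantize_frame(f0_hz, formants, max_amp):
    """formants is a list of (center_frequency_hz, amplitude) pairs; returns
    5 bytes per frame under the assumed packing order."""
    f0_code = min(int(round(f0_hz / 20.0)), 255)                 # 8 b fundamental, 20 Hz steps
    formant_codes = []
    for fc_hz, amp in formants[:4]:
        harmonic = min(max(int(round(fc_hz / f0_hz)), 1), 32)    # harmonic number 1..32 -> 5 b
        amp_code = min(int(amp / max_amp * 7), 7)                # coarse amplitude 0..7 -> 3 b
        formant_codes.append(((harmonic - 1) << 3) | amp_code)   # one byte per formant
    return bytes([f0_code] + formant_codes)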
3.2 Encoding rules
The speech frame period used in encoding can be either dynamic or fixed. In the dynamic form the length of each frame is determined by the fundamental frequency, i.e. each frame is one pitch period; this gives the best intelligibility on decoding, but the short frames lead to a high coding rate. In the fixed form the frame period is constant and can be set between 10 and 40 ms depending on the application; the longer the period, the lower the sound quality and the higher the compression ratio. In this algorithm the frame period is fixed at 25 ms. Whether a frame contains speech is judged from its spectral energy; a silent frame is encoded as a single zero byte. Using a full byte rather than a single bit for the silence frame improves the algorithm's error resistance.
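For orientation, a back-of-the-envelope rate calculation under the frame structure and bit allocation above; the silence ratio used here is an assumption, not a figure from the paper.

FRAME_PERIOD_S = 0.025                 # fixed 25 ms frames -> 40 frames per second
SPEECH_FRAME_BITS = 8 + 4 * 8          # 8 b fundamental + four 8 b formants = 40 b
SILENCE_FRAME_BITS = 8                 # one zero byte

frames_per_s = 1.0 / FRAME_PERIOD_S
peak_rate = frames_per_s * SPEECH_FRAME_BITS          # 1600 b/s during active speech

# With roughly 15 % of frames silent (assumed), the average approaches the
# ~1400 b/s reported in section 3.3.
avg_rate = frames_per_s * (0.85 * SPEECH_FRAME_BITS + 0.15 * SILENCE_FRAME_BITS)
print(peak_rate, avg_rate)             # 1600.0 1408.0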
3.3 Results
This algorithm was used to encode and decode speech material read at a normal speaking rate. After decoding the speech intelligibility is good, and the average bit rate is 1 400 b/s.
4 Conclusion
In theory, as long as accurate fundamental frequency and formant parameters are available, all features of the original speech signal except the speaker's voice quality can be recovered. The only parameters used in this algorithm are the fundamental frequency and four formants; for a speech signal these are the characteristic parameters that distinguish the speech information. When the encoded information contains only these parameters, each frame can be considered to carry no redundant information, i.e. maximum compression has been achieved within a frame. If the compression ratio is to be improved further on the basis of this algorithm, only the correlation between frames can be exploited, for example with a vector quantization algorithm.
The encoding algorithm has short delay and low complexity and can be used for real-time speech transmission; it performs well on the three evaluation indexes of bit rate, delay, and complexity. The decoded speech has a slight mechanical quality and some individual syllables are blurred. Two factors lead to the degraded speech quality. The first is parameter quantization error: according to the experimental analysis this is mainly the formant quantization error, so the quantization code should be chosen as a trade-off between sound quality and coding rate according to the actual requirements. The second is the speech reconstruction algorithm: in this paper the amplitude-frequency characteristic of a formant is represented simply by a gate function. If the reconstruction algorithm is improved on the basis of a study of the true amplitude-frequency characteristics of formants, the decoded speech quality will improve.