A low-power hardware decoding scheme for JPEG images
To support real-time data processing in low-power applications, this paper proposes a parallel, fully pipelined JPEG decoder implementation with a clock management mechanism.
China is currently preparing to build out the Internet of Things, which poses challenges for sensor technology: the bursts of mass data generated by digital image sensors strain the storage capacity, transmission bandwidth, and power budget of real-time communication systems. In fields such as medical imaging and remote-sensing image communication, which require high image restoration quality, the demand for image encoders and decoders that combine low power consumption, good compression and decompression performance, and real-time processing is increasingly urgent. The JPEG still-image compression standard offers excellent compression and decompression performance, needs little storage, and has relatively low complexity, which makes it well suited to hardware implementation.
1 JPEG decoding algorithm
JPEG (Joint Photographic Experts Group) is a widely used still-image compression standard. JPEG compression is lossy: it exploits the characteristics of the human visual system and combines quantization with lossless compression coding to remove both visually redundant information and redundancy in the data itself. The JPEG decoder consists of Huffman decoding, inverse quantization (IQ), and the inverse discrete cosine transform (IDCT). Decoding is performed block by block. The image is divided into 8 × 8 data blocks (MCUs), each corresponding to an 8 × 8 pixel array of the original image. Blocks within a row are processed from left to right, and rows of blocks are processed from top to bottom [1].
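The block-by-block traversal can be illustrated with a minimal sketch. The function name `block_origins` is hypothetical, and the sketch assumes one 8 × 8 block per step with partial edge blocks rounded up; it shows only the visiting order, not the decoder itself.

```python
def block_origins(width, height, block=8):
    """Yield the (x, y) top-left pixel of each block in raster order."""
    blocks_x = (width + block - 1) // block   # round up so edge pixels are covered
    blocks_y = (height + block - 1) // block
    for by in range(blocks_y):                # rows of blocks: top to bottom
        for bx in range(blocks_x):            # blocks within a row: left to right
            yield bx * block, by * block

origins = list(block_origins(1920, 1080))
print(len(origins), origins[0], origins[1], origins[-1])
```

For a 1920 × 1080 image this gives 240 × 135 block positions, starting at (0, 0) and moving right before moving down.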
2 Parallel Huffman decoder
Huffman codewords have variable length. If the decoder is implemented serially, the number of cycles needed to decode one codeword varies with the code length, so a serial decoder is relatively inefficient for real-time systems. In addition, if the data is corrupted by noise during transmission, the entire data set becomes worthless. To address these two problems, this paper proposes the following solution. Figure 1 shows the main components and algorithm flow of Huffman decoding.
Algorithm flow: fetch 32 bits of compressed image data from the input, analyze the input bitstream to determine the code length, shift the input data by that length, and refill with new data from the input. The Huffman table translates the input code into the original data, and the sign bits embedded in the data stream are extracted. After a series of division and subtraction operations, the frequency-domain data as it was before encoding is recovered, combined with the previously extracted sign bits, and sent to the output buffer.
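As a point of reference for the last step, the baseline JPEG standard recovers a signed coefficient from the bits appended after each Huffman code with its EXTEND procedure. The sketch below follows that standard procedure; it is not necessarily the exact arithmetic of the circuit described here, and the function name is chosen for illustration.

```python
def extend(bits, size):
    """Convert `size` appended bits into a signed coefficient value.

    A leading 1 bit means the value is positive and is taken as-is;
    a leading 0 bit means it is negative and is offset by -(2**size - 1).
    """
    if size == 0:
        return 0
    if bits < (1 << (size - 1)):       # leading bit is 0: negative value
        return bits - (1 << size) + 1
    return bits                        # leading bit is 1: positive value

print(extend(0b101, 3), extend(0b010, 3))
```

With a size of 3, the bit patterns 101 and 010 map to 5 and -5 respectively.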
The algorithm in this paper exploits the structure of the Huffman table to eliminate multiplication and needs only one cycle to determine the code length. The code table entries are sorted by code length in ascending order, and entries with the same code length are sorted by codeword value in ascending order. Each table stores the decoding results (DR) for its codewords in ROM in this sorted order. This simplifies table lookup and minimizes the ROM size, in line with the low-power requirement. The lookup address generator derives a base address from the code length reported by the "Length Match" module and uses the code length to extract consecutive bits from the input data as the offset address; the sum of the two addresses is the ROM address at which the DR is stored [2].
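The base-plus-offset lookup can be sketched in software using the canonical Huffman tables that JPEG defines. All names here (`build_tables`, `decode_one`, `rom`) are hypothetical, and the loop over code lengths stands in for comparisons that hardware would evaluate in parallel in a single cycle. The sketch also subtracts the smallest code of each length to form the offset, which is the textbook form rather than the masked-offset circuit of this design.

```python
def build_tables(counts):
    """counts[i] is the number of codes of length i + 1 (canonical order)."""
    mincode, maxcode, base = {}, {}, {}
    code, index = 0, 0
    for length, n in enumerate(counts, start=1):
        if n:
            mincode[length] = code           # smallest codeword of this length
            maxcode[length] = code + n - 1   # largest codeword of this length
            base[length] = index             # ROM address of its first result
            code += n
            index += n
        code <<= 1                           # next length: append a 0 bit
    return mincode, maxcode, base

def decode_one(window, window_bits, tables, rom):
    """Return (decoding result, code length) for the code at the window's MSBs."""
    mincode, maxcode, base = tables
    for length in sorted(maxcode):           # the "length match" step
        prefix = window >> (window_bits - length)
        if prefix <= maxcode[length]:
            offset = prefix - mincode[length]
            return rom[base[length] + offset], length
    raise ValueError("no matching code")

# Toy table: codes 00, 01 (length 2) and 100, 101, 110 (length 3).
rom = ["A", "B", "C", "D", "E"]
tables = build_tables([0, 2, 3])
print(decode_one(0b10100000, 8, tables, rom))
```

Because the results are stored in code order, one small ROM and one addition suffice to locate any entry once the length is known.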
Because the key bits occupy the last few bits of the codeword, the input data is shifted according to the code length so that the last key bit lands in the nth bit position, and the shifter outputs only the nth bit and the few bits before it. Such a circuit needs only a barrel shifter controlled solely by the code length. In addition, an address correction mask is generated for each table, consisting of a run of 0s followed by a run of 1s, with as many 1s as there are key bits. This part of the circuit is logically simple and occupies little area. ANDing the address correction mask with the barrel shifter output yields the correct offset address. Since the Huffman table requires at most 9 bits and the maximum code length is 19 bits, this design uses a barrel shifter with a 19-bit input and a 9-bit output. The area of the improved circuit is reduced to about 50% of its area before the improvement.
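One plausible reading of the shift-and-mask step is sketched below. The 19-bit input and 9-bit output widths come from the text; the function name and the way the number of key bits is passed in are assumptions made for illustration.

```python
WINDOW_BITS = 19   # barrel shifter input width, from the text
OUT_BITS = 9       # barrel shifter output width, from the text

def key_bit_offset(window, code_len, key_bits):
    """Extract the offset address from the last `key_bits` bits of the codeword."""
    assert 1 <= code_len <= WINDOW_BITS and 0 <= key_bits <= OUT_BITS
    # Barrel shift: align the end of the codeword with bit 0, keep 9 output bits.
    shifted = (window >> (WINDOW_BITS - code_len)) & ((1 << OUT_BITS) - 1)
    # Address correction mask: a run of 0s followed by key_bits 1s.
    mask = (1 << key_bits) - 1
    return shifted & mask

# Codeword 101 at the top of the window, followed by 16 unrelated bits.
window = (0b101 << 16) | 0xFFFF
print(key_bit_offset(window, code_len=3, key_bits=2))
```

Here the two key bits of the codeword 101 are 01, so the offset is 1; the bits that follow the codeword in the window never reach the output.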
3 IDCT processor
Figure 2 shows the overall block diagram of the inverse discrete cosine transform (IDCT) circuit and the block diagram of the 2D IDCT unit. The DCT coefficients pass through the inverse quantization and inverse scanning circuits and are then written to the IDCT buffer. The global control circuit controls the input to the 2D IDCT unit and sends the final transformed data to the output buffer. A Ready signal is sent to the motion compensation unit to indicate that the IDCT data can be read. The 2D IDCT unit performs two 1D IDCT passes: a row-wise 1D IDCT first, then the intermediate results are transposed and buffered in the transposition memory, and finally a column-wise 1D IDCT produces the final result [3].
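The row-column decomposition can be checked with a small floating-point reference model. This is a direct, unoptimized form of the 8-point IDCT used only to show the two-pass structure; the function names are hypothetical, and the hardware described here uses a fast fixed-point algorithm instead.

```python
import math

N = 8

def idct_1d(coeffs):
    """Direct 8-point 1D IDCT (orthonormal scaling)."""
    out = []
    for x in range(N):
        s = 0.0
        for u in range(N):
            cu = math.sqrt(0.5) if u == 0 else 1.0
            s += cu * coeffs[u] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
        out.append(s / 2)                  # sqrt(2 / N) equals 1/2 for N = 8
    return out

def transpose(m):
    return [list(row) for row in zip(*m)]

def idct_2d(block):
    """2D IDCT as two 1D passes with a transpose in between."""
    rows = [idct_1d(r) for r in block]               # first pass: row-wise
    cols = [idct_1d(c) for c in transpose(rows)]     # second pass: column-wise
    return transpose(cols)                           # restore row-major order

# A block with only a DC coefficient decodes to a flat block.
dc_only = [[0.0] * N for _ in range(N)]
dc_only[0][0] = 8.0
pixels = idct_2d(dc_only)
print(round(pixels[0][0], 6), round(pixels[7][7], 6))
```

The `transpose` call plays the role of the transposition memory: it lets the same 1D routine serve both passes.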
The IDCT design uses zero-value judgment logic, gated clocks, and a parallel pipeline, which greatly reduce the power consumption of the circuit while still meeting the processing speed and accuracy requirements.
3.1 Zero-value judgment logic circuit
During image decoding, about 8% of the data blocks have DCT coefficients of zero, and performing the IDCT on these zero values is pointless. This design therefore adds zero-value judgment logic to eliminate the unnecessary multiplications. The zero-value judgment logic circuit consists of an 8 × 8 accumulator array, a zero-value judgment module, and a multiplexer (MUX). When the zero-value judgment module finds that the operands are not all zero, the enable signal goes high, the operands are loaded into the registers, and the multiplication is performed. If the operands are all zero, the accumulator array is disabled and the MUX outputs 0 directly. The zero-value judgment logic effectively reduces power consumption, and because the circuit is simple, its area and delay overhead are almost negligible.
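The bypass behavior can be modeled in a few lines. This is a behavioral sketch only: the function name, the `stats` counter, and the multiply-accumulate form are assumptions used to make the skipped work visible.

```python
def mac_with_zero_bypass(operands, weights, stats):
    """Multiply-accumulate that is skipped when every operand is zero."""
    if not any(operands):              # zero-value judgment: all operands are 0
        stats["skipped"] += 1
        return 0                       # the MUX selects the constant 0
    stats["computed"] += 1             # enable is high: the multipliers run
    return sum(o * w for o, w in zip(operands, weights))

stats = {"skipped": 0, "computed": 0}
weights = [1, 2, 3, 4]
results = [
    mac_with_zero_bypass([0, 0, 0, 0], weights, stats),
    mac_with_zero_bypass([5, 0, 0, 1], weights, stats),
]
print(results, stats)
```

The result is the same with or without the bypass; the saving comes from not toggling the arithmetic units when the answer is known to be zero.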
3.2 Latch-based gated clock
Controlling a circuit's input clock allows part of the circuit to run at a lower frequency or stop entirely, which reduces the power consumption of the whole circuit. The 2D DCT / IDCT circuit consists of three main parts: the 1D DCT / IDCT unit, the transpose memory, and the input/output processing unit.
The transpose memory is updated only at the end of each 1D DCT / IDCT pass, and the input/output processing unit works only while data is being input or output. Gating the input clock of these parts so that they are idle most of the time therefore effectively reduces power consumption. The design results show that system power consumption is reduced by 13% at the cost of only a 2% increase in area.
A latch-based gated clock provides this function. Its advantages are that it needs no data selector, occupies little area, reduces the capacitance on the clock network, and lowers the internal power consumption of the gated registers. The latch-based clock gating circuit and its timing are shown in Figure 3.
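The behavior of a conventional latch-based clock gate can be modeled sample by sample. The sketch assumes the common arrangement in which the enable is captured by a latch that is transparent while the clock is low and the gated clock is the AND of the clock and the latched enable; the exact circuit of Figure 3 may differ.

```python
def gated_clock(clk_samples, en_samples):
    """Behavioral model of a latch-based clock gate."""
    latched = 0
    out = []
    for clk, en in zip(clk_samples, en_samples):
        if clk == 0:
            latched = en            # latch is transparent while the clock is low
        out.append(clk & latched)   # AND gate produces the gated clock
    return out

clk = [0, 1, 0, 1, 0, 1, 0, 1]
en  = [1, 1, 0, 1, 0, 0, 1, 0]      # en changes during some high phases
print(gated_clock(clk, en))
```

Changes on the enable input while the clock is high are held back by the latch, so the gated clock never shows a truncated or spurious pulse.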
3.3 Parallel pipeline
This design replaces the floating-point multiplication unit of the fast IDCT algorithm with addition and shift operations and uses a highly parallel, pipelined VLSI structure to speed up processing, so that the data processing time is less than 1/5 of that of the serial structure. The clock frequency can therefore be lowered to about 1/5 of that of the serial structure, which reduces system power consumption. For example, two 16 × 8 multipliers compute the high-order and low-order partial products in parallel, and the two partial products are then combined by a shift and an addition. The circuit overlaps operations in time and reuses and shares resources, which increases the parallelism of the system and thus the speed and efficiency of the multiplication circuit.
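Both ideas in this section can be illustrated with integer arithmetic. The sketch assumes unsigned 16-bit operands for the split multiplier, and the constant 181/256 (an approximation of cos(π/4)) is chosen here purely as an example of a shift-and-add constant multiplication; neither detail is taken from the design itself.

```python
def mul16x8(a, b8):
    """Model of one 16 x 8 unsigned multiplier."""
    assert 0 <= a < (1 << 16) and 0 <= b8 < (1 << 8)
    return a * b8

def mul16x16_split(a, b):
    """16 x 16 product from two 16 x 8 partial products."""
    b_hi, b_lo = b >> 8, b & 0xFF
    p_hi = mul16x8(a, b_hi)        # high-order partial product
    p_lo = mul16x8(a, b_lo)        # low-order partial product (independent of p_hi)
    return (p_hi << 8) + p_lo      # shift and add

def times_181_over_256(x):
    """x * 181 / 256 using only shifts and adds (181 = 128 + 32 + 16 + 4 + 1)."""
    return ((x << 7) + (x << 5) + (x << 4) + (x << 2) + x) >> 8

print(mul16x16_split(1234, 5678) == 1234 * 5678, times_181_over_256(256))
```

The two partial products do not depend on each other, which is what allows the two multipliers to run in the same cycle.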
4 Simulation and synthesis results
A JPEG image of 1920 × 1080 pixels is used for testing. Figure 4 shows the ModelSim waveform from RTL simulation, in which JPEG_DATA is the code stream data and OutR, OutG, and OutB are the decoded outputs [4]. The decoding core module is synthesized at a frequency of 100 MHz [5]; the results are shown in Table 1.
Unlike earlier software implementations of JPEG decoding, this work implements JPEG decoding in hardware and, at the same time, improves the hardware structure and reduces decoding energy consumption through several easy-to-apply techniques. Verification with EDA tools shows that the design fully meets the requirements of hardware JPEG image decoding.