Chapter 3

MPEG Compression

3.1 Why Compress?

Good video compression is an essential factor in the successful implementation of a real-time video system.

A video stream consists of a sequence of pictures, and it can occupy an enormous amount of storage space if represented in raw digital form. Take, for example, a series of pictures that are each digitized into 360 picture elements (pels or pixels) per raster line and 288 lines (a resolution of 360-by-288 pels). Assume that every picture element is represented by three color components: red, green and blue. If each component has 8-bit precision, then each picture occupies approximately 300Kbytes of storage space (360 x 288 x 3 bytes). Assuming that 24 pictures are displayed every second, the data rate for this sequence is almost 60Mbps (300 x 1024 x 8 bits/picture x 24 pictures/sec), and a one-minute video clip takes up about 448Mbytes of storage.
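The arithmetic above can be checked with a short Python sketch. The constant and function names are illustrative only. The sketch uses the exact byte count per picture rather than the rounded 300Kbyte figure, so its results differ slightly from the rounded values quoted in the text.

```python
# Raw (uncompressed) video size for the example in the text.
WIDTH, HEIGHT = 360, 288      # pels per line, lines per picture
BYTES_PER_PEL = 3             # 8 bits each for red, green and blue
PICTURES_PER_SEC = 24

def raw_video_figures(seconds):
    """Return (bytes per picture, bits per second, total bytes)."""
    bytes_per_picture = WIDTH * HEIGHT * BYTES_PER_PEL
    bits_per_second = bytes_per_picture * 8 * PICTURES_PER_SEC
    total_bytes = bytes_per_picture * PICTURES_PER_SEC * seconds
    return bytes_per_picture, bits_per_second, total_bytes

per_picture, bit_rate, one_minute = raw_video_figures(60)
print(per_picture)   # 311040 bytes, roughly 300 Kbytes
print(bit_rate)      # 59719680 bits/s, just under 60 Mbps
print(one_minute)    # 447897600 bytes, about 448 million bytes
```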

The need to compress a video sequence is obvious. The bandwidth capacity of an Ethernet network, which is 10Mbps, is certainly not enough for the transmission of non-compressed video: data can only be transmitted at 10 million bits per second. If an application wants to transmit the 60Mbps video in real time, it will require a bandwidth capacity six times higher than what an Ethernet network has to offer.

This chapter focuses on MPEG-1, the video standard used in this thesis to produce a stream of compressed video suitable for transmission. The standard defines a set of compression techniques for encoding a stream of raw digital video. The encoded stream is later decompressed, upon arrival at its destination, to reproduce the video.

3.2 Background

The name MPEG is an acronym for Moving Picture Experts Group. This committee was formed under the auspices of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) to develop standards for the compression and synchronization of audio and video bitstreams. MPEG is also known as the working group ISO/IEC JTC1/SC29/WG11.

At the time of this writing, there are three known MPEG encoding standards for digital video and audio: MPEG-1, MPEG-2 and MPEG-4. All three allow for synchronization of the audio-visual information. Data rate and intended application are the distinguishing factors among them. MPEG-1 is intended for intermediate data rates of approximately 1.5Mbps. MPEG-2 is an extension of MPEG-1 and is intended for higher data rates of 10Mbps or more. MPEG-4, which is currently in the algorithmic development stage, is designed for very low data rates of about 64Kbps and is therefore potentially useful for telecommunications purposes. MPEG-1 has been chosen as the encoding type for this thesis because of its low data rate and the availability of real-time MPEG-1 encoders in the market.

The main advantage of using the MPEG standard is that it can provide motion-picture-quality display despite its compact nature. Consider again, from the example in the previous section, the one-minute uncompressed video clip that displays at a data rate of 60Mbps. Compared to an MPEG-1 encoded video stream, which can display at a rate of 1.25Mbps, the non-compressed video's data rate is 48 times higher. If a real-time streaming application were sending video at a rate equivalent to the display rate, it would be almost impossible to transmit a 60Mbps video stream over any network unless the application were running on a high-speed ATM network whose bandwidth capacity is 155Mbps. Even so, an application's bandwidth requirement should always be minimized, because other applications on the network may be competing for the available bandwidth. Since the real-time application would be constantly streaming data at 60 million bits per second into the network, its channel utilization would be almost 40 percent of the ATM network's bandwidth capacity.
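The ratios quoted above can be reproduced with the following sketch. It assumes that the ATM capacity of 155 is expressed in megabits per second, which is the reading consistent with the 40 percent utilization figure. The Ethernet line for the MPEG-1 stream is an additional illustration, not a figure taken from the text.

```python
# Data rates in megabits per second, as used in the text.
RAW_MBPS = 60.0        # uncompressed example stream
MPEG1_MBPS = 1.25      # MPEG-1 display rate quoted in the text
ETHERNET_MBPS = 10.0   # Ethernet capacity
ATM_MBPS = 155.0       # ATM capacity (assumed to be megabits per second)

def utilization(stream_mbps, link_mbps):
    """Fraction of a link's capacity consumed by one stream."""
    return stream_mbps / link_mbps

print(RAW_MBPS / MPEG1_MBPS)                                   # 48.0
print(round(utilization(RAW_MBPS, ATM_MBPS) * 100, 1))         # 38.7
print(round(utilization(MPEG1_MBPS, ETHERNET_MBPS) * 100, 1))  # 12.5
```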

There are two methods of encoding MPEG. The first is to program the compression stages at a software level. Although a relatively cheap solution, it is unfortunately extremely slow. The goal of providing real-time encoded video would not be accomplished if this method were used.

The alternative is to use a hardware-based encoder, which is far more efficient because it is dedicated to generating MPEG audio and video. Initially, the main encoders available were dedicated RISC processor-based machines, within reach only of those who could afford a $75,000 piece of technology. Fortunately, newer technologies have since reduced the production costs of hardware-based encoders to such an extent that they are more affordable now than ever.

3.3 Video Stream Structure

Recall that digital video in its raw form is a sequence of pictures, each consisting of an array of picture elements (pels or pixels). As will be discussed in the next section, a series of compression techniques is applied to the very large raw video stream to produce a compressed MPEG-1 stream. The corresponding decompression steps can then be applied to reproduce the stream for display. Because the quantization step discards some information, the result is a close approximation of the original rather than an exact copy, but its quality of presentation is largely preserved.

The MPEG-1 standard contains a set of rules that govern how the video stream should be encoded for the best compression result. Within the structure of an MPEG-1 stream lies a sequence of "packets". Each of these packets is hierarchically subdivided into smaller chunks, as shown in Figure 3.1.

The top-level object in the stream is the video sequence (labeled "Seq" in Figure 3.1). The MPEG video sequence begins with a sequence header (labeled "Seq SC" in Figure 3.1). Following this is a sequence of parameter packets and one or more groups of pictures (GOPs).

A group of pictures (GOP) contains a header and a series of one or more compressed pictures that allow for random access into the sequence. These pictures, as will be seen in more detail later, are categorized into three types: Intra-coded, Predictive and Bi-directional pictures.

In a compressed picture, the smallest coding unit in the MPEG algorithm is a block (pictured in Figure 3.1 as b0, ... , b5). A block is an 8-pixel by 8-line set of values (rectangular matrix) that represents either a luminance or chrominance component. There are three types of blocks: the luminance block (Y), the blue chrominance block (Cb) and the red chrominance block (Cr).

A 16-by-16 pixel segment of four Y-blocks, a Cr-block and a Cb-block would altogether compose a macroblock. The macroblock, shown as "MB" in Figure 3.1, is the basic coding unit in the MPEG algorithm. Figure 3.2 gives a pictorial composition of a macroblock.
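The composition of a macroblock can be sketched as follows. The sketch assumes that each chrominance component is subsampled by two in both directions, so that a 16-by-16 region yields one 8-by-8 Cb-block and one 8-by-8 Cr-block. The 2-by-2 averaging used for subsampling and all function names are illustrative assumptions, not something the standard prescribes.

```python
def split_luma(y16):
    """Split a 16x16 luminance array into four 8x8 blocks, ordered
    top-left, top-right, bottom-left, bottom-right (b0..b3)."""
    blocks = []
    for row0 in (0, 8):
        for col0 in (0, 8):
            blocks.append([row[col0:col0 + 8] for row in y16[row0:row0 + 8]])
    return blocks

def subsample_chroma(c16):
    """Reduce a 16x16 chrominance array to 8x8 by averaging each 2x2 group."""
    return [[(c16[2 * r][2 * c] + c16[2 * r][2 * c + 1]
              + c16[2 * r + 1][2 * c] + c16[2 * r + 1][2 * c + 1]) // 4
             for c in range(8)]
            for r in range(8)]

def make_macroblock(y16, cb16, cr16):
    """Return the six 8x8 blocks b0..b5: four Y, then Cb, then Cr."""
    return split_luma(y16) + [subsample_chroma(cb16), subsample_chroma(cr16)]

# Example: a luminance ramp and flat chrominance planes.
y16 = [[r * 16 + c for c in range(16)] for r in range(16)]
flat = [[128] * 16 for _ in range(16)]
macroblock = make_macroblock(y16, flat, flat)
print(len(macroblock))                              # 6
print([block[0][0] for block in macroblock[:4]])    # [0, 8, 128, 136]
```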

A picture in an MPEG sequence, however, is not simply an array of macroblocks. Rather, it is composed of slices, where each slice consists of one or more contiguous macroblocks, ordered from left-to-right and top-to-bottom.

Contiguous pictures within a video sequence often tend to contain similar information; this is known as temporal redundancy. The compression standard utilized by MPEG takes advantage of this fact by representing some pictures as the differences from other referenced pictures. As mentioned before, there are three kinds of compressed pictures used by the MPEG standard to form a closed group of pictures, namely intra-coded frame (I-frame), predictive-coded frame (P-frame) and bidirectionally-predictive-coded frame (B-frame).

I-frames are coded independently without reference to other pictures within a video sequence. P- and B-frames, however, are coded in reference to either I- or P-frames. A P-frame codes the differences between itself and a referenced picture from before. A B-frame extends this idea and codes the differences from the nearest preceding and upcoming pictures in a video sequence. In a closed group of pictures, P- and B-frames are predicted only from other pictures within that group of pictures.

3.4 MPEG-1 Coding Methods

Encoding MPEG can be application- or user-specific. The MPEG standard allows an encoder to choose the frequency and location of the I-frames, based on an application's need (or a user's choice) for random accessibility. I-frames are chosen as the "anchor" because of their independence from all other pictures within a GOP. Where random access is important to an application, I-frames would usually occur about twice per second.


In an MPEG video stream, encoded pictures are arranged in a particular order for decoding efficiency. As can be seen in Figure 3.3, the pictures are ordered differently in the encoded stream and in the displayed stream. In the encoded stream, B-frames are placed after the P-frame (or I-frame) that follows them in display order, since the decoder needs that reference picture before it can reconstruct them. When the decoder displays the pictures, the B-frames are put back in front of that reference frame.
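The reordering can be illustrated with a minimal sketch. The picture pattern "IBBPBBP" is an assumed example and is not taken from Figure 3.3. The function holds back each run of B-frames until the reference picture that follows it has been emitted. Any B-frames left at the end are simply appended, which is a simplification.

```python
def display_to_coded_order(display):
    """Reorder picture types from display order to coded order.

    `display` is a sequence of "I", "P" or "B". Each returned entry is
    (display index, picture type).
    """
    coded, pending_b = [], []
    for index, kind in enumerate(display):
        if kind == "B":
            pending_b.append((index, kind))   # wait for the next reference
        else:
            coded.append((index, kind))       # reference picture goes first
            coded.extend(pending_b)           # then the B-frames that need it
            pending_b = []
    coded.extend(pending_b)                   # simplification for trailing Bs
    return coded

coded = display_to_coded_order("IBBPBBP")
print(" ".join(f"{kind}{index}" for index, kind in coded))
# I0 P3 B1 B2 P6 B4 B5
```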

Image blocks have high spatial redundancy. That is, some picture elements (pixels) within a block may have similar values or contain details that the human eye cannot readily perceive. To reduce such redundancies, the MPEG algorithm first changes the pixel representation by transforming each 8-by-8 pixel block from the spatial domain to the frequency domain by means of a method called the Discrete Cosine Transform (DCT). The transform produces an 8-by-8 array of frequency coefficients that represents the data block. DCT is the heart of I-frame encoding.
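A direct, unoptimized form of the 8-by-8 transform is sketched below, assuming the usual two-dimensional DCT-II definition with a scaling factor of 1/4 and a factor of 1/sqrt(2) on the zero-frequency terms. Real encoders use fast factorizations, so the nested loops here serve only to show what is being computed.

```python
import math

N = 8

def dct_8x8(block):
    """Two-dimensional DCT-II of an 8x8 block.

    block[y][x] holds pixel values; the result is indexed as out[v][u],
    where u is the horizontal and v the vertical spatial frequency.
    """
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[v][u] = 0.25 * c(u) * c(v) * total
    return out

# A perfectly flat block has all of its energy in the zero-frequency term.
flat_block = [[100] * N for _ in range(N)]
coefficients = dct_8x8(flat_block)
print(round(coefficients[0][0], 6))   # 800.0
```

In the flat example every coefficient other than the first is zero up to floating-point error, which is the kind of redundancy the later coding stages exploit.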

After the DCT is computed for a data block, it is desirable to represent its frequency coefficients with less precision in exchange for greater compression. This is achieved through a process called quantization. In this process, a DCT coefficient is quantized by dividing it by a non-zero positive integer (the quantization value) and rounding the resulting quotient to the nearest integer. A matrix of quantization values is chosen by the encoder to determine how each frequency coefficient in the 8-by-8 block should be quantized. For example, large quantization values are used on high spatial frequencies, thus selectively discarding high-spatial-frequency activity that the human eye can hardly detect.
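The divide-and-round step can be sketched as below. The quantization matrix is a made-up example whose values grow with spatial frequency and is not the default matrix of the standard. The sketch also ignores the quantizer scale factor and the exact rounding rules that a real MPEG-1 encoder applies.

```python
def quantize(coeffs, qmatrix):
    """Divide each coefficient by its quantization value and round to
    the nearest integer (halves are rounded away from zero)."""
    def q(value, step):
        magnitude = int(abs(value) / step + 0.5)
        return magnitude if value >= 0 else -magnitude
    return [[q(coeffs[v][u], qmatrix[v][u]) for u in range(8)]
            for v in range(8)]

def dequantize(levels, qmatrix):
    """Decoder side: multiply each level by its quantization value."""
    return [[levels[v][u] * qmatrix[v][u] for u in range(8)]
            for v in range(8)]

# Hypothetical matrix: 8 at the lowest frequency, 64 at the highest.
ILLUSTRATIVE_QMATRIX = [[8 + 4 * (u + v) for u in range(8)] for v in range(8)]

coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0] = 800.0    # zero-frequency term
coeffs[0][1] = -30.0    # low-frequency detail
coeffs[7][7] = 20.0     # high-frequency detail

levels = quantize(coeffs, ILLUSTRATIVE_QMATRIX)
restored = dequantize(levels, ILLUSTRATIVE_QMATRIX)
print(levels[0][0], levels[0][1], levels[7][7])        # 100 -3 0
print(restored[0][0], restored[0][1], restored[7][7])  # 800 -36 0
```

The high-frequency coefficient is discarded entirely and the low-frequency one comes back as -36 rather than -30, which shows where information is lost in exchange for compression.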


The quantized DCT coefficients are then put through a process of lossless encoding. That is, a decoder can reconstruct the same DCT values in a precise manner from the resulting encoding. The lossless encoding is based on the Huffman coding technique. Many of the quantized DCT frequency coefficients will be zero, especially for those coefficients of high spatial frequencies. Taking advantage of this fact, a scan through the 8-by-8 array of coefficients (see Figure 3.4) reorders them in a zigzag manner to group long runs of zeros for maximum compression. (As an estimated rule of thumb, the more zero coefficients there are, the better the compression.) These reordered coefficients are then converted into a series of run-amplitude pairs, with each pair indicating a number of zero coefficients and the amplitude of a non-zero coefficient. These pairs are then coded with the Huffman encoding scheme, also known as variable-length coding, which uses variable-length codes for representing the run-amplitude pairs. Shorter variable-length codes are used for the more commonly occurring pairs and longer codes for the less common pairs.
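The zigzag scan and the conversion into run-amplitude pairs can be sketched as follows. As a simplification, the sketch treats all 64 coefficients alike, and it assumes that the zeros after the last non-zero coefficient are implied by an end-of-block marker instead of being emitted. The variable-length code tables themselves are not reproduced here.

```python
def zigzag_order(n=8):
    """Return (row, column) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):                 # s = row + column (one diagonal)
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)              # even diagonals run upward
        order.extend((r, s - r) for r in rows)
    return order

def run_amplitude_pairs(levels):
    """Convert a quantized block into (zero run, amplitude) pairs."""
    pairs, run = [], 0
    for r, c in zigzag_order(len(levels)):
        value = levels[r][c]
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs   # trailing zeros are left to an end-of-block marker

levels = [[0] * 8 for _ in range(8)]
levels[0][0], levels[0][1], levels[1][1] = 100, -3, 2

print(zigzag_order()[:6])
# [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
print(run_amplitude_pairs(levels))
# [(0, 100), (0, -3), (2, 2)]
```

The 64 values of the block collapse into three pairs here, and a variable-length code would then assign the shortest codewords to the pairs that occur most often.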

P- and B-frames are coded as differences. One technique of computing these differences is called motion compensation. Its purpose is to reduce temporal redundancy. The motion compensation algorithm works at the macroblock level. In doing motion compensation for a picture, motion vectors are calculated from inter-frame macroblocks. A motion vector is the spatial difference between the reference frame's macroblock and the macroblock being coded.
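One simple way to obtain a motion vector is an exhaustive block-matching search, sketched below. The method an encoder uses to find vectors is an implementation choice, so the sum-of-absolute-differences cost, the search range of 7 pels and the function names are all assumptions made for illustration. The vector is expressed here as the offset from the position of the macroblock being coded to the best-matching position in the reference picture.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between two size x size regions."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))

def find_motion_vector(cur, ref, cx, cy, size=16, search=7):
    """Exhaustively search `ref` for the best match to the macroblock of
    `cur` whose top-left corner is (cx, cy). Returns ((dx, dy), cost)."""
    height, width = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, cx, cy, cx, cy, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, ref, cx, cy, rx, ry, size)
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
    return best, best_cost

def pattern(x, y):
    """Synthetic picture content used only for this example."""
    return (7 * x + 13 * y) % 251

# The current picture is the reference moved 2 pels right and 1 pel down.
reference = [[pattern(x, y) for x in range(32)] for y in range(32)]
current = [[pattern(x - 2, y - 1) for x in range(32)] for y in range(32)]

vector, cost = find_motion_vector(current, reference, 8, 8)
print(vector, cost)   # (-2, -1) 0
```

A cost of zero means the macroblock is reproduced exactly from the reference picture, so only the vector needs to be transmitted for it.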

The difference between B- and P-frame motion compensation is that macroblocks in a P-frame use a previous reference picture only, while macroblocks in a B-frame may be coded from a previous reference picture, a future reference picture, or both.

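The I-frame is not the only frame that uses DCT and quantization for compression. P- and B-frame encodings use them as well. After motion compensation, the remaining difference between each macroblock and its motion-compensated prediction is transformed into DCT coefficients, which are then quantized. The motion vectors themselves are coded separately.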

3.5 Summary

To summarize, the encoding of video into an MPEG stream combines a number of compression techniques: Discrete Cosine Transform and quantization for both intra- and inter-frame encoding; Huffman (variable-length) encoding of the quantized coefficients; and motion compensation for inter-frames.

These compression techniques imply that the MPEG encoding process requires a hefty amount of computation and is time-consuming. For that reason, producing MPEG video in real time has to be done in hardware rather than in software. Hardware encoding is faster and, therefore, more desirable.