This chapter discusses related contributions by a number of researchers
working toward successful real-time video transmission.
Several approaches have been suggested to control the loss
and latency of video packets on a network. At the time of writing,
I have explored and implemented methods that are similar to
some of the approaches advocated by these researchers.
In particular, I have been motivated by the approach of
(Papadopoulos et al, 1996) and have borrowed some of their ideas while implementing
my program. Certain works discussed in this chapter have led
to the potential development of industry-wide standards while
others have provided improvements to the technologies currently
exploited by commercial software.
Although the Internet is well known for its openness, scalability and various other qualities, the services it currently provides are not adequate for time-critical data transmission such as video or audio communication. One primary obstacle, already mentioned in Chapter One, is the Transmission Control Protocol (TCP). Because of its congestion control mechanism, TCP limits the amount of data an application can send into the network at any one time in order to minimize congestion. In addition, it makes the transmission reliable by resending dropped packets until every packet intended for the receiver arrives at its destination. Together, these mechanisms can add considerable latency between the data source and its destination. For real-time applications, such latency is undesirable.
To minimize latency, researchers - (Schulzrinne), (Papadopoulos
et al, 1996) - have suggested the use of datagram services (for
example, the User Datagram Protocol, UDP) instead of guaranteed
services. With the datagram service class (also known as the
best-effort service class), packet delivery is quicker and therefore
more likely to satisfy an arrival deadline. Of course, even with
UDP, network latency is still a reality because cross-network
propagation takes time. Also, with UDP, packet delivery is not
guaranteed. To recover from any loss and to conceal the inherent
latency, several approaches have been proposed to improve the
performance of continuous video playback.
(Papadopoulos et al, 1996) suggest that even though packet retransmission has generally been considered inappropriate for video streaming because of the latency it adds, it remains attractive because it requires minimal processing cost and less bandwidth than other recovery/prevention methods (see Sections 4.6 through 4.8).
A retransmission initiated by the receiver requires at least one additional round-trip delay to recover the lost packets from the sender. A receiver can detect and identify a lost packet by associating a time-out value with each packet expected to arrive. If the timer expires before the associated packet arrives, the receiver requests a retransmission. An alternative to timer-based loss detection, advocated by (Papadopoulos et al, 1996), is to have each packet carry a sequence number. With this number, a receiver can detect a gap whenever a packet arrives with a higher sequence number than expected. The size of the gap gives the number of lost packets, and the gap is detected only after a later packet has been received.
(Papadopoulos et al, 1996) suggest that gap-based loss detection has several advantages over timer-based detection. Lost packets can be detected more quickly with the gap-based method, provided that data is sent frequently and that the gap of lost packets between any two received packets is not too large. A large gap would imply a prolonged period of packet loss on the network. In addition, the gap-based method does not need per-packet timers.
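The gap-based detection described above can be illustrated with a minimal sketch. It assumes that sequence numbers start small, increase by one per packet and never wrap around, and that packets arrive in order; the function name detect_gaps is hypothetical and is not taken from (Papadopoulos et al, 1996).

```python
def detect_gaps(arrivals):
    """Yield (arrived_seq, missing_seqs) each time an arrival reveals a gap.

    Assumes in-order delivery and sequence numbers that grow by one
    per packet without wrapping (both are simplifying assumptions).
    """
    expected = None  # next sequence number we expect to see
    for seq in arrivals:
        if expected is not None and seq > expected:
            # The gap only becomes visible once a later packet arrives.
            yield seq, list(range(expected, seq))
        if expected is None or seq >= expected:
            expected = seq + 1


if __name__ == "__main__":
    for arrived, missing in detect_gaps([0, 1, 2, 5, 6, 9]):
        print(f"packet {arrived} arrived; missing {missing}")
```

Note that no timer is involved: the loss of packets 3 and 4 is noticed as soon as packet 5 arrives, which is why the method depends on data being sent frequently.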
With timer-based detection, it can be difficult to determine an accurate time-out value for each packet. To choose that value, the receiver has to estimate the amount of time it would take to retransmit and recover from a loss. This value should be equivalent to the round-trip propagation delay (2 x network latency). The round-trip delay, however, is difficult to estimate because network latency fluctuates over time, especially when the network is congested. Typically, time-out values are set several times higher than the round-trip delay, which adds a significant delay to loss detection. If we instead reduced the time-out to a value equivalent to the round-trip delay, ignoring the fact that packets are frequently delayed by a few time units beyond that value, timer-based retransmission could become expensive in network bandwidth, because the receiver would request retransmissions prematurely for packets that are merely late.
Gap-based loss detection, however, is applicable only if the
underlying network preserves the packet sequencing order. Packets
sent over the Internet sequentially using UDP could experience
reshuffling when they are "floating" on the network;
since not all packets would take the same routes to get to their
destination, the propagation of each is unpredictable, thus resulting
in out-of-order packet delivery at the receiver end.
The life span of each real-time video packet is limited. Once the receiver has used a packet for display, the packet is useless and is discarded. Similarly, a packet that arrives too late for display is also discarded. This implies that if the receiver knows ahead of time that a retransmitted packet will not arrive in time for display, it should not waste effort trying to recover the packet. Avoiding such late retransmissions not only saves network bandwidth but also avoids contributing to network congestion.
By allowing the receiver to keep an estimate of the round-trip
delay and the display time of each packet, a late retransmission
can be prevented. A retransmission request for a particular packet
is valid only if the time left before its display is greater than
the current round-trip-delay estimate.
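The validity rule for a retransmission request can be written as a one-line test. The sketch below is only an illustration; the function and parameter names are hypothetical, and all times are assumed to be in seconds on a common clock.

```python
def should_request_retransmission(display_time, now, rtt_estimate):
    """Return True only if the packet can plausibly arrive before display.

    display_time: when the packet is due to be displayed (seconds)
    now:          the current time on the same clock (seconds)
    rtt_estimate: the receiver's current round-trip-delay estimate (seconds)
    """
    time_left = display_time - now
    return time_left > rtt_estimate


if __name__ == "__main__":
    # 400 ms left, 250 ms round trip: the request is worthwhile.
    print(should_request_retransmission(10.4, 10.0, 0.25))
    # 100 ms left, 250 ms round trip: the retransmission would arrive late.
    print(should_request_retransmission(10.1, 10.0, 0.25))
```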
(Chen et al) suggested in their work on Vosaic, a WWW browser,
that since MPEG frames are inter-dependent, a real-time application
can selectively request retransmission of only the I frames.
Recall from Chapter 3 that the three MPEG frame types are arranged
into groups with display sequence that may correspond to the pattern
I B B P B B P B B. The I frame (intra-coded) is
the anchor of this group and is needed by all P and B
frames for decoding. Should the I frame be lost, the rest
of the frames within the sequence group would be undecodable.
Ultimately, the quality of video display is strongly dependent
on the I frame. The application may choose to request
retransmission of only the I frames while making the other
frame types dispensable should losses occur. By doing so, the
application can save on the amount of bandwidth required during
packet retransmissions.
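As a rough illustration of this selective policy, the sketch below filters a list of lost frames down to the I frames. It assumes that every group repeats the fixed pattern I B B P B B P B B and that a frame's type can be derived from its display index; both are simplifications of mine, and the names are hypothetical rather than taken from Vosaic.

```python
GOP_PATTERN = "IBBPBBPBB"  # display-order pattern quoted in the text


def frame_type(frame_index, pattern=GOP_PATTERN):
    """Return 'I', 'P' or 'B' for a frame, assuming a fixed repeating pattern."""
    return pattern[frame_index % len(pattern)]


def frames_to_request(lost_frame_indices, pattern=GOP_PATTERN):
    """Keep only the lost frames worth retransmitting under an I-frame-only policy."""
    return [i for i in lost_frame_indices if frame_type(i, pattern) == "I"]


if __name__ == "__main__":
    # Frames 0 and 9 are I frames; 2 and 13 are B frames; 3 is a P frame.
    print(frames_to_request([0, 2, 3, 9, 13]))  # [0, 9]
```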
The use of a playout buffer queue as temporary storage for incoming prefetched packets at the receiving end has been adopted by many researchers (Papadopoulos et al, 1996), (Chen et al), (Feng et al), (McManus et al), (Reibman et al), (Salehi et al), (Rexford et al) and (Zhang et al) to develop techniques that "smooth out" video displays. The important notion to draw from these techniques is that buffers improve the chances of recovering from a loss through retransmission or of absorbing an unanticipated delay. Said another way, the time available to recover a lost or delayed packet may increase severalfold if a buffer is introduced into a real-time application. (Rexford et al) take this notion one step further by asserting that a larger buffer can reduce the bandwidth requirements at the expense of a longer startup delay. That is, with a larger buffer the application could "slow down" its transmission of packets, which would reduce the bandwidth it requires from the network, but to obtain the smoother display the receiving end would have to wait longer before it can begin the display.
Clearly, the size of the buffer is a tradeoff between the recovery
time and the startup delay imposed by the application. The larger
the buffer, the better the application's chances to recover but
at the same time, the amount of time a user has to wait before
the video presentation begins playing increases. Ideally, however,
the application employs a buffer size that is sufficient to
store incoming packets while a retransmission is ongoing. This
is equivalent to at least the number of packets expected to arrive
in a round-trip duration (Chen et al). Of course, the buffer
can be made as large as one chooses, but that may allocate more
memory than is actually needed.
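The lower bound on the buffer size can be made concrete with a small calculation. The sketch assumes a steady packet arrival rate and a single round-trip-delay estimate; the function name and the sample figures are illustrative only.

```python
import math


def min_buffer_packets(packet_rate_pps, rtt_seconds):
    """Smallest buffer (in packets) that covers one round-trip duration.

    packet_rate_pps: packets expected per second (assumed steady)
    rtt_seconds:     estimated round-trip delay in seconds
    """
    return math.ceil(packet_rate_pps * rtt_seconds)


if __name__ == "__main__":
    # 100 packets per second with a 250 ms round trip.
    print(min_buffer_packets(100, 0.25))  # 25
```

A buffer of this size holds exactly the packets that arrive while one retransmission request is outstanding; anything larger buys additional recovery attempts at the cost of a longer startup delay.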
Many smoothing techniques have been developed for compressed, pre-recorded video streams, and they rely on pre-computed information such as the frame sizes and the length of playout. In contrast to stored video, live applications have limited knowledge of this information and therefore cannot accurately predict what is to come. Frame size may vary from frame to frame, and the size of a frame cannot be determined until the frame has been produced, so a live application cannot provide information about future frames.
Existing techniques for transmitting pre-recorded video, however, can serve as the foundation for developing new techniques more appropriate for live video.
An important consideration is the bandwidth consumed by the live video stream. MPEG streams often require high bandwidth (1 Mbps) and compete for resources on the network. Furthermore, the transmission rate varies due to the nature of MPEG compression; the I, P, and B frames of the stream vary in size. The I frame is typically the largest, followed by the P frame, and the B frame is the most compact of the three. Keep in mind also that frames are consumed at a constant rate. Hence, if one frame were transmitted per frame interval, the result would be a variable-bit-rate transmission. Clearly, the bandwidth requirement is higher for I frame transmissions than for P or B frame transmissions. Such "bursty" behavior in the MPEG stream may lead to inefficient transmission because the channel's resources will be either (a) over-utilized when the bandwidth requirement is high, thus causing congestion and loss, or (b) under-utilized when the bandwidth requirement is low (i.e., when transmitting B frames). However, if the stream's irregular consumption could be readjusted or "smoothed out", the network bandwidth would be utilized far more efficiently.
(McManus et al), (Salehi et al) and (Sen et al) developed techniques that address these bandwidth requirement issues for pre-recorded video. These researchers have, in broadly similar ways, suggested constant-rate transmission for pre-recorded video streams in order to reduce rate variability. To make constant-bit-rate transmission work, they exploited buffering at the receiving end while experimenting with the effects of tolerable playback delays that ranged from as little as 1 second to 1 minute.
(Rexford et al) furthered this work on stored video by extending the
constant-bit-rate concept and making it applicable to live video
streams. What (Rexford et al) did was to "smooth out"
the transmission rate as each group of frames, typically a GOP
(Group-of-Pictures; recall Chapter 3), became available from the
video encoder. Again, buffering capabilities were used, not only
at the receiving end but also at the source end. Smoothing out
the transmission required using a small moving window (interval)
over the source buffer where the rate is determined as the window
slides over the buffer.
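To give a feel for how a sliding window flattens the transmission rate, the sketch below averages frame sizes over a window that moves one frame at a time. It is a plain moving average that I use for illustration, not the algorithm of (Rexford et al); the frame sizes, window length and frame interval are made-up values.

```python
def smoothed_rates(frame_sizes, window, frame_interval_s):
    """Return one transmission rate per window position.

    frame_sizes:      sizes of the frames waiting in the source buffer
    window:           number of frames covered by the sliding window
    frame_interval_s: display time of one frame, in seconds
    The rate unit is (size unit) per second.
    """
    rates = []
    for start in range(len(frame_sizes) - window + 1):
        total = sum(frame_sizes[start:start + window])
        rates.append(total / (window * frame_interval_s))
    return rates


if __name__ == "__main__":
    sizes = [80, 10, 10, 40]  # e.g. kilobits for frames I, B, B, P
    # Sent frame by frame, the peak rate is 80 / 0.5 = 160 per second.
    print(smoothed_rates(sizes, window=2, frame_interval_s=0.5))
    # [90.0, 20.0, 50.0]: the peak drops from 160 to 90
```

A wider window flattens the rate further, but the frames inside the window must be buffered before they are sent, which adds delay at the source.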
When packets are transported over the Internet, the available bandwidth, packet loss and delay may vary over time. For live video applications using UDP (the Internet's best-effort service class), these variations are not known a priori since they depend on the behavior of other applications' connections throughout the Internet. In this case, network congestion may not be visible to the application. Applications that are unaware of congestion may continue to transmit large amounts of data and thus contribute even further to the congestion. Real-time video applications tend to fall into this category of "impolite" applications that consistently disregard the network's bandwidth capacity. TCP-based applications work differently. While applications running on TCP may not have a priori information about the state of the network traffic either, TCP provides a congestion control mechanism that estimates the available bandwidth and scales the packet transmission rate as needed.
In (Bolot et al, 1994) and (Bolot et al, 1998), the researchers
proposed an approach to control the rate at which video packets
are transmitted. Keep in mind that these are real-time packets
whose life span is finite and short, and that prompt delivery
is desirable. Hence, their approach was to adapt the transmission
rate to the network's available bandwidth. To control
the transmission rate, they proposed to explicitly readjust
the source output rate (the encoder's output rate at the sender
side) according to the network traffic. Under heavy congestion,
the application would transmit at a lower frame rate.
Under little or no congestion, the application would
increase its rate to the maximum allowed. Their proposed mechanism
has been implemented in their video-conferencing tool, IVS
(the INRIA VideoConferencing System).
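The general shape of such a feedback loop can be sketched as follows. The rule used here (halve the rate when the reported loss exceeds a threshold, otherwise add a fixed increment up to a ceiling) is an assumption chosen for illustration; the thresholds, step sizes and names are hypothetical and should not be read as the mechanism actually used in IVS.

```python
def adjust_rate(rate, loss_fraction, *, min_rate=10.0, max_rate=100.0,
                loss_threshold=0.05, decrease=0.5, increase=10.0):
    """Return the next source output rate given the latest loss report.

    rate:          current output rate (any consistent unit, e.g. kbps)
    loss_fraction: fraction of packets reported lost by the receiver
    All keyword defaults are illustrative values, not published constants.
    """
    if loss_fraction > loss_threshold:
        # Heavy congestion: back off multiplicatively, but not below the floor.
        return max(min_rate, rate * decrease)
    # Little or no congestion: probe upward, capped at the maximum allowed.
    return min(max_rate, rate + increase)


if __name__ == "__main__":
    rate = 80.0
    for loss in [0.20, 0.00, 0.00, 0.10]:
        rate = adjust_rate(rate, loss)
        print(rate)
```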
Another approach to recovering from losses, apart from retransmission, uses control mechanisms based on Forward Error Correction (FEC). In these mechanisms, redundant data is sent along with the original video packets in the hope that when some packets are lost, the original data can still be reconstructed from the redundant information already available to the receiver. This means that lost data is recovered without any additional end-to-end exchange between the sender and receiver, unlike in retransmission-based recovery mechanisms. This property makes FEC mechanisms very attractive for delay-sensitive applications.
A variety of FEC mechanisms have been proposed. The simpler mechanisms involve sending, after every k packets, a redundant packet obtained by exclusive-ORing the k packets before it. Such mechanisms can recover a single lost packet in every group of k packets transmitted. These mechanisms, however, increase the source transmission rate by a fraction 1/k (that is, by a factor of 1 + 1/k), and add latency since k of the k + 1 packets in a group (the redundant packet included) must be received before a lost packet within the group can be reconstructed. FEC mechanisms also carry a computational cost, which may be an undesirable overhead for the receiver.
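The exclusive-OR scheme can be demonstrated in a few lines. The sketch assumes that all packets in a group have the same length (a real implementation would pad shorter packets) and that at most one packet per group is lost; the function names are hypothetical.

```python
def xor_parity(packets):
    """Return the byte-wise XOR of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity):
    """Rebuild the single missing packet (marked None) in a group of k."""
    missing = [i for i, packet in enumerate(received) if packet is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one loss per group")
    present = [packet for packet in received if packet is not None]
    # XORing the surviving packets with the parity cancels them out,
    # leaving the lost packet.
    return xor_parity(present + [parity])


if __name__ == "__main__":
    group = [b"abcd", b"efgh", b"ijkl"]        # k = 3 original packets
    parity = xor_parity(group)                 # one redundant packet
    print(recover([b"abcd", None, b"ijkl"], parity))  # b'efgh'
```

With k = 3, one redundant packet is sent for every three originals, so the transmission rate grows by one third, and a second loss in the same group cannot be repaired.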