Saturday, September 3, 2011

RED queuing discipline


All the queuing disciplines we have seen so far (FIFO, PRIO, TBF, SFQ), and also HTB, which we will see later, base their queuing algorithm on the well-known FIFO queue. As you remember, in this queuing discipline packets enter the queue at the tail and leave it from the head. This approach, which might seem very simple and natural, poses a serious problem for flows based on TCP, the de-facto standard transport protocol.
Problems arise when queues overflow due to bursts of packets. At that moment no more packets can be admitted and they have to be dropped where they enter, that is, at the tail of the queue. That is why all queuing disciplines based on the FIFO queue are known as DropTail queues.


Mark Anthony Parris, in his excellent work "Class-Based Thresholds-Lightweight Active Router-Queue Management for Multimedia Networking" [12], addresses this subject in a masterly way. Let's read from his work:
 
First-in-first-out, drop-tail-when-full was the original queue management scheme used in Internet routers. With this scheme packets are enqueued at the tail of a queue as they arrive and dequeued from the head of the queue when there is capacity on the link. Drop-tail is the policy of dropping the arriving packet when the queue is full. (Other alternatives include dropping the packet at the head of the queue.) Drop-tail and FIFO are used interchangeably in this dissertation.
 
Braden, et al., point out several problems with simple drop-tail-on-full and recommend that Internet routers employ more sophisticated techniques for managing their queues [Braden98]. The two major problems they identify are the problems of lock-out and full-queues.
 
Lock-out refers to a phenomenon in which the shared resource, link bandwidth, is unfairly consumed exclusively by a small number of flows. The remaining flows are locked out of (i.e., denied access to) the queue and, consequently, locked out of the outbound link. In this phenomenon the queue is occupied only by the packets from a small number of flows while the packets associated with most flows are consistently discarded. As a result, most flows receive none of the link bandwidth, and starve.
 
This phenomenon occurs because of timing effects which result in some flows' packets always arriving to find the queue full. For example, consider a situation where many sources are periodically sending bursts of packets that in aggregate exceeds the queue's capacity. If these sources become synchronized, all sending nearly simultaneously, the first packets to arrive (e.g. from the source closer to the bottleneck link) will find a queue with some available capacity while the subsequent packets will be discarded. If the same relative order is maintained between the sources, those sources that send first will consistently make progress while the other flows will consistently have all packets discarded and, thus, starve.
 
  

Full-queues are queues that are usually occupied to capacity. If bursts of packets arrive to a full queue, many packets are dropped simultaneously. This can lead to large oscillations in the network utilization. If the dropped packets are from different flows there may be synchronized responses (back-off) among multiple flows. Synchronized back-off is a phenomenon in which many sources simultaneously receive congestion notification and reduce their generated load. As a result, the overall load on the network may drop below the capacity of the link and then rise back to exceed the link's capacity resulting in a full queue and once again leading to simultaneous drops. This oscillating behavior is exactly counter to the buffer's intended function, acting as a smoothing filter.
Many authors have studied these problems. In 1993, Sally Floyd and Van Jacobson presented their paper "Random Early Detection Gateways for Congestion Avoidance" [13], where the RED gateway was outlined. The idea behind RED is to give responsive flows (like TCP) feedback as soon as possible, before the queue overflows, indicating that congestion is imminent, instead of waiting until the congestion has become excessive. Also, packet drops are distributed more fairly across all flows. Let's see how Floyd & Jacobson [13] explain the guidelines for RED gateway design in their paper. Sorry, but this part is a little long; where necessary I will insert some minor comments:
This section summarizes some of the design goals and guidelines for RED gateways. The main goal is to provide congestion avoidance by controlling the average queue size. Additional goals include the avoidance of global synchronization and of a bias against bursty traffic and the ability to maintain an upper bound on the average queue size even in the absence of cooperation from transport-layer protocols.
Okay, let's explain this a little. They try to avoid congestion by controlling the average queue size at the gateway or router. The first problem when the network gets congested is that the router's queue overflows and begins to drop packets. By avoiding packet drops (being as conservative as possible), the global synchronization problem is minimized (the problem where several TCP connections reduce their throughput to a minimum at the same time, mainly because of a packet massacre at the router queue). Also, the RED gateway tries to be as lenient as possible with bursty traffic, again trying to keep packets from being dropped. Finally, by controlling the maximum average queue size, indirect control is exercised over those 'bad citizen' unresponsive flows, like UDP.
The first job of a congestion avoidance mechanism at the gateway is to detect incipient congestion. As defined in [18], a congestion avoidance scheme maintains the network in a region of low delay and high throughput. The average queue size should be kept low, while fluctuations in the actual queue size should be allowed to accommodate bursty traffic and transient congestion. Because the gateway can monitor the size of the queue over time, the gateway is the appropriate agent to detect incipient congestion. Because the gateway has a unified view of the various sources contributing to this congestion, the gateway is also the appropriate agent to decide which sources to notify of this congestion.
In a network with connections with a range of roundtrip times, throughput requirements, and delay sensitivities, the gateway is the most appropriate agent to determine the size and duration of short-lived bursts in queue size to be accommodated by the gateway. The gateway can do this by controlling the time constants used by the low-pass filter for computing the average queue size. The goal of the gateway is to detect incipient congestion that has persisted for a "long time" (several roundtrip times).
Again, a little explanation. A low-pass filter is a mechanism to smooth the erratic behavior of the queue length over time. Have a look at figure 2.6.2 below, where two curves are depicted: the current queue size and the average queue size (the latter obtained by applying a low-pass filter to the current queue size). They use what is called an Exponentially Weighted Moving Average (EWMA); something simple but great, invented by some genius mathematician (I don't know who) a long time ago.
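For the record, the filter Floyd & Jacobson use is just this recurrence, where q is the instantaneous queue length and wq is the queue weight (0.002 in the simulations quoted further below):
 avg = (1 - wq) * avg + wq * q
The smaller wq is, the more slowly avg reacts to changes in the real queue, which is exactly why the average curve in figure 2.6.2 lags behind the real one.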
The second job of a congestion avoidance gateway is to decide which connections to notify of congestion at the gateway. If congestion is detected before the gateway buffer is full, it is not necessary for the gateway to drop packets to notify sources of congestion. In this paper, we say that the gateway marks a packet, and notifies the source to reduce the window for that connection. This marking and notification can consist of dropping a packet, setting a bit in a packet header, or some other method understood by the transport protocol. The current feedback mechanism in TCP/IP networks is for the gateway to drop packets, and the simulations of RED gateways in this paper use this approach.
One goal is to avoid a bias against bursty traffic. Networks contain connections with a range of burstiness, and gateways such as Drop Tail and Random Drop gateways have a bias against bursty traffic. With Drop Tail gateways, the more bursty the traffic from a particular connection, the more likely it is that the gateway queue will overflow when packets from that connection arrive at the gateway [7].
Another goal in deciding which connections to notify of congestion is to avoid the global synchronization that results from notifying all connections to reduce their windows at the same time. Global synchronization has been studied in networks with Drop Tail gateways [37], and results in loss of throughput in the network. Synchronization as a general network phenomena has been explored in [8].
In order to avoid problems such as biases against bursty traffic and global synchronization, congestion avoidance gateways can use distinct algorithms for congestion detection and for deciding which connections to notify of this congestion. The RED gateway uses randomization in choosing which arriving packets to mark; with this method, the probability of marking a packet from a particular connection is roughly proportional to that connection's share of the bandwidth through the gateway. This method can be efficiently implemented without maintaining per-connection state at the gateway.
One goal for a congestion avoidance gateway is the ability to control the average queue size even in the absence of cooperating sources. This can be done if the gateway drops arriving packets when the average queue size exceeds some maximum threshold (rather than setting a bit in the packet header). This method could be used to control the average queue size even if most connections last less than a roundtrip time (as could occur with modified transport protocols in increasingly high-speed networks), and even if connections fail to reduce their throughput in response to marked or dropped packets.
Okay. These technology giants, Floyd & Jacobson, explain the problem and outline how to approach its solution. The solution they propose is overwhelming. Reading again from Floyd & Jacobson [13]:
This section describes the algorithm for RED gateways. The RED gateway calculates the average queue size, using a low-pass filter with an exponential weighted moving average. The average queue size is compared to two thresholds, a minimum threshold and a maximum threshold. When the average queue size is less than the minimum threshold, no packets are marked. When the average queue size is greater than the maximum threshold, every arriving packet is marked. If marked packets are in fact dropped, or if all source nodes are cooperative, this ensures that the average queue size does not significantly exceed the maximum threshold.
When the average queue size is between the minimum and the maximum threshold, each arriving packet is marked with probability pa, where pa is a function of the average queue size avg. Each time that a packet is marked, the probability that a packet is marked from a particular connection is roughly proportional to that connection's share of the bandwidth at the gateway. The general RED gateway algorithm is given in Figure 2.6.1.
Thus the RED gateway has two separate algorithms. The algorithm for computing the average queue size determines the degree of burstiness that will be allowed in the gateway queue. The algorithm for calculating the packet-marking probability determines how frequently the gateway marks packets, given the current level of congestion. The goal is for the gateway to mark packets at fairly evenly-spaced intervals, in order to avoid biases and to avoid global synchronization, and to mark packets sufficiently frequently to control the average queue size.
These guys are not joking. The RED queuing discipline is a very ingenious algorithm. To understand this even better, we are going to present a figure created with the ns-2 network simulator, taken from NS by Example [14] and then retouched by me; the simulation deals with a link where a RED, instead of a DropTail, queuing discipline is used. The figure represents the current queue length and the average queue length against time for the RED gateway used in the link. It can help us a lot. Have a look at figure 2.6.2 below.
The red curve is the current queue size of the router; the green curve is the average queue size calculated using the EWMA low-pass filter. The minimum threshold is 5 packets and the maximum threshold is 12 packets. Observe that the current queue has a very severe burst of packets in the interval between 3 and 4.2 seconds, more or less. In fact, the maximum threshold of 12 packets is violated. But the smoothed average queue size calculated using the EWMA low-pass filter is a lot slower: it follows the current queue length, but as a turtle follows a rabbit. The 'drop probability' is selected based on the average queue size, not the current queue size. But let's let Parris [12] explain this better for us; I made some minor changes to adapt his explanation to our figure:
RED's packet dropping decisions are mode-based. Figure 2.6.2 illustrates the ideas behind the RED algorithm. This figure shows the instantaneous (red color) and weighted average (green color) queue size (in packets) over time.
 
The current mode, indicated on the right hand side, is determined by the relation between the average queue size and the minimum and maximum thresholds. When the average queue size is less than the minimum threshold this indicates no congestion, so no action is taken. This is no drop mode (yellow band on the figure) and the probability that an arriving packet will be dropped is zero. In this mode arriving packets are always enqueued.
 
At the other extreme, when the average queue size is greater than the maximum threshold, or if the queue is full, the algorithm infers persistent and severe congestion. All arriving packets are dropped. The probability an arriving packet will be dropped is one. This mode is referred to as forced drop mode (red band on the figure).
 
Finally, when the average queue size is between the two thresholds the algorithm operates in a probabilistic (i.e., random) drop mode. In this mode, arriving packets are dropped at random. The probability that a given packet will be dropped ranges between zero and one as a function of three parameters: maxp, the current average queue size avg, and count. The input parameter, maxp, determines the maximum probability that two consecutive packets will be dropped while in probabilistic drop mode. The variable, count, tracks how many packets have been enqueued since the last packet was dropped.
 
Very nice. Figure 2.6.1 above shows how the RED algorithm uses the three parameters Parris talked about to implement the queuing discipline's behavior. Observe also that even though the current queue size breaks the 12-packet maximum threshold when the burst of packets arrives between seconds 3 and 4.2, approximately, the gateway never enters forced drop mode, because the average queue size stays below the maximum threshold. This way, the RED gateway manages bursty traffic intelligently while still controlling the average queue size when congestion becomes persistent.
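For reference, this is how those three parameters combine into the packet-marking probability of figure 2.6.1, as given in [13]. First a temporary probability pb is computed from the average queue size:
 pb = maxp * (avg - minth) / (maxth - minth)
and then the actual marking probability grows with count, the number of packets enqueued since the last marked packet:
 pa = pb / (1 - count * pb)
So the probability increases as the average queue size approaches the maximum threshold, and also as more packets go by without a drop; this spreads the marks at fairly evenly-spaced intervals instead of in clusters.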
 
In their paper, Floyd & Jacobson [13] showed that network power is improved by using RED gateways. Reading again from them:
 
Because RED gateways can control the average queue size while accommodating transient congestion, RED gateways are well-suited to provide high throughput and low average delay in high-speed networks with TCP connections that have large windows. The RED gateway can accommodate the short burst in the queue required by TCP's slow-start phase; thus RED gateways control the average queue size while still allowing TCP connections to smoothly open their windows. Figure 2.6.3 shows the results of simulations of the network below with two TCP connections, each with a maximum window of 240 packets, roughly equal to the delay-bandwidth product. The two connections are started at slightly different times. The simulations compare the performance of Drop Tail and of RED gateways.
 
  

In the figure the x-axis shows the total throughput as a fraction of the maximum possible throughput on the congested link. The y-axis shows the average queue size in packets (as seen by arriving packets). Our simulator does not use the 4.3-Tahoe TCP code directly but we believe it is functionally identical.
Five 5-second simulations were run for each of 11 sets of parameters for Drop Tail gateways, and for 11 sets of parameters for RED gateways; each mark in the figure shows the result of one of these five-second simulations. The simulations with Drop Tail gateways were run with the buffer size ranging from 15 to 140 packets; as the buffer size is increased, the throughput and the average queue size increase correspondingly.
In order to avoid phase effects in the simulations with Drop Tail gateways, the source node takes a random time drawn from the uniform distribution on [0, t] seconds to prepare an FTP packet for transmission, where t is the bottleneck service time of 0.17 ms. [7].
The simulations with RED gateways were all run with a buffer size of 100 packets, with minth ranging from 3 to 50 packets. For the RED gateways, maxth is set to 3 * minth, with wq = 0.002 and maxp = 1/50.
The dashed lines show the average delay (as a function of throughput) approximated by 1.73/(1 - x) for the simulations with RED gateways, and approximated by 0.1/(1 - x)^3 for the simulations with Drop Tail gateways. For this simple network with TCP connections with large windows, the network power (the ratio of throughput to delay) is higher with RED gateways than with Drop Tail gateways. There are several reasons for this difference. With Drop Tail gateways with a small maximum queue, the queue drops packets while the TCP connection is in the slow-start phase of rapidly increasing its window, reducing throughput.
On the other hand, with Drop Tail gateways with a large maximum queue the average delay is unacceptably large. In addition, Drop Tail gateways are more likely to drop packets from both connections at the same time, resulting in global synchronization and a further loss of throughput.
Okay, having the theoretical aspects of RED more or less under control, let's see now how this queuing discipline is implemented on Linux. Reading from the RED man page, we have:
Random Early Detection is a classless qdisc which limits its queue size smartly. Regular queues simply drop packets from the tail when they are full, which may not be the optimal behaviour. RED also performs tail drop, but does so in a more gradual way.
Once the queue hits a certain average length, packets enqueued have a configurable chance of being marked (which may mean dropped). This chance increases linearly up to a point called the max average queue length, although the queue might get bigger.
This has a host of benefits over simple taildrop, while not being processor intensive. It prevents synchronous retransmits after a burst in traffic, which cause further retransmits, etc.
The goal is to have a small queue size, which is good for interactivity while not disturbing TCP/IP traffic with too many sudden drops after a burst of traffic.
Depending on whether ECN is configured, marking either means dropping or purely marking a packet as overlimit.
The average queue size is used for determining the marking probability. This is calculated using an Exponential Weighted Moving Average, which can be more or less sensitive to bursts.
When the average queue size is below min bytes, no packet will ever be marked. When it exceeds min, the probability of doing so climbs linearly up to probability, until the average queue size hits max bytes. Because probability is normally not set to 100%, the queue size might conceivably rise above max bytes, so the limit parameter is provided to set a hard maximum for the size of the queue.
Okay. Bert Hubert's explanation just rounds out and confirms our understanding of this discipline. Next, the parameters are explained:
min
 Average queue size at which marking becomes a possibility.
max
 At this average queue size, the marking probability is maximal. Should be at least twice min
to prevent synchronous retransmits, higher for low min.
probability
 Maximum probability for marking, specified as a floating point number from 0.0 to 1.0.
Suggested values are 0.01 or 0.02 (1 or 2%, respectively).
limit
 Hard limit on the real (not average) queue size in bytes. Further packets are dropped.
Should be set higher than max+burst. It is advised to set this a few times higher than max.
burst
 Used for determining how fast the average queue size is influenced by the real queue size.
Larger values make the calculation more sluggish, allowing longer bursts of traffic
before marking starts. Real life experiments support the following guideline:
(min+min+max)/(3*avpkt).
avpkt
 Specified in bytes. Used with burst to determine the time constant for average queue size
calculations. 1000 is a good value.
bandwidth
 This rate is used for calculating the average queue size after some idle time. Should be set
to the bandwidth of your interface. Does not mean that RED will shape for you! Optional.
ecn
 As mentioned before, RED can either 'mark' or 'drop'. Explicit Congestion Notification
allows RED to notify remote hosts that their rate exceeds the amount of bandwidth
available. Non-ECN capable hosts can only be notified by dropping a packet. If this
parameter is specified, packets which indicate that their hosts honor ECN will only be
marked and not dropped, unless the queue size hits limit bytes. Needs a tc binary with RED
support compiled in. Recommended.
And finally, to finish this long part, we have to learn how to configure one of these RED beasts using tc. First, the general form of the command:
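It goes something like this (here eth0 is just a placeholder for the outbound interface):
 tc qdisc add dev eth0 root red limit <limit> min <min> max <max> avpkt <avpkt> burst <burst> probability <probability> bandwidth <bandwidth> [ecn]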
The values in angle brackets are the ones we have to supply. Let's see how:
<max> is the maximum threshold. To set this value we use the link bandwidth and the maximum desired latency. Assuming a bandwidth of 512 kbps and a maximum desired latency of 500 ms we have:
 512 kbps ~ 512000 bps = 64000 bytes / sec
<max> = 64000 bytes / sec * 0.5 sec = 32000 bytes
Above the <max> threshold we will have a packet massacre and latency doesn't matter anymore.
<min> is the minimum threshold. The man page tells us that <max> should be at least twice <min>; following Floyd & Jacobson, who set <max> ~ 3 * <min>, we will set <min> to 12000 bytes.
<limit> is the hard limit on the real queue size. This limit should never be reached. Following the LARTC [7] recommendation, let's set it to 8 * <max>; then <limit> will be 256000 bytes.
<avpkt> is the average packet size. We set this to 1000 bytes.
<burst> allows longer bursts of traffic before marking (dropping) starts, just to accommodate bursty flows. Following the RED man page recommendation we have:
 <burst> = (2 * <min> + <max>) / (3 * <avpkt>)
<burst> = (2 * 12000 + 32000) / (3 * 1000) = 18.67; we round up and use 20.
<probability> is the maximum marking probability; using the same value as Floyd & Jacobson, we set this to 0.02.
<bandwidth> is the link bandwidth, used for calculating the average queue size after some idle time; in our case, bandwidth will be 512.
[ecn] is an optional parameter. If our end TCP systems are configured to respond to Explicit Congestion Notification, you can use this flag to avoid dropping packets when the average queue size is above the <min> threshold and below the <max> threshold. Above the <max> threshold all packets are dropped; this may happen when dealing with unresponsive flows.
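As a side note, on Linux end systems you can check (and, if you like, enable) TCP ECN support through the net.ipv4.tcp_ecn sysctl; the exact meaning of its values varies a bit between kernel versions:
 sysctl net.ipv4.tcp_ecn
 sysctl -w net.ipv4.tcp_ecn=1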
 
Well, we have all the parameters worked out; then, using tc, we order:
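(something along these lines, assuming the outbound interface is eth0)
 tc qdisc add dev eth0 root red limit 256000 min 12000 max 32000 avpkt 1000 burst 20 probability 0.02 bandwidth 512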
 
 
or,
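(the same command with the ecn flag appended)
 tc qdisc add dev eth0 root red limit 256000 min 12000 max 32000 avpkt 1000 burst 20 probability 0.02 bandwidth 512 ecn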
 
 
if we have ECN-responsive end systems.
 
A final word about RED. When deploying it to do the work it was conceived for, that is, to give senders feedback about incipient congestion as soon as possible, configure it as a root queue. It should be the queue facing the sender. Don't configure it behind a real killer such as CBQ, for instance, because then you are spoiling RED's nature.
 
And RED's nature is very interesting for implementing some other behaviors besides incipient congestion notification. Because it can drop packets at random based on a configurable probability parameter, it is tailor-made for implementing the Assured Forwarding behavior. Do you remember? Same class, different drop precedences. Drop precedences can be implemented using RED. That was Werner Almesberger's idea when they (Almesberger, Kuznetsov and Salim) created GRED, our next queuing discipline, of course.

http://opalsoft.net/qos/DS-26.htm