Saturday, September 3, 2011

GRED queuing discipline


To explain why GRED was invented, who better to do it than its creators, Werner Almesberger, Jamal Hadi Salim (the C sources indicate that Jamal wrote this code) and Alexey Kuznetsov? Reading from their paper "Differentiated Services on Linux" [10], we have:
Finally, we need a queuing discipline to support multiple drop priorities as required for Assured Forwarding. For this, we designed GRED, a generalized RED. Besides four delay priorities, which can again be implemented with already existing components, Assured Forwarding also needs three drop priorities, which is more than the current implementation of RED supports. We therefore added a new queuing discipline which we call "generalized RED" (GRED). GRED uses the lower bits of skb->tc_index to select the drop class and hence the corresponding set of RED parameters.

http://opalsoft.net/qos/DS-27.htm
Let's go slowly because this part requires careful study. We saw in the previous section that RED has an inherent mechanism to mark packets randomly. Marking means notifying senders of incipient congestion, either by setting a mark on the packet or simply by dropping it. TCP with the ECN option enabled can respond to ECN-marked packets; TCP without ECN responds only to dropped packets. UDP and other unresponsive flows respond to nothing, and packet killing is the only way to control them.
 
Whether we mark or drop packets to notify senders about congestion, the RED queuing discipline has the advantage that it can do so randomly, using a probabilistic approach. Its original design goal was to be as fair as possible with incoming flows, picking packets at random in order to break up the clustering nature of TCP behavior, to control the queue length at routers, and to notify senders of congestion as soon as possible so that they can take corrective action.
 
But for DS we will not use RED this way; instead we want to take advantage of its probabilistic dropping mechanism to implement the AF classes. Each of these classes needs to support a three-level drop precedence, the difference between levels being the dropping probability. RED offers only one queue and one dropping probability, so our friends enhanced it to support 16 levels of dropping probability (because 16 is better than 3 for computers), simply by creating 16 virtual queues with independent parameter settings. That's GRED: 16 REDs in one, among other little tricks that we will see later.
 
At this point it would be very nice if a diagram came to rescue us from this long-winded speech; here we have one:
 
 
  

Okay, it looks like a classful queue. Each class has its own virtual queue, and there is some kind of filter to select the queue each packet is going to be placed in. Here the tricks begin. The designers extended the packet buffer on Linux to add a new field called tc_index. The packet buffer is accessed using a pointer called skb, so the field is accessed as skb->tc_index. The new field is an integer, and its 4 rightmost (least significant) bits are used to select the virtual queue (VQ) where a packet is going to be placed when the time comes.
The real process is tricky enough that for us, simple mortals, it is good enough to know that when a packet is in front of GRED, the 4 rightmost bits of its tc_index field are snooped and, according to their value (0, 1, 2, ... 15), one of the sixteen VQs is selected to receive the packet. It is very important to be clear that the tc_index field is not in the packet header; it lives in the memory buffer used to manage the packet within the Linux Traffic Control process.
Another queuing discipline, DSMARK, which we will see later, is in charge of copying the DS field of the packet header into the skb->tc_index field (after applying a bitwise operation to the DS field to convert it to a DSCP) and, just before the packet is dispatched to the interface's driver, of copying the skb->tc_index field back into the DS field of the packet header (after applying the inverse bitwise operation to convert the DSCP it contains back into a DS value). All this mischievous work is carried out by the DSMARK queuing discipline together with a mechanism called the tcindex classifier, as we will see, we hope, later on.
To continue, let's look briefly at the GRED parameters; as you might expect, they are very similar to what we saw before when talking about RED, but this time, GRED being a multi-RED queue, it is configured in two steps. First, the generic parameters are configured to select the number of VQs, the default VQ and the buffer sharing scheme. The tc generic configuration is as follows:
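Something like this (a sketch; the parameters in angle brackets are the values we have to supply):

 # tc qdisc add dev eth0 root gred setup DPs <VQs> default <def-VQ> [ grio ]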
Let's see this in detail. This command sets our GRED monster as the root queue of ethernet interface eth0. <VQs> is the number of virtual queues we want. If we were assembling an AF class (we need one GRED for each AF class to be implemented), the <VQs> parameter should be set to 3. Its value goes from 1 (a simple RED queue) to 16 (a full 16-way multi-RED queue). So 'setup DPs 3' is very nice for us.
Next, we have to select a default queue among our queues. This queue will be selected when the 4 rightmost skb->tc_index bits fall outside the range of VQs we selected. For example, suppose you select 3 virtual queues and a packet arrives with its bits set to 9; in this case the packet will be assigned to the default VQ. For our example the last VQ is good enough, so 'default 3' is okay.
Finally, we have an optional parameter, [grio]. Adding it we get what is called a 'Priority Buffer Scheme'. It is somewhat similar to what we saw before with the PRIO qdisc, but a little more complicated. Let's see how to explain this. If you set up your GRED queuing discipline with 'grio' you can assign a priority to each VQ. Priorities range from 1 to 16, with 1 being the highest. As you remember from reading about RED, the discipline's behavior depends on the calculated value of the average queue length; let's call this value qave, to use the same token used in the GRED C code. When qave is below the minimum threshold we are in the no-drop state. When qave is above the minimum threshold and below the maximum threshold we are in the probabilistic-drop state. When qave is above the maximum threshold we are in the force-drop state.
All this means that the qave value is crucial in defining the queuing behavior. With the minimum and maximum thresholds fixed, the larger qave is, the higher the packet dropping probability; so by playing with the qave value you can implement some kind of priority scheme. How did the GRED designers do this? Well, when dealing with a priority 1 VQ, the qave value is just its own value (the calculated value based on measurement). But when dealing with a priority 2 VQ, its qave value will be its own calculated value plus the qave value of the priority 1 VQ. The qave value of a priority 5 VQ will be its own value plus the qave values of all higher-priority VQs, in this case the VQs from priority 4 up to priority 1. For example, if the priority 1 VQ has a measured average of 2000 bytes and the priority 2 VQ has a measured average of 3000 bytes, the priority 2 VQ behaves as if its qave were 5000 bytes.
I don't like this scheme too much because it seems a little confusing to me. Lower-priority VQs will have a qave value that depends on the qave of higher-priority VQs. Of course, using this scheme we are creating a priority response: higher-priority VQs will be in a better position to dispatch a packet than lower ones. But, because thresholds are set individually, things get more complicated, and I feel that we lose control. Well, it's a matter of personal taste; I prefer the simple scheme where we can define each VQ's behavior based on its own parameters. Anyway, below is an example using the grio option.
GRED with 'grio' could be used to implement an in-profile and out-of-profile scheme within the same drop precedence, for example by using two VQs with the same drop probability and different priorities. The strength of the Differentiated Services architecture is that it establishes just a framework in which we, as players, have plenty of field to play whatever game we want.
Well, summarizing, our GRED generic command configuration could be:
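Sticking with the 3-VQ AF example from above (a sketch):

 # tc qdisc add dev eth0 root gred setup DPs 3 default 3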
for using the default buffer scheme, and:
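 # tc qdisc add dev eth0 root gred setup DPs 3 default 3 grio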
for using the priority buffer scheme.
The second step is to set the parameters of the individual virtual queues. In this case, tc is used as follows:
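Roughly like this (a sketch of the per-VQ syntax; the angle brackets mark the values we supply):

 # tc qdisc change dev eth0 root gred limit <bytes> min <bytes> max <bytes> avpkt <bytes> burst <packets> bandwidth <rate> probability <0..1> DP <vq> [ prio <number> ]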
As you see, the configuration is very similar to what we used before with RED. Now the word 'change' is used instead of 'add' because the queue already exists, created by the generic command; this time we are just 'changing' the virtual queue settings.
Compared with RED, the new parameters are: DP <vq>, where <vq> is the number, from 1 to 16, of the VQ we are configuring; and the optional parameter [ prio <number> ], where <number> is the priority assigned to this VQ. This parameter is to be used only when the 'grio' option was given in the generic configuration. Let's see an example with the 'grio' option:
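For instance (a sketch; the byte and probability values here are for illustration only, not taken from the design below):

 # tc qdisc change dev eth0 root gred limit 60KB min 15KB max 45KB avpkt 1000 burst 20 bandwidth 10Mbit probability 0.02 DP 1 prio 1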
To continue, we would like to configure one of these 'severe doormen' completely, as root on our ethernet interface eth0, using a priority scheme. We begin explaining our goal with a diagram:
This is a very simple intranet domain on a 10 Mbps ethernet network. On the left, the server network is implemented under the address space 192.168.1/24 and has five servers: SQL, WWW, FTP, DNS and MAIL. On the right, the user network is implemented under the address space 192.168.2/24 and is connected to the server network through a Linux router having ethernet interface eth1 on the server side and ethernet interface eth0 on the user side. We also have some VoIP traffic coming from the Internet into our user network through the Cisco router. This traffic has to be protected from other, stronger traffic when it reaches our network. Have a look at Voice over IP on this site for more information about these fragile flows.
Our goal is to implement a GRED queuing discipline on the Linux router's ethernet interface eth0. Our monster is going to be an eight-virtual-queue GRED where VQ1 will correspond to the VoIP traffic, VQs 2 to 6 will correspond to the DNS, SQL, WWW, FTP and MAIL servers respectively, VQ7 will correspond to UDP flows (other than the UDP generated by our DNS server) and VQ8 will be reserved as the default for the rest of the network flows. We want to prioritize traffic from priority 1 to 8 as follows: VoIP, DNS, SQL, WWW, FTP, MAIL, other UDP and the rest of the traffic.
We need the expected bandwidth distribution, maximum desired latency, estimated average packet length and packet dropping probability. Using the given customer requirements, and after some thought, we decided to use the following table:
VQ8 will be our default queue. N/A means that we will not apply a drop probability to UDP flows; you can find an explanation of this below.
For each individual flow we will have:
Maximum Threshold = (0.01 * Bandwidth Share * Desired Latency * Network Bandwidth) / (8 bits/byte * 1000 ms/sec)
Minimum Threshold = 1/2 * Maximum Threshold
Avpkt = Average Packet Length
Burst = (2 * MinThreshold + MaxThreshold) / (3 * Avpkt)
Limit = 4 * MaxThreshold

Network Bandwidth (10 Mbps) = 10 * 1024 * 1024 bits/sec = 10,485,760 bits/sec
For example, for the SQL flow we will have:

Maximum Threshold = (0.01 * 25 * 100 * 10,485,760) / (8 * 1000) = 32768 bytes
Minimum Threshold = 1/2 * 32768 = 16384 bytes
Avpkt = 1024
Burst = (2 * 16384 + 32768) / (3 * 1024) = 21.33, rounded up to 22 packets
Limit = 4 * 32768 = 131072 bytes
After some calculation we can fill our table completely:
 
 
The last column is the number of packets per queue when the minimum threshold is almost reached. This value is very important to check that the minimum threshold admits at least one entire packet (which means that each type of flow can have at least one packet enqueued in the no-drop state). The total buffer required is calculated as 475005 bytes.
 
Let's explain now why we set MaxThreshold = MinThreshold (really we cannot set them exactly equal because tc complains when we do; instead we set them very close) for the UDP flows, where drop probability does not matter (in fact we set it to 1 in the configuration script; see below).
 
When things go well and all flows stay near their minimum thresholds, no packet is dropped and everybody is happy. When going above the minimum threshold, packets begin to be dropped at random; but because UDP flows are unresponsive (they don't adjust their throughput when packets are being dropped) we don't want them to starve the TCP flows, which do respond and back off. So we set the maximum threshold equal to the minimum threshold for UDP flows to guarantee them just the minimum threshold occupancy, but no more: for UDP flows, every packet above the minimum threshold will be dropped. As you see, the dropping probability does not matter in this case because it is zero (0%) when queue occupancy is below the common minimum/maximum threshold and 1 (100%) when above.
 
  

Okay. We are ready to configure the ethernet interface eth0. Here we have the commands (don't forget that each one has to be typed on a single line):
Generic GRED setup
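A sketch of the generic setup, following the design above (8 VQs, VQ8 as default, grio priority scheme):

 # tc qdisc add dev eth0 root gred setup DPs 8 default 8 grio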
VoIP service
DNS server service
SQL server service
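As an illustration of the per-VQ commands, here is a sketch for the SQL virtual queue (DP 3, priority 3) using the values computed above; the probability value of 0.02 is assumed for illustration, since the original table is not reproduced here:

 # tc qdisc change dev eth0 root gred limit 131072 min 16384 max 32768 avpkt 1024 burst 22 bandwidth 10Mbit probability 0.02 DP 3 prio 3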
WWW server service
FTP server service
MAIL server service
OTHER-UDP services
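And a sketch for the other-UDP virtual queue (DP 7, priority 7), showing the min-almost-equal-max trick with probability 1 discussed above; min, max and limit are taken from the tc output below, while avpkt and burst are assumed for illustration:

 # tc qdisc change dev eth0 root gred limit 31457 min 3932 max 3933 avpkt 512 burst 8 bandwidth 10Mbit probability 1 DP 7 prio 7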
OTHER services
Checking our GRED queue using 'tc qdisc show' we have:

# tc -d -s qdisc show dev eth0
qdisc gred 8002:
 DP:1 (prio 1) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0)
limit 3146b min 393b max 394b ewma 2 Plog 1 Scell_log 5

DP:2 (prio 2) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0)
limit 5243b min 655b max 656b ewma 2 Plog 1 Scell_log 6

DP:3 (prio 3) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0)
limit 128Kb min 16Kb max 32Kb ewma 4 Plog 21 Scell_log 10

DP:4 (prio 4) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0)
limit 41943b min 5243b max 10486b ewma 4 Plog 20 Scell_log 9

DP:5 (prio 5) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0)
limit 104858b min 13107b max 26214b ewma 4 Plog 19 Scell_log 10

DP:6 (prio 6) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0)
limit 52429b min 6554b max 13107b ewma 3 Plog 18 Scell_log 9

DP:7 (prio 7) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 0 (bytes 0)
limit 31457b min 3932b max 3933b ewma 2 Plog 1 Scell_log 8

DP:8 (prio 8) Average Queue 0b Measured Queue 0b
Packet drops: 0 (forced 0 early 0)
Packet totals: 72 (bytes 7488)
limit 104858b min 13107b max 26214b ewma 4 Plog 19 Scell_log 10
Sent 7488 bytes 72 pkts (dropped 0, overlimits 0)
Well, the first half of the work is done; the first half, because we still have to implement some way of placing the packets of the different services into their specific virtual queues. We are going to do that on ethernet interface eth1 using what is known as an ingress queue. This queue, combined with the u32 classifier, will be good enough to reach our goal. But first, what is an ingress queue? Lartc [7] has the answer; reading from it:
All qdiscs discussed so far are egress qdiscs. Each interface however can also have an ingress qdisc which
is not used to send packets out to the network adaptor. Instead, it allows you to apply tc filters to packets
coming in over the interface, regardless of whether they have a local destination or are to be forwarded.
As the tc filters contain a full Token Bucket Filter implementation, and are also able to match on the
kernel flow estimator, there is a lot of functionality available. This effectively allows you to police
incoming traffic, before it even enters the IP stack.
The ingress qdisc itself does not require any parameters. It differs from other qdiscs in that it does not
occupy the root of a device. Attach it like this:
 # tc qdisc add dev eth0 ingress
This allows you to have other, sending, qdiscs on your device besides the ingress qdisc.
Very nice and clear. The simple command above creates an ingress queue on device eth0 and numbers it by default as ffff:. We can also number the queue explicitly; for example, to create an ingress queue on our Linux router's ethernet interface eth1 we just type:
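 # tc qdisc add dev eth1 handle ffff: ingress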
Identifying flows from our servers is very easy; we just use the source address as the selection criterion. This way the internally provided DNS, SQL, WWW, FTP and MAIL services are easily identified.
UDP packets are also easily identified. Our problem is that it is not so easy to properly tell VoIP packets from other UDP packets, so we are going to use a trick with the help of netfilter. It is not an infallible solution; some non-VoIP packets will be accepted by our not-so-clever rule. Well, I haven't found a better solution for this, so we will go ahead with what we have.
There is an iptables extension called the 'length patch', by James Morris; have a look at the Netfilter Extensions HOWTO [15] for more information. The extension adds a new match that allows you to match a packet based on its length. Because VoIP packets are very small, we will treat as VoIP all UDP packets whose length is between 40 and 160 bytes. Some non-VoIP packets are going to trick us but, as I told you before, if you have a better solution please let me know as soon as you can. Also, we will match only port numbers above 5000, which correspond to general applications (current practice within the Unix operating system, where port numbers below 1024 can only be used by privileged processes and port numbers between 1024 and 5000 are automatically assigned by the operating system). This way we can separate, for example, DNS UDP packets (port 53) from VoIP UDP packets.
We will mark VoIP packets (and some other small UDP impostors) entering eth1 using this iptables command:
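Something along these lines (a sketch of the marking rule; matching on the destination port is an assumption, and mark value 2 is the one used by the fw filter below):

 # iptables -t mangle -A PREROUTING -i eth1 -p udp --dport 5000:65535 -m length --length 40:160 -j MARK --set-mark 2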
To place packets into the different GRED VQs we will use the u32 classifier. Let's see an example of this monster:
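For instance (a sketch; the prio value here is chosen just for illustration):

 # tc filter add dev eth1 parent ffff: protocol ip prio 1 u32 match ip src 192.168.1.1/32 flowid :1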
This command sets to 1 the skb->tc_index of packets entering eth1 coming from host 192.168.1.1 (that is, having 192.168.1.1 as source address); flowid :1 is the instruction that makes this possible. Then, when being forwarded to our outgoing eth0 interface, those packets will be placed in GRED VQ number 1 (GRED uses the lower bits of skb->tc_index to select the VQ, as we read above). Great!!
What about VoIP packets? We will use a different approach but the same tool; let's see how:
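Roughly like this (a sketch):

 # tc filter add dev eth1 parent ffff: protocol ip prio 1 handle 2 fw flowid :4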
This time we are using the fwmark to select our packets. All packets having their fwmark set to 2 (handle 2 fw) will have their skb->tc_index set to 4 (flowid :4). And we can set the fwmark using the iptables command above (see command cm-30). Ha, ha... We have all those packets under control, haven't we? Using iptables we mark small UDP packets with fwmark 2, for example, and then with the fw filter we set their skb->tc_index to 4. This means that they will be put in GRED VQ number 4 when being forwarded to interface eth0. Great, again!!
And what about the other UDP packets? Very easy: UDP is protocol number 17, so we can use this u32 filter variation:
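For example (a sketch):

 # tc filter add dev eth1 parent ffff: protocol ip prio 2 u32 match ip protocol 17 0xff flowid :5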
This rule matches UDP packets, putting them in GRED VQ number 5. But observe that now the prio option is 2. Filters are evaluated following the prio option: prio 1 is treated first, then prio 2, and so on. So we put our small UDP packets (VoIP) on a prio 1 filter to place them in some GRED VQ; this filter will be treated first and all the small UDP packets will be placed in GRED VQ 1, for example. Having next another filter with a higher prio number, we can select the rest of the UDP packets, placing them in another GRED VQ. It's very nice, isn't it?
Okay, fellows. With all these explanations on board, here are the filter commands for each one of our services:
VoIP service
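A sketch, putting together the pieces above (fwmark 2 set by iptables, the fw filter, and VQ1 for VoIP):

 # tc filter add dev eth1 parent ffff: protocol ip prio 1 handle 2 fw flowid :1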
 
 
DNS server service
 
 
SQL server service
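As a second illustration, a sketch of a source-address filter for the SQL server (VQ3); the server address 192.168.1.2 is assumed here just for the example, and prio 2 is chosen so that the VoIP fw filter at prio 1 is evaluated first:

 # tc filter add dev eth1 parent ffff: protocol ip prio 2 u32 match ip src 192.168.1.2/32 flowid :3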
 
 
WWW server service
 
 
 
  

FTP server service
MAIL server service
OTHER-UDP services
The rest of the flows, not caught by any of our filters, will be placed in the default VQ, number 8.
Well, with this explanation we have finished with GRED. Our next target in our DS tour will be the HTB (Hierarchical Token Bucket) queuing discipline, a new qdisc from Martin Devera that was included in Linux beginning with kernel 2.4.20. I invite you to keep on reading in our next section.