Thursday, August 25, 2011

linux filters and queuing discipline

tc is the utility used to configure traffic control in the Linux kernel. Traffic control consists of the following:
1) Shaping: Shaping is done at the outgoing interface, and includes throttling the bandwidth and/or smoothing the traffic bursts of the outgoing flows.
2) Scheduling: Scheduling is also done at the outgoing interface and makes it possible to improve interactivity for traffic that needs it while still guaranteeing bandwidth to bulk transfers.
3) Policing: Policing is done at the ingress and is primarily used for throttling the rate at which flows may arrive. Dropping is a severe form of policing.
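Policing, for instance, is configured on the special ingress qdisc. A minimal sketch, assuming an interface eth0 and an illustrative 1mbit ceiling:

    # attach the ingress qdisc, then police all IP traffic to 1mbit,
    # dropping whatever exceeds the rate
    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol ip prio 1 u32 \
        match u32 0 0 police rate 1mbit burst 10k drop flowid :1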


Qdiscs (queuing disciplines) are the building blocks of the traffic control system. When a packet is to be sent out on an interface, the kernel enqueues it to the qdisc configured for that interface. The kernel then dequeues as many packets as possible from the qdisc and hands them to the network adapter driver.
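Qdiscs are attached, inspected, and removed with tc. A minimal sketch, assuming an interface named eth0:

    # show the qdisc currently attached to eth0, with statistics
    tc -s qdisc show dev eth0
    # remove the root qdisc; the interface falls back to the default (pfifo_fast)
    tc qdisc del dev eth0 root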

The qdiscs may be categorized into classless and classful qdiscs. A classful qdisc contains classes, which contain further qdiscs, which in turn may again have classes, and so on. A classful qdisc can be viewed as in Fig. 8.2:
Fig. 8.2 Classful discipline [25]
Classful qdiscs also need filters to classify incoming packets into the various classes within them. Here is a list of some of the qdiscs available under Linux:

Classless qdiscs:
Packet First In First Out (PFIFO): This is a simple FIFO qdisc; since its buffer size is specified in packets, it is known as PFIFO. It is the simplest of the qdiscs and may be used for EF (Expedited Forwarding) traffic, since it provides minimum latency.
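A minimal sketch, assuming eth0 and an illustrative queue length of 100 packets:

    # FIFO whose capacity is counted in packets, not bytes
    tc qdisc add dev eth0 root pfifo limit 100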

Token Bucket Filter (TBF): Despite the name, this is a qdisc rather than a filter; it is based on the token bucket/leaky bucket model and is used for slowing down an interface.
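A minimal sketch, assuming eth0 is to be slowed to an illustrative 1mbit:

    # rate: token rate; burst: bucket size; latency: longest time a packet may wait
    tc qdisc add dev eth0 root tbf rate 1mbit burst 10kb latency 50ms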

Random Early Detection (RED): RED calculates the average queue size using a low-pass filter with an exponentially weighted moving average. If the average queue size is below the minimum threshold, no packets are marked/dropped. If it is above the maximum threshold, every arriving packet is marked/dropped. If it lies between the minimum and maximum thresholds, packets are marked/dropped with a probability P that increases linearly with the average queue size and reaches a maximum of Pmax at the maximum threshold.
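The moving average mentioned above is the one from the original RED paper; on every packet arrival it is updated as

    avg = (1 - w_q) * avg + w_q * q

where q is the instantaneous queue size and w_q is the queue weight (tc does not take w_q directly; it derives it from the burst parameter).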
RED gateways perform better than drop-tail gateways, which are biased against short-term bursts; RED, on the other hand, is tolerant of short-term bursts. It also mitigates the global synchronization problem among TCP flows, which otherwise leads to large oscillations in the aggregate bitrate.
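A hedged sketch of a RED configuration; the interface eth0 and all numbers are illustrative assumptions (the thresholds apply to the average queue size, in bytes):

    # limit: hard queue limit; avpkt: assumed average packet size;
    # probability: Pmax; ecn: mark packets instead of dropping them
    tc qdisc add dev eth0 root red limit 400000 min 30000 max 90000 \
        avpkt 1000 burst 55 probability 0.02 bandwidth 10Mbit ecn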

Classful qdiscs:
Priority Queuing Discipline (PRIO): In PRIO, packets are first classified by filters and placed into different priority bands (three by default). Packets are taken from the head of a given band only if all bands of higher priority are empty. Within each band, packets are scheduled in FIFO order.
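A minimal sketch, assuming eth0; packets with the IP TOS "minimize delay" bit (0x10) are steered into the highest-priority band:

    # prio creates three FIFO bands by default: 1:1 (highest) to 1:3 (lowest)
    tc qdisc add dev eth0 root handle 1: prio
    # send low-delay traffic to band 1:1
    tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
        match ip tos 0x10 0xff flowid 1:1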

DSMARK qdisc: This qdisc is used for marking packets. With the set_tc_index option set, the TOS field from skb->iph->tos (the packet buffer is accessed through the skb pointer) is copied to the 16-bit skb->tc_index field. Next, the tcindex classifier reads skb->tc_index and performs the following operation: (skb->tc_index & mask) >> shift. mask and shift are parameters that can be passed to the tcindex classifier using tc. The 16-bit value returned by this operation is then written by the qdisc back to skb->tc_index. Using the DSMARK classes and appropriate filters, packets can be marked/re-marked. Complete understanding of the processes involved is only possible through an example, which will be discussed in the next section.
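A minimal sketch of the pieces described above, assuming eth0; mask 0xfc with shift 2 extracts the DSCP from the DS byte, and the re-marking value 0xb8 corresponds to the EF codepoint:

    # root dsmark qdisc; set_tc_index copies the DS byte into skb->tc_index
    tc qdisc add dev eth0 handle 1:0 root dsmark indices 64 set_tc_index
    # tcindex classifier computes (skb->tc_index & 0xfc) >> 2
    tc filter add dev eth0 parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2
    # on dequeue, class 1:1 re-marks: new_ds = (old_ds & 0x3) | 0xb8
    tc class change dev eth0 classid 1:1 dsmark mask 0x3 value 0xb8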

Generalized RED (GRED): This qdisc was specifically designed keeping in mind the need for implementing multiple drop precedences in the AF PHB. GRED can have a maximum of 16 RED virtual queues (VQs), each of which can be configured independently. GRED takes advantage of the probabilistic dropping mechanism in RED to implement the AF classes. The drop priority among the VQs is chosen using the four least significant bits (0-15 for 16 VQs) of the skb->tc_index field.
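A hedged sketch with three virtual queues, assuming eth0; the generic setup line must come first, after which each VQ is configured like an independent RED queue:

    # create 3 VQs selected by the low bits of skb->tc_index; DP 2 is the default
    tc qdisc add dev eth0 root gred setup DPs 3 default 2 grio
    # give VQ 1 (lowest drop precedence) its own RED parameters
    tc qdisc change dev eth0 root gred limit 60KB min 15KB max 45KB \
        burst 20 avpkt 1000 bandwidth 10Mbit DP 1 probability 0.02 prio 2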

Hierarchical Token Bucket (HTB): The term hierarchical implies that this qdisc allows hierarchical link sharing on an interface, as in Fig. 8.3:
Fig. 8.3 HTB tree
For each of the sub-classes a minimum (rate) and a maximum bandwidth (ceil) can be specified. rate is the bandwidth guaranteed to a class; the guarantee holds as long as the classes are not over-provisioned, i.e. the rates do not add up to more than the link capacity. Excess (unused) bandwidth is distributed to classes which request more bandwidth, up to their ceil.
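A minimal sketch of such a hierarchy, assuming eth0 with an illustrative 100mbit capacity:

    # root HTB; unclassified traffic falls into class 1:20
    tc qdisc add dev eth0 root handle 1: htb default 20
    tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit ceil 100mbit
    # children get 30mbit and 70mbit guaranteed, and may borrow up to the full link
    tc class add dev eth0 parent 1:1 classid 1:10 htb rate 30mbit ceil 100mbit
    tc class add dev eth0 parent 1:1 classid 1:20 htb rate 70mbit ceil 100mbit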

HTB is based on TBF and Deficit Round Robin (DRR). DRR is a modified version of Weighted Round Robin (WRR). In DRR, the scheduler associates with each connection a deficit counter, initialized to 0. The scheduler visits the connections in round-robin order; on each visit it adds one quantum to the connection's deficit counter and then serves packets from the head of the queue as long as the packet at the head is no larger than the deficit counter, subtracting the size of each served packet from the counter. If the head packet is larger than the counter, the connection is skipped and the accumulated deficit is carried over to the next round. For example, with a quantum of 500 bytes, a connection whose head-of-line packet is 800 bytes is skipped in the first round (deficit 500) and served in the second (deficit 1000, leaving 200). When a connection's queue goes empty, its deficit counter is reset to zero.

Qdisc and class identification: All qdiscs and classes have IDs, which consist of a major number and a minor number separated by a colon. A qdisc is assigned a major number called a handle (e.g. 10:), while the minor-number namespace is left for its classes. Classes residing under a qdisc share their qdisc's major number, but each class has its own minor number.
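For instance (eth0 and the rates are illustrative):

    # qdisc with handle 10:; its classes 10:1 and 10:2 share the major number 10
    tc qdisc add dev eth0 root handle 10: htb
    tc class add dev eth0 parent 10: classid 10:1 htb rate 1mbit
    tc class add dev eth0 parent 10: classid 10:2 htb rate 2mbit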