HISTORY & INTRODUCTION
HFSC - Hierarchical Fair Service Curve was first presented at SIGCOMM'97. Developed as a part of ALTQ (ALTernative Queuing) on NetBSD, found its way quickly to other BSD systems, and then a few years ago became part of the linux kernel. Still, it's not the most popular scheduling algorithm - especially if compared to HTB - and it's not well documented from enduser's perspective. This introduction aims to explain how HFSC works without going to deep into math side of things (although some if it will be inevitable). In short HFSC aims to:
-
- 1)
- guarantee precise bandwidth and delay allocation for all leaf classes (realtime criterion)
- 2)
- allocate excess bandwidth fairly as specified by class hierarchy (linkshare & upperlimit criterion)
- 3)
- minimize any discrepancy between the service curve and the actual amount of service provided during linksharing
Feature (2) is well, obvious - any algorithm featuring class hierarchy (such as HTB or CBQ) strives to achieve that. HFSC does that well, although you might end with unusual situations, if you define service curves carelessly - see section CORNER CASES for examples.
Feature (3) is mentioned due to the nature of the problem. There may be situations where it's either not possible to guarantee service of all curves at the same time, and/or it's impossible to do so fairly. Both will be explained later. Note that this is mainly related to interior (aka aggregate) classes, as the leafs are already handled by (1). Still - it's perfectly possible to create a leaf class w/o realtime service, and in such case - the caveats will naturally extend to leaf classes as well.
ABBREVIATIONS
For the remaining part of the document, we'll use following shortcuts:- RT - realtime LS - linkshare UL - upperlimit SC - service curve
BASICS OF HFSC
To understand how HFSC works, we must first introduce a service curve. Overall, it's a nondecreasing function of some time unit, returning amount of service (allowed or allocated amount of bandwidth) by some specific point in time. The purpose of it should be subconsciously obvious - if a class was allowed to transfer not less than the amount specified by its service curve - then service curve is not violated. Still - we need more elaborate criterion than just the above (although in most generic case it can be reduced to it). The criterion has to take two things into account:
-
- *
- idling periods
- *
- ability to "look back", so if during current active period service curve is violated, maybe it isn't if we count excess bandwidth received during earlier active period(s)
-
- (1)
- For each t1, there must exist t0 in set B, so S(t1-t0)~<=~w(t0,t1)
-
- a)
- our session was active during two periods, with a small time gap between them
- b)
- as in (a), but with a larger gap
If the gap is larger (b) - then it's less likely to happen (unless the excess bandwidth allocated during the 1st part was really large). Still, the larger the gap - the less interesting is what happened in the past (e.g. 10 minutes ago) - what matters is the current traffic that just started.
From HFSC's perspective, more interesting is answering the following question: when should we start transferring packets, so a service curve of a class is not violated. Or rephrasing it: How much X() amount of service should a session receive by time t, so the service curve is not violated. Function X() defined as below is the basic building block of HFSC, used in: eligible, deadline, virtual-time and fit-time curves. Of course, X() is based on equation (1) and is defined recursively:
-
- *
- At the 1st backlogged period beginning function X is initialized to generic service curve assigned to a class
- *
- At any subsequent backlogged period, X() is:
min(X() from previous period ; w(t0)+S(t-t0) for t>=t0),
... where t0 denotes the beginning of the current backlogged period.
The above generic X() can be one of the following:
-
- E()
- In realtime criterion, selects packets eligible for sending. If none are eligible, HFSC will use linkshare criterion. Eligible time 'et' is calculated with reference to packets' heads ( et~=~E^(-1)(w) ). It's based on RT service curve, but in case of a convex curve, uses its 2nd slope only.
- D()
- In realtime criterion, selects the most suitable packet from the ones chosen by E(). Deadline time 'dt' corresponds to packets' tails (dt~=~D^(-1)(w+l), where 'l' is packet's length). Based on RT service curve.
- V()
- In linkshare criterion, arbitrates which packet to send next. Note that V() is function of a virtual time - see LINKSHARE CRITERION section for details. Virtual time 'vt' corresponds to packets' heads (vt~=~V^(-1)(w)). Based on LS service curve.
- F()
- An extension to linkshare criterion, used to limit at which speed linkshare criterion is allowed to dequeue. Fit-time 'ft' corresponds to packets' heads as well (ft~=~F^(-1)(w)). Based on UL service curve.
REALTIME CRITERION
RT criterion ignores class hierarchy and guarantees precise bandwidth and delay allocation. We say that packet is eligible for sending, when current real time is bigger than eligible time. From all packets eligible, the one most suited for sending, is the one with the smallest deadline time. Sounds simply, but consider following example: Interface 10mbit, two classes, both with two-piece linear service curves:
-
- *
- 1st class - 2mbit for 100ms, then 7mbit (convex - 1st slope < 2nd slope)
- *
- 2nd class - 7mbit for 100ms, then 2mbit (concave - 1st slope > 2nd slope)
The effect of such E() - packets will be sent earlier, and at the same time D() will be updated - so current deadline time calculated from it will be bigger. Thus, when the 2nd class starts sending packets later, both the 1st and the 2nd class will be eligible, but the 2nd session's deadline time will be smaller and its packets will be sent first. When the 1st class becomes idle at some later point, the 2nd class will be able to "buffer" up again for later active period of the 1st class.
A short remark - in a situation, where the total amount of bandwidth available on the interface is bigger than the allocated total realtime parts (imagine interface 10 mbit, but 1mbit/2mbit and 2mbit/1mbit classes), the sole speed of the interface could suffice to guarantee the times.
Important part of RT criterion is that apart from updating its D() and E(), also V() used by LS criterion is updated. Generally the RT criterion is secondary to LS one, and used only if there's a risk of violating precise realtime requirements. Still, the "participation" in bandwidth distributed by LS criterion is there, so V() has to be updated along the way. LS criterion can than properly compensate for non-ideal fair sharing situation, caused by RT scheduling. If you use UL service curve its F() will be updated as well (UL service curve is an extension to LS one - see UPPERLIMIT CRITERION section).
Anyway - careless specification of LS and RT service curves can lead to potentially undesired situations (see CORNER CASES for examples). This wasn't the case in HFSC paper where LS and RT service curves couldn't be specified separately.
LINKSHARING CRITERION
LS criterion's task is to distribute bandwidth according to specified class hierarchy. Contrary to RT criterion, there're no comparisons between current real time and virtual time - the decision is based solely on direct comparison of virtual times of all active subclasses - the one with the smallest vt wins and gets scheduled. One immediate conclusion from this fact is that absolute values don't matter - only ratios between them (so for example, two children classes with simple linear 1mbit service curves will get the same treatment from LS criterion's perspective, as if they were 5mbit). The other conclusion is, that in perfectly fluid system with linear curves, all virtual times across whole class hierarchy would be equal. Why is VC defined in term of virtual time (and what is it) ?Imagine an example: class A with two children - A1 and A2, both with let's say 10mbit SCs. If A2 is idle, A1 receives all the bandwidth of A (and update its V() in the process). When A2 becomes active, A1's virtual time is already far bigger than A2's one. Considering the type of decision made by LS criterion, A1 would become idle for a lot of time. We can workaround this situation by adjusting virtual time of the class becoming active - we do that by getting such time "up to date". HFSC uses a mean of the smallest and the biggest virtual time of currently active children fit for sending. As it's not real time anymore (excluding trivial case of situation where all classes become active at the same time, and never become idle), it's called virtual time.
Such approach has its price though. The problem is analogous to what was presented in previous section and is caused by non-linearity of service curves:
- 1)
- either it's impossible to guarantee both service curves and satisfy fairness during certain time periods:
- Recall the example from RT section, slightly modified (with 3mbit slopes instead of 2mbit ones):
- *
- 1st class - 3mbit for 100ms, then 7mbit (convex - 1st slope < 2nd slope)
- *
- 2nd class - 7mbit for 100ms, then 3mbit (concave - 1st slope > 2nd slope)
- 2)
- and/or it's impossible to guarantee service curves of all classes at all
- Even if we didn't use virtual time and allowed a session to be "punished", there's a possibility that service curves of all classes couldn't be guaranteed for a brief period. Consider following, a bit more complicated example: Root interface, classes A and B with concave and convex curve (summing up to root), A1 & A2 (children of A), both with concave curves summing up to A, B1 & B2 (children of B), both with convex curves summing up to B.
Assume that A2, B1 and B2 are constantly backlogged, and at some later point A1 becomes backlogged. We can easily choose slopes, so that even if we "punish" A2 for earlier excess bandwidth received, A1 will have no chance of getting bandwidth corresponding to its first slope. Following from the above example:
A - 7mbit, then 3mbit A1 - 5mbit, then 2mbit A2 - 2mbit, then 1mbit B - 3mbit, then 7mbit B1 - 2mbit, then 5mbit B2 - 1mbit, then 2mbit
At the point when A1 starts sending, it should get 5mbit to not violate its service curve. A2 gets punished and doesn't send at all, B1 and B2 both keep sending at their 5mbit and 2mbit. But as you can see, we already are beyond interface's capacity - at 12mbit. A1 could get 3mbit at most. If we used virtual times and kept fairness property, A1 and A2 would send at 3mbit together with 5:2 ratio (so respectively at ~2.14mbit and ~0.86mbit).
UPPERLIMIT CRITERION
UL criterion is an extensions to LS one, that permits sending packets only if current real time is bigger than fit-time ('ft'). So the modified LS criterion becomes: choose the smallest virtual time from all active children, such that fit-time < current real time also holds. Fit-time is calculated from F(), which is based on UL service curve. As you can see, it's role is kinda similar to E() used in RT criterion. Also, for obvious reasons - you can't specify UL service curve without LS one. Main purpose of UL service curve is to limit HFSC to bandwidth available on the upstream router (think adsl home modem/router, and linux server as nat/firewall/etc. with 100mbit+ connection to mentioned modem/router). Typically, it's used to create a single class directly under root, setting linear UL service curve to available bandwidth - and then creating your class structure from that class downwards. Of course, you're free to add UL service (linear or not) curve to any class with LS criterion.Important part about UL service curve is, that whenever at some point in time a class doesn't qualify for linksharing due to its fit-time, the next time it does qualify, it will update its virtual time to the smallest virtual time of all active children fit for linksharing. This way, one of the main things LS criterion tries to achieve - equality of all virtual times across whole hierarchy - is preserved (in perfectly fluid system with only linear curves, all virtual times would be equal).
Without that, 'vt' would lag behind other virtual times, and could cause problems. Consider interface with capacity 10mbit, and following leaf classes (just in case you're skipping this text quickly - this example shows behavior that doesn't happen):
A - ls 5.0mbit B - ls 2.5mbit C - ls 2.5mbit, ul 2.5mbitIf B was idle, while A and C were constantly backlogged, they would normally (as far as LS criterion is concerned) divide bandwidth in 2:1 ratio. But due to UL service curve in place, C would get at most 2.5mbit, and A would get the remaining 7.5mbit. The longer the backlogged period, the more virtual times of A and C would drift apart. If B became backlogged at some later point in time, its virtual time would be set to (A's~vt~+~C's~vt)/2, thus blocking A from sending any traffic, until B's virtual time catches up with A.
SEPARATE LS / RT SCs
Another difference from original HFSC paper, is that RT and LS SCs can be specified separately. Moreover - leaf classes are allowed to have only either RT SC or LS SC. For interior classes, only LS SCs make sense - Any RT SC will be ignored.CORNER CASES
Separate service curves for LS and RT criteria can lead to certain traps, that come from "fighting" between ideal linksharing and enforced realtime guarantees. Those situations didn't exist in original HFSC paper, where specifying separate LS / RT service curves was not discussed. Consider interface with capacity 10mbit, with following leaf classes:A - ls 5.0mbit, rt 8mbit B - ls 2.5mbit C - ls 2.5mbitImagine A and C are constantly backlogged. As B is idle, A and C would divide bandwidth in 2:1 ratio, considering LS service curve (so in theory - 6.66 and 3.33). Alas RT criterion takes priority, so A will get 8mbit and LS will be able to compensate class C for only 2 mbit - this will cause discrepancy between virtual times of A and C.
Assume this situation lasts for a lot of time with no idle periods, and suddenly B becomes active. B's virtual time will be updated to (A's~vt~+~C's~vt)/2, effectively landing in the middle between A's and C's virtual time. The effect - B, having no RT guarantees, will be punished and will not be allowed to transfer until C's virtual time catches up.
If the interface had higher capacity - for example 100mbit, this example would behave perfectly fine though.
Let's look a bit closer at the above example - it "cleverly" invalidates one of the basic things LS criterion tries to achieve - equality of all virtual times across class hierarchy. Leaf classes without RT service curves are literally left to their own fate (governed by messed up virtual times).
Also - it doesn't make much sense. Class A will always be guaranteed up to 8mbit, and this is more than any absolute bandwidth that could happen from its LS criterion (excluding trivial case of only A being active). If the bandwidth taken by A is smaller than absolute value from LS criterion, the unused part will be automatically assigned to other active classes (as A has idling periods in such case). The only "advantage" is, that even in case of low bandwidth on average, bursts would be handled at the speed defined by RT criterion. Still, if extra speed is needed (e.g. due to latency), non linear service curves should be used in such case.
In the other words - LS criterion is meaningless in the above example.
You can quickly "workaround" it by making sure each leaf class has RT service curve assigned (thus guaranteeing all of them will get some bandwidth), but it doesn't make it any more valid.
LINUX AND TIMER RESOLUTION
In certain situations, the scheduler can throttle itself and setup so called watchdog to wakeup dequeue function at some time later. In case of HFSC it happens when for example no packet is eligible for scheduling, and UL service curve is used to limit the speed at which LS criterion is allowed to dequeue packets. It's called throttling, and accuracy of it is dependent on how the kernel is compiled. There're 3 important options in modern kernels, as far as timers' resolution goes: 'tickless system', 'high resolution timer support' and 'timer frequency'.If you have 'tickless system' enabled, then the timer interrupt will trigger as slowly as possible, but each time a scheduler throttles itself (or any other part of the kernel needs better accuracy), the rate will be increased as needed / possible. The ceiling is either 'timer frequency' if 'high resolution timer support' is not available or not compiled in. Otherwise it's hardware dependent and can go far beyond the highest 'timer frequency' setting available.
If 'tickless system' is not enabled, the timer will trigger at a fixed rate specified by 'timer frequency' - regardless if high resolution timers are or aren't available.
This is important to keep those settings in mind, as in scenario like: no tickless, no HR timers, frequency set to 100hz - throttling accuracy would be at 10ms. It doesn't automatically mean you would be limited to ~0.8mbit/s (assuming packets at ~1KB) - as long as your queues are prepared to cover for timer inaccuracy. Of course, in case of e.g. locally generated udp traffic - appropriate socket size is needed as well. Short example to make it more understandable (assume hardcore anti-schedule settings - HZ=100, no HR timers, no tickless):
tc qdisc add dev eth0 root handle 1:0 hfsc default 1 tc class add dev eth0 parent 1:0 classid 1:1 hfsc rt m2 10mbitAssuming packet of ~1KB size and HZ=100, that averages to ~0.8mbit - anything beyond it (e.g. the above example with specified rate over 10x bigger) will require appropriate queuing and cause bursts every ~10 ms. As you can imagine, any HFSC's RT guarantees will be seriously invalidated by that. Aforementioned example is mainly important if you deal with old hardware - as it's particularly popular for home server chores. Even then, you can easily set HZ=1000 and have very accurate scheduling for typical adsl speeds.
Anything modern (apic or even hpet msi based timers + 'tickless system') will provide enough accuracy for superb 1gbit scheduling. For example, on one of basically cheap dual core AMD boards I have with following settings:
tc qdisc add dev eth0 parent root handle 1:0 hfsc default 1 tc class add dev eth0 paretn 1:0 classid 1:1 hfsc rt m2 300mbitAnd simple:
nc -u dst.host.com 54321 </dev/zero nc -l -p 54321 >/dev/null...will yield following effects over period of ~10 seconds (taken from /proc/interrupts):
319: 42124229 0 HPET_MSI-edge hpet2 (before) 319: 42436214 0 HPET_MSI-edge hpet2 (after 10s.)That's roughly 31000/s. Now compare it with HZ=1000 setting. The obvious drawback of it is that cpu load can be rather extensive with servicing that many timer interrupts. Example with 300mbit RT service curve on 1gbit link is particularly ugly, as it requires a lot of throttling with minuscule delays.
Also note that it's just an example showing capability of current hardware. The above example (essentially 300mbit TBF emulator) is pointless on internal interface to begin with - you will pretty much always want regular LS service curve there, and in such scenario HFSC simply doesn't throttle at all.
300mbit RT service curve (selected columns from mpstat -P ALL 1):
10:56:43 PM CPU %sys %irq %soft %idle 10:56:44 PM all 20.10 6.53 34.67 37.19 10:56:44 PM 0 35.00 0.00 63.00 0.00 10:56:44 PM 1 4.95 12.87 6.93 73.27So, in rare case you need those speeds with only RT service curve, or with UL service curve - remember about drawbacks.
NAME
HFSC - Hierarchical Fair Service Curve's control under linuxSYNOPSIS
tc qdisc add ... hfsc [ default CLASSID ] tc class add ... hfsc [ [ rt SC ] [ ls SC ] | [ sc SC ] ] [ ul SC ] rt : realtime service curve ls : linkshare service curve sc : rt+ls service curve ul : upperlimit service curve * at least one of rt, ls or sc must be specified * ul can only be specified with ls or sc
- SC := [ [ m1 BPS ] d SEC ] m2 BPS
- m1 : slope of the first segment d : x-coordinate of intersection m2 : slope of the second segment
- SC := [ [ umax BYTE ] dmax SEC ] rate BPS
- umax : maximum unit of work dmax : maximum delay rate : rate
DESCRIPTION (qdisc)
HFSC qdisc has only one optional parameter - default. CLASSID specifies the minor part of the default classid, where packets not classified by other means (e.g. u32 filter, CLASSIFY target of iptables) will be enqueued. If default is not specified, unclassified packets will be dropped.DESCRIPTION (class)
HFSC class is used to create a class hierarchy for HFSC scheduler. For explanation of the algorithm, and the meaning behind rt, ls, sc and ul service curves - please refer to tc-hfsc(7). As you can see in SYNOPSIS, service curve (SC) can be specified in two ways. Either as maximum delay for certain amount of work, or as a bandwidth assigned for certain amount of time. Obviously, m1 is simply umax/dmax.Both m2 and rate are mandatory. If you omit other parameters, you will specify linear service curve.