6. Classless Queuing Disciplines (qdiscs)

Each of these queuing disciplines can be used as the primary qdisc on an interface, or can be used inside a leaf class of a classful qdiscs. These are the fundamental schedulers used under Linux. Note that the default scheduler is the pfifo_fast.

6.1. FIFO, First-In First-Out (pfifo and bfifo)

Note

Although a FIFO is one of the simplest elements of a queueing system, neither pfifo nor bfifo is the default qdisc on Linux interfaces. Be certain to see Section 6.2, “pfifo_fast, the default Linux qdisc” for the full details on the default (pfifo_fast) qdisc.

6.1.1. pfifo, bfifo Algorithm

The FIFO algorithm forms the underlying basis for the default qdisc on all Linux network interfaces (pfifo_fast). It performs no shaping or rearranging of packets. It simply transmits packets as soon as it can after receiving and queuing them. This is also the qdisc used inside all newly created classes until another qdisc or a class replaces the FIFO.

Figure 7: FIFO qdisc

A list of packets is maintained, when a packet is enqueued it gets inserted at the tail of a list. When a packet needs to be sent out to the network, it is taken from the head of the list.

6.1.2. limit parameter

A real FIFO qdisc must, however, have a size limit (a buffer size) to prevent it from overflowing in case it is unable to dequeue packets as quickly as it receives them. Linux implements two basic FIFO qdiscs, one based on bytes, and one on packets. Regardless of the type of FIFO used, the size of the queue is defined by the parameter limit. For a pfifo (Packet limited First In, First Out queue) the unit is understood to be packets and for a bfifo (Byte limited First In, First Out queue) the unit is understood to be bytes.

For pfifo, defaults to the interface txqueuelen , as specified with ifconfig or ip. The range for this parameter is [0, UINT32_MAX].

For bfifo, it defaults to the txqueuelen multiplied by the interface MTU. The range for this parameter is [0, UINT32_MAX] bytes. Note: The link layer header was considered when counting packet length.

Example 6. Specifying a limit for a packet or byte FIFO

[root@leander]# cat bfifo.tcc
/*
 * make a FIFO on eth0 with 10kbyte queue size
 *
 */

dev eth0 {
    egress {
        fifo (limit 10kB );
    }
}
[root@leander]# tcc < bfifo.tcc
# ================================ Device eth0 ================================

tc qdisc add dev eth0 handle 1:0 root dsmark indices 1 default_index 0
tc qdisc add dev eth0 handle 2:0 parent 1:0 bfifo limit 10240
[root@leander]# cat pfifo.tcc
/*
 * make a FIFO on eth0 with 30 packet queue size
 *
 */

dev eth0 {
    egress {
        fifo (limit 30p );
    }
}
[root@leander]# tcc < pfifo.tcc
# ================================ Device eth0 ================================

tc qdisc add dev eth0 handle 1:0 root dsmark indices 1 default_index 0
tc qdisc add dev eth0 handle 2:0 parent 1:0 pfifo limit 30
      

6.1.3. tc –s qdisc ls

The output of tc -s qdisc ls contains the limit, either in packets or in bytes, and the number of bytes and packets actually sent. An unsent and dropped packet only appears between braces and is not counted as 'Sent'.

In this example, the queue length is 100 packets, 45894 bytes were sentover 681 packets. No packets were dropped, and as the pfifo queue does not slow down packets, there were also no overlimits:

$ tc -s qdisc ls dev eth0

    qdisc pfifo 8001: dev eth0 limit 100p
    Sent 45894 bytes 681 pkts (dropped 0, overlimits 0)
     

If a backlog occurs, this is displayed as well.

pfifo and bfifo, like all non-default qdiscs, maintain statistics. This might be a reason to prefer that over the default.

6.2. pfifo_fast, the default Linux qdisc

The pfifo_fast (three-band first in, first out queue) qdisc is the default qdisc for all interfaces under Linux. Whenever an interface is created, the pfifo_fast qdisc is automatically used as a queue. If another qdisc is attached, it preempts the default pfifo_fast, which automatically returns to function when an existing qdisc is detached.

Figure 8: pfifo_fast qdisc

6.2.1. pfifo_fast algorithm

Based on a conventional FIFO qdisc, this qdisc also provides some prioritization. It provides three different bands (individual FIFOs) for separating traffic. The highest priority traffic (interactive flows) are placed into band 0 and are always serviced first. Similarly, band 1 is always emptied of pending packets before band 2 is dequeued.

The algorithm is very similar to that of the classful prio qdisc. The pfifo_fast qdisc is like three pfifo queues side by side, where an individual packet will be enqueued in one of the three FIFOs based on its Type of Service (ToS) bits. Not all three bands are dequeued simultaneously - as long as lower-numbered (higher priority) bands have traffic, higher-numbered bands are never dequeued. This can be used to prioritize interactive traffic or penalize 'lowest cost' traffic. Each band can be txqueuelen packets long, as configured with ifconfig or ip. Any additional packets arriving, once a specific band is full, are not enqueued but are instead dropped.

See chapter 6.2.3 for complete details on how ToS bits are translated into bands.

6.2.2. txqueuelen parameter

The length of the three bands depends on the interface txqueuelen, as specified with ifconfig or ip.

6.2.3. Bugs

There is nothing configurable to the end user about the pfifo_fast qdisc. This is how it is configured by default:

priomap determines how packet priorities, as assigned by the kernel, map to bands. Mapping occurs based on the ToS octet of the packet, which looks like this:

The four ToS bits (the 'ToS field') are defined as:

As there is 1 bit to the right of these four bits, the actual value of the ToS field is double the value of the ToS bits. Running tcpdump -v -v shows you the value of the entire ToS field, not just the four bits. It is the value you see in the first column of this table:

Lots of numbers. The second column contains the value of the relevant four ToS bits, followed by their translated meaning. For example, 15 stands for a packet wanting Minimal Monetary Cost, Maximum Reliability, Maximum Throughput AND Minimum Delay.

The fourth column lists the way the Linux kernel interprets the ToS bits, by showing to which Priority they are mapped.

The last column shows the result of the default priomap. On the command line, the default priomap looks like this:

1, 2, 2, 2, 1, 2, 0, 0 , 1, 1, 1, 1, 1, 1, 1, 1

This means that priority 4, for example, gets mapped to band number 1. The priomap also allows you to list higher priorities (> 7) which do not correspond to ToS mappings, but which are set by other means.

Moreover, differently from other non standard qdisc, pfifo_fast does not maintain statistics and does not show up in tc qdisc ls. This is because it is the automatic default in the absence of a configured qdisc.

6.3. SFQ, Stochastic Fair Queuing

Stochastic Fairness Queueing is a classless queueing discipline available for traffic control with the tc command. SFQ does not shape traffic but only schedules the transmission of packets, based on 'flows'. The goal is to ensure fairness so that each flow is able to send data in turn, thus preventing any single flow from drowning out the rest. It accomplishes this by using a hash function to separate the traffic into separate (internally maintained) FIFOs which are dequeued in a round-robin fashion. Because there is the possibility for unfairness to manifest in the choice of hash function, this function is altered periodically (see 6.3.1 algorithm). Perturbation (the parameter perturb) sets this periodicity (see 6.3.3 parameters). This may in fact have some effect in mitigating a Denial of Service attempt. SFQ is work-conserving and therefore always delivers a packet if it has one available.

So, the SFQ qdisc attempts to fairly distribute opportunity to transmit data to the network among an arbitrary number of flows.

Figure 9: Stochastic Fair Queuing (SFQ)

6.3.1. SFQ Algorithm

On enqueueing, each packet is assigned to a hash bucket, based on the packets hash value. This hash value is either obtained from an external flow classifier (use tc filter to set them), or a default internal classifier if no external classifier has been configured. When the internal classifier is used, sfq uses

  • Source address;

  • Destination address;

  • Source and Destination port;

if these are available.

SFQ knows about ipv4 and ipv6 and also UDP, TCP and ESP. Packets with other protocols are hashed based on the 32bits representation of their destination and source. A flow corresponds mostly to a TCP/IP connection.

Each of these buckets should represent a unique flow. Because multiple flows may get hashed to the same bucket, sfqs internal hashing algorithm may be perturbed at configurable intervals so that the unfairness lasts only for a short while. Perturbation may however cause some inadvertent packet reordering to occur. After linux-3.3, there is no packet reordering problem, but possible packet drops if rehashing hits one limit (number of flows or packets per flow)

When dequeuing, each hashbucket with data is queried in a round robin fashion.

Before linux-3.3, the compile time maximum length of the SFQ is 128 packets, which can be spread over at most 128 buckets of 1024 available. In case of overflow, tail-drop is performed on the fullest bucket, thus maintaining fairness.

After linux-3.3, maximum length of SFQ is 65535 packets, and divisor limit is 65536. In case of overflow, tail-drop is performed on the fullest bucket, unless headdrop was requested.

6.3.2. Synopsis

tc  qdisc ... [divisor hashtablesize] [limit packets] [perturb seconds] [quantum bytes] [flows number] [depth number] [head drop] [redflowlimit bytes] [min bytes] [max bytes] [avpkt bytes] [burst packets] [probability P] [ecn] [harddrop]
        

6.3.3. Parameters

  • divisor: can be used to set a different hash table size, available from kernel 2.6.39 onwards. The specified divisor must be a power of two and cannot be larger than 65536. Default value is 1024.

  • limit: upper limit of the SFQ. Can be used to reduce the default length of 127 packets. After linux-3.3, it can be raised.

  • depth: limit of packets per flow (after linux-3.3). Default to 127 and can be lowered.

  • perturb: interval in seconds for queue algorithm perturbation. Defaults to 0, which means that no perturbation occurs. Do not set too low for each perturbation may cause some packet reordering or losses. Advised value is 60. This value has no effect when external flow classification is used. Its better to increase divisor value to lower risk of hash collisions.

  • quantum: amount of bytes a flow is allowed to dequeue during a round of the round robin process. Defaults to the MTU of the interface which is also the advised value and the minimum value.

  • flows: after linux-3.3, it is possible to change the default limit of flows. Default value is 127.

  • headdrop: default SFQ behavior is to perform tail-drop of packets from a flow. You can ask a headdrop instead, as this is known to provide a better feedback for TCP flows

  • redflowlimit: configure the optional RED module on top of each SFQ flow. Random Early Detection principle is to perform packet marks or drops in a probabilistic way. Redflowlimit configures the hard limit on the real (not average) queue size per SFQ flow in bytes.

  • min: average queue size at which marking becomes a possibility. Defaults to max /3.

  • max: at this average queue size, the marking probability is maximal. Defaults to redflowlimit /4.

  • probability: maximum probability for marking, specified as a floating point number from 0.0 to 1.0. Default value is 0.02.

  • avpkt: specified in bytes. Used with burst to determine the time constant for average queue size calculations. Default value is 1000.

  • burst: used for determining how fast the average queue size is influenced by the real queue size. Default value is: (2 * min + max) / (3 * avpkt).

  • ecn: RED can either 'mark' or 'drop'. Explicit Congestion Notification allows RED to notify remote hosts that their rate exceeds the amount of bandwidth available. Non-ECN capable hosts can only be notified by dropping a packet. If this parameter is specified, packets which indicate that their hosts honor ECN will only be marked and not dropped, unless the queue size hits depth packets.

  • harddrop: if average flow queue size is above max bytes, this parameter forces a drop instead of ecn marking.

6.3.4. Example and Usage

To attach to device ppp0:

$ tc qdisc add dev ppp0 root sfq
    

Please note that SFQ, like all non-shaping (work-conserving) qdiscs, is only useful if it owns the queue. This is the case when the link speed equals the actually available bandwidth. This holds for regular phone modems, ISDN connections and direct non-switched ethernet links.

Most often, cable modems and DSL devices do not fall into this category. The same holds for when connected to a switch and trying to send data to a congested segment also connected to the switch. In this case, the effective queue does not reside within Linux and is therefore not available for scheduling. Embed SFQ in a classful qdisc to make sure it owns the queue.

It is possible to use external classifiers with sfq, for example to hash traffic based only on source/destination ip addresses

$ tc filter add ... flow hash keys src,dst perturb 30 divisor 1024
    

Note that the given divisor should match the one used by sfq. If you have changed the sfq default of 1024, use the same value for the flow hash filter, too.

Example 7. SFQ with optional RED mode

[root@leander]# tc qdisc add dev eth0 parent 1:1 handle 10: sfq limit 3000 flows 512 divisor 16384 redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn headdrop
        

Example 8. Creating an SFQ

[root@leander]# cat sfq.tcc
/*
 * make an SFQ on eth0 with a 10 second perturbation
 *
 */

dev eth0 {
    egress {
        sfq( perturb 10s );
    }
}
[root@leander]# tcc < sfq.tcc
# ================================ Device eth0 ================================

tc qdisc add dev eth0 handle 1:0 root dsmark indices 1 default_index 0
tc qdisc add dev eth0 handle 2:0 parent 1:0 sfq perturb 10
      

Unfortunately, some clever software (e.g. Kazaa and eMule among others) obliterate the benefit of this attempt at fair queuing by opening as many TCP sessions (flows) as can be sustained. In many networks, with well-behaved users, SFQ can adequately distribute the network resources to the contending flows, but other measures may be called for when obnoxious applications have invaded the network.

See also Section 6.4, “ESFQ, Extended Stochastic Fair Queuing” for an SFQ qdisc with more exposed parameters for the user to manipulate.

6.4. ESFQ, Extended Stochastic Fair Queuing

Conceptually, this qdisc is no different than SFQ although it allows the user to control more parameters than its simpler cousin. This qdisc was conceived to overcome the shortcoming of SFQ identified above. By allowing the user to control which hashing algorithm is used for distributing access to network bandwidth, it is possible for the user to reach a fairer real distribution of bandwidth.

Example 9. ESFQ usage

Usage: ... esfq [ perturb SECS ] [ quantum BYTES ] [ depth FLOWS ]
        [ divisor HASHBITS ] [ limit PKTS ] [ hash HASHTYPE]

Where:
HASHTYPE := { classic | src | dst }
      

FIXME; need practical experience and/or attestation here.

6.5. RED,Random Early Drop

6.5.1. Description

Random Early Detection is a classless qdisc which manages its queue size smartly. Regular queues simply drop packets from the tail when they are full, which may not be the optimal behaviour. RED also performs tail drop, but does so in a more gradual way.

Once the queue hits a certain average length, packets enqueued have a configurable chance of being marked (which may mean dropped). This chance increases linearly up to a point called the max average queue length, although the queue might get bigger.

This has a host of benefits over simple taildrop, while not being processor intensive. It prevents synchronous retransmits after a burst in traffic, which cause further retransmits, etc. The goal is to have a small queue size, which is good for interactivity while not disturbing TCP/IP traffic with too many sudden drops after a burst of traffic.

Depending on if ECN is configured, marking either means dropping or purely marking a packet as overlimit.

6.5.2. Algorithm

The average queue size is used for determining the marking probability. This is calculated using an Exponential Weighted Moving Average, which can be more or less sensitive to bursts. When the average queue size is below min bytes, no packet will ever be marked. When it exceeds min, the probability of doing so climbs linearly up to probability, until the average queue size hits max bytes. Because probability is normally not set to 100%, the queue size might conceivably rise above max bytes, so the limit parameter is provided to set a hard maximum for the size of the queue.

6.5.3. Synopsis

$ tc  qdisc ... red limit bytes [min bytes] [max bytes] avpkt bytes [burst packets] [ecn] [harddrop] [bandwidth rate]  [probability chance] [adaptive]
                

6.5.4. Parameters

  • min: average queue size at which marking becomes a possibility. Defaults to max/3.

  • max: at this average queue size, the marking probability is maximal. Should be at least twice min to prevent synchronous retransmits, higher for low min. Default to limit/4.

  • probability: maximum probability for marking, specified as a floating point number from 0.0 to 1.0. Suggested values are 0.01 or 0.02 (1 or 2%, respectively). Default is 0.02.

  • limit: hard limit on the real (not average) queue size in bytes. Further packets are dropped. Should be set higher than max+burst. It is advised to set this a few times higher than max.

  • burst: used for determining how fast the average queue size is influenced by the real queue size. Larger values make the calculation more sluggish, allowing longer bursts of traffic before marking starts. Real life experiments support the following guideline (min+min+max)/(3*avpkt).

  • Avpkt: specified in bytes. Used with burst to determine the time constant for average queue size calculations. 1000 is a good value.

  • bandwidth: this rate is used for calculating the average queue size after some idle time. Should be set to the bandwidth of your interface. Does not mean that RED will shape for you! Optional. Default is 10Mbit.

  • ecn: as mentioned before, RED can either 'mark' or 'drop'. Explicit Congestion Notification allows RED to notify remote hosts that their rate exceeds the amount of bandwidth available. Non-ECN capable hosts can only be notified by dropping a packet. If this parameter is specified, packets which indicate that their hosts honor ECN will only be marked and not dropped, unless the queue size hits limit bytes. Recommended.

  • harddrop: If average flow queue size is above max bytes, this parameterv forces a drop instead of ecn marking.

  • adaptive: (added in linux-3.3) Sets RED in adaptive mode as described in http://icir.org/floyd/papers/adaptiveRed.pdf. Goal of Adaptive RED is to make 'probability' dynamic value between 1% and 50% to reach the target average queue: (max - min) / 2.

6.5.5. Example

# tc qdisc add dev eth0 parent 1:1 handle 10: red limit 400000 min 30000 max 90000 avpkt 1000 burst 55 ecn adaptive bandwidth 10Mbit
            

6.6. GRED, Generic Random Early Drop

Generalized RED is used in DiffServ implementation and it has virtual queue (VQ) within physical queue. Currently, the number of virtual queues is limited to 16.

GRED is configured in two steps. First the generic parameters are configured to select the number of virtual queues DPs and whether to turn on the RIO-like buffer sharing scheme. Also at this point, a default virtual queue is selected.

The second step is used to set parameters for individual virtual queues.

6.6.1. Synopsis

... gred DP drop-probability limit BYTES min BYTES max BYTES avpkt BYTES burst PACKETS probability PROBABILITY bandwidth KBPS [prio value]

OR

... gred setup DPs "num of DPs" default "default DP" [grio]
        

6.6.2. Parameter

  • setup: identifies that this is a generic setup for GRED;

  • DPs: is the number of virtual queues;

  • default: specifies default virtual queue;

  • grio: turns on the RIO-like buffering scheme;

  • limit: defines the virtual queue “physical” limit in bytes;

  • min: defines the minimum threshold value in bytes;

  • max: defines the maximum threshold value in bytes;

  • avpkt: is the average packet size in bytes;

  • bandwidth: is the wire-speed of the interface;

  • burst: is the number of average-sized packets allowed to burst;

  • probability: defines the drop probability in the range (0…);

  • DP: identifies the virtual queue assigned to these parameters;

  • prio: identifies the virtual queue priority if grio was set in general parameters;

6.7. TBF, Token Bucket Filter

This qdisc is built on tokens and buckets. It simply shapes traffic transmitted on an interface. To limit the speed at which packets will be dequeued from a particular interface, the TBF qdisc is the perfect solution. It simply slows down transmitted traffic to the specified rate.

Packets are only transmitted if there are sufficient tokens available. Otherwise, packets are deferred. Delaying packets in this fashion will introduce an artificial latency into the packet's round trip time.

Figure 10: Token Bucket Filter (TBF)

6.7.1. Algorithm

As the name implies, traffic is filtered based on the expenditure of tokens (non-work-conserving). Tokens roughly correspond to bytes, with the additional constraint that each packet consumes some tokens, no matter how small it is. This reflects the fact that even a zero-sized packet occupies the link for some time. On creation, the TBF is stocked with tokens which correspond to the amount of traffic that can be burst in one go. Tokens arrive at a steady rate, until the bucket is full. If no tokens are available, packets are queued, up to a configured limit. The TBF now calculates the token deficit, and throttles until the first packet in the queue can be sent. If it is not acceptable to burst out packets at maximum speed, a peakrate can be configured to limit the speed at which the bucket empties. This peakrate is implemented as a second TBF with a very small bucket, so that it doesn't burst.

6.7.2. Parameters

  • limit or latency: limit is the number of bytes that can be queued waiting for tokens to become available. You can also specify this the other way around by setting the latency parameter, which specifies the maximum amount of time a packet can sit in the TBF. The latter calculation takes into account the size of the bucket, the rate and possibly the peakrate (if set). These two parameters are mutually exclusive.

  • Burst: also known as buffer or maxburst. Size of the bucket, in bytes. This is the maximum amount of bytes that tokens can be available for instantaneously. In general, larger shaping rates require a larger buffer. For 10mbit/s on Intel, you need at least 10kbyte buffer if you want to reach your configured rate. If your buffer is too small, packets may be dropped because more tokens arrive per timer tick than fit in your bucket. The minimum buffer size can be calculated by dividing the rate by HZ.

    Token usage calculations are performed using a table which by default has a resolution of 8 packets. This resolution can be changed by specifying the cell size with the burst. For example, to specify a 6000 byte buffer with a 16 byte cell size, set a burst of 6000/16. You will probably never have to set this. Must be an integral power of 2.

  • Mpu: a zero-sized packet does not use zero bandwidth. For ethernet, no packet uses less than 64 bytes. The Minimum Packet Unit determines the minimal token usage (specified in bytes) for a packet. Defaults to zero.

  • Rate: the speed knob. Furthermore, if a peakrate is desired, the following parameters are available:

  • peakrate: maximum depletion rate of the bucket. The peakrate does not need to be set, it is only necessary if perfect millisecond timescale shaping is required.

  • mtu/minburst: specifies the size of the peakrate bucket. For perfect accuracy, should be set to the MTU of the interface. If a peakrate is needed, but some burstiness is acceptable, this size can be raised. A 3000 byte minburst allows around 3mbit/s of peakrate, given 1000 byte packets. Like the regular burstsize you can also specify a cell size.

6.7.3. Example

Example 10. Creating a 256kbit/s TBF

[root@leander]# cat tbf.tcc
/*
 * make a 256kbit/s TBF on eth0
 *
 */

dev eth0 {
    egress {
        tbf( rate 256 kbps, burst 20 kB, limit 20 kB, mtu 1514 B );
    }
}
[root@leander]# tcc < tbf.tcc
# ================================ Device eth0 ================================

tc qdisc add dev eth0 handle 1:0 root dsmark indices 1 default_index 0
tc qdisc add dev eth0 handle 2:0 parent 1:0 tbf burst 20480 limit 20480 mtu 1514 rate 32000bps