5. Software and Tools

5.1. Kernel requirements

Many distributions provide kernels with modular or monolithic support for traffic control (Quality of Service). Custom kernels may not already provide support (modular or not) for the required features. If not, this is a very brief listing of the required kernel options.

The user who has little or no experience compiling a kernel is referred to the Kernel HOWTO. Experienced kernel compilers should be able to determine which of the options below apply to the desired configuration, after reading a bit more about traffic control and planning.

Example 1. Kernel compilation options [8]

# QoS and/or fair queueing

A kernel compiled with the above set of options will provide modular support for almost everything discussed in this documentation. The user may need to load the relevant module with modprobe before using a given feature. Again, the confused user is referred to the Kernel HOWTO, as this document cannot adequately address questions about the use of the Linux kernel.
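As a rough sketch of how one might inspect an existing kernel before rebuilding it, the function below lists the traffic control options recorded in a kernel configuration file. The function name list_sched_options is hypothetical, and the location of the configuration file is an assumption: many distributions install it as /boot/config-$(uname -r), but not all do.

```shell
# list_sched_options FILE
# Print the traffic control (scheduler and classifier) options found in a
# kernel configuration file. "=y" means built in, "=m" means modular.
list_sched_options() {
    grep -E '^CONFIG_NET_(SCHED|SCH_[A-Z0-9_]+|CLS(_[A-Z0-9_]+)?)=' "$1"
}

# Typical use (the path is an assumption, adjust for your distribution):
#   list_sched_options "/boot/config-$(uname -r)"
#
# An option built as a module (=m) may need loading before first use,
# for example the HTB scheduler module:
#   modprobe sch_htb
```

Options that are disabled appear in the file as comments ("# CONFIG_... is not set") and are deliberately not matched.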

5.2. iproute2 tools (tc)

iproute2 is a suite of command line utilities which manipulate kernel structures for IP networking configuration on a machine. For technical documentation on these tools, see the iproute2 documentation and for a more expository discussion, the documentation at linux-ip.net. Of the tools in the iproute2 package, the binary tc is the only one used for traffic control. This HOWTO will ignore the other tools in the suite.

Because it interacts with the kernel to direct the creation, deletion and modification of traffic control structures, the tc binary needs to be compiled with support for all of the qdiscs you wish to use. In particular, the HTB qdisc is not supported yet in the upstream iproute2 package. See Section 7.1, “HTB, Hierarchical Token Bucket” for more information.

The tc tool performs all of the configuration of the kernel structures required to support traffic control. As a result of its many uses, the command syntax can be described (at best) as arcane. The utility takes as its first non-option argument one of three Linux traffic control components, qdisc, class or filter.

Example 2. tc command usage

[root@leander]# tc
Usage: tc [ OPTIONS ] OBJECT { COMMAND | help }
where  OBJECT := { qdisc | class | filter }
       OPTIONS := { -s[tatistics] | -d[etails] | -r[aw] }

Each object accepts further and different options, and will be incompletely described and documented below. The hints in the examples below are designed to introduce the vagaries of tc command line syntax. For more examples, consult the LARTC HOWTO. For even better understanding, consult the kernel and iproute2 code.

Example 3. tc qdisc

[root@leander]# tc qdisc add    \ (1)
>                  dev eth0     \ (2)
>                  root         \ (3)
>                  handle 1:0   \ (4)
>                  htb            (5)

(1) Add a queuing discipline. The verb could also be del.

(2) Specify the device onto which we are attaching the new queuing discipline.

(3) To tc, this means egress. The word root must be used, however. Another qdisc with limited functionality, the ingress qdisc, can be attached to the same device.

(4) The handle is a user-specified number of the form major:minor. The minor number for any queueing discipline handle must always be zero (0). An acceptable shorthand for a qdisc handle is the syntax "1:", where the minor number is assumed to be zero (0) if not specified.

(5) This is the queuing discipline to attach, HTB in this example. Queuing discipline specific parameters will follow this. In the example here, we add no qdisc-specific parameters.
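The handle rules described above (a qdisc handle always has minor number zero, and "1:" is shorthand for "1:0") can be sketched as two small shell helpers. The names normalize_handle and is_qdisc_handle are hypothetical and are not part of tc; this is only an illustration of the convention, not how tc itself parses handles.

```shell
# normalize_handle HANDLE
# Expand the shorthand "1:" to "1:0"; pass a full major:minor through as is.
# Returns non-zero if the argument contains no colon at all.
normalize_handle() {
    case "$1" in
        *:)  printf '%s0\n' "$1" ;;
        *:*) printf '%s\n' "$1" ;;
        *)   return 1 ;;
    esac
}

# is_qdisc_handle HANDLE
# Succeed only if the minor number is zero, as required for a qdisc.
is_qdisc_handle() {
    h=$(normalize_handle "$1") || return 1
    [ "${h#*:}" = "0" ]
}

normalize_handle 1:      # prints 1:0
```

With these helpers, is_qdisc_handle succeeds for "1:" and "1:0" and fails for a class identifier such as "1:6".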

Above was the simplest use of the tc utility for adding a queuing discipline to a device. Here's an example of the use of tc to add a class to an existing parent class.

Example 4. tc class

[root@leander]# tc class add    \ (1)
>                  dev eth0     \ (2)
>                  parent 1:1   \ (3)
>                  classid 1:6  \ (4)
>                  htb          \ (5)
>                  rate 256kbit \ (6)
>                  ceil 512kbit   (7)

(1) Add a class. The verb could also be del.

(2) Specify the device onto which we are attaching the new class.

(3) Specify the parent handle to which we are attaching the new class.

(4) This is a unique handle (major:minor) identifying this class. The minor number may be any non-zero number.

(5) Both of the classful qdiscs require that any child classes be of the same type as the parent. Thus an HTB qdisc will contain HTB classes.

(6) (7) These are class-specific parameters. Consult Section 7.1, “HTB, Hierarchical Token Bucket” for more detail on these parameters.

Example 5. tc filter

[root@leander]# tc filter add               \ (1)
>                  dev eth0                 \ (2)
>                  parent 1:0               \ (3)
>                  protocol ip              \ (4)
>                  prio 5                   \ (5)
>                  u32                      \ (6)
>                  match ip dport 22 0xffff \ (7)
>                  match ip tos 0x10 0xff   \ (8)
>                  flowid 1:6               \ (9)
>                  police                   \ (10)
>                  rate 32000bps            \ (11)
>                  burst 10240              \ (12)
>                  mpu 0                    \ (13)
>                  action drop/continue       (14)

(1) Add a filter. The verb could also be del.

(2) Specify the device onto which we are attaching the new filter.

(3) Specify the parent handle to which we are attaching the new filter.

(4) This parameter is required. It names the protocol (here, IP) of the packets the filter examines.

(5) The prio parameter allows a given filter to be preferred above another. The keyword pref is a synonym.

(6) This is a classifier, and is a required phrase in every tc filter command.

(7) (8) These are parameters to the classifier. In this case, packets with a type of service flag (indicating interactive usage) and matching destination port 22 will be selected by this statement. The u32 classifier distinguishes source and destination ports with the keywords sport and dport.

(9) The flowid specifies the handle of the target class (or qdisc) to which a matching filter should send its selected packets.

(10) This is the policer, and is an optional phrase in every tc filter command.

(11) The policer will perform one action above this rate, and another action below (see action parameter).

(12) The burst is an exact analog to burst in HTB (burst is a buckets concept).

(13) The minimum policed unit. To count all traffic, use an mpu of zero (0).

(14) The action indicates what should be done with a packet, based on the attributes of the policer. The first word specifies the action to take if the policer's rate has been exceeded. The second word specifies the action to take otherwise.
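The units in the policer phrase are easy to misread: in tc, the suffix bps means bytes per second, not bits per second, and a bare burst value is counted in bytes. The arithmetic below is a small worked example using the figures from Example 5; the helper names bps_to_kbit and burst_ms are hypothetical.

```shell
# bps_to_kbit BYTES_PER_SECOND
# tc's "bps" suffix means bytes per second, so multiply by 8 to get bits.
bps_to_kbit() {
    echo $(( $1 * 8 / 1000 ))
}

# burst_ms BYTES RATE
# Milliseconds needed to send BYTES at RATE bytes per second.
burst_ms() {
    echo $(( $1 * 1000 / $2 ))
}

bps_to_kbit 32000        # prints 256
burst_ms 10240 32000     # prints 320
```

In other words, rate 32000bps polices at 256kbit, and a burst of 10240 bytes corresponds to roughly a third of a second of traffic at that rate.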

As evidenced above, the tc command line utility has an arcane and complex syntax, even for operations as simple as those in these examples. It should come as no surprise to the reader that there exists an easier way to configure Linux traffic control. See the next section, Section 5.3, “tcng, Traffic Control Next Generation”.

5.3. tcng, Traffic Control Next Generation

FIXME; sing the praises of tcng. See also Traffic Control using tcng and HTB HOWTO and tcng documentation.

Traffic control next generation (hereafter, tcng) provides all of the power of traffic control under Linux with twenty percent of the headache.

5.4. Netfilter

Netfilter is a framework provided by the Linux kernel that allows various networking-related operations to be implemented in the form of customized handlers. Netfilter offers various functions and operations for packet filtering, network address translation, and port translation, which provide the functionality required for directing packets through a network, as well as the ability to prohibit packets from reaching sensitive locations within a computer network.

Netfilter represents a set of hooks inside the Linux kernel, allowing specific kernel modules to register callback functions with the kernel's networking stack. Those functions, usually applied to the traffic in form of filtering and modification rules, are called for every packet that traverses the respective hook within the networking stack.

5.4.1. iptables

iptables is a user-space application program that allows a system administrator to configure the tables provided by the Linux kernel firewall (implemented as different Netfilter modules) and the chains and rules it stores. Different kernel modules and programs are currently used for different protocols; iptables applies to IPv4, ip6tables to IPv6, arptables to ARP, and ebtables to Ethernet frames.

iptables requires elevated privileges to operate and must be executed by the user root; otherwise it fails to function. On most Linux systems, iptables is installed as /usr/sbin/iptables and documented in its man page, which can be opened using man iptables when installed. It may also be found in /sbin/iptables, but since iptables is more like a service than an "essential binary", the preferred location remains /usr/sbin.

The term iptables is also commonly used to inclusively refer to the kernel-level components. x_tables is the name of the kernel module carrying the shared code portion used by all four modules that also provides the API used for extensions; subsequently, Xtables is more or less used to refer to the entire firewall (v4, v6, arp, and eb) architecture.

Xtables allows the system administrator to define tables containing chains of rules for the treatment of packets. Each table is associated with a different kind of packet processing. Packets are processed by sequentially traversing the rules in chains. A rule in a chain can cause a goto or jump to another chain, and this can be repeated to whatever level of nesting is desired. (A jump is like a “call”, i.e. the point that was jumped from is remembered.) Every network packet arriving at or leaving from the computer traverses at least one chain.

Figure 5: Packet flow paths. Packets start at a given box and will flow along a certain path, depending on the circumstances.

The origin of the packet determines which chain it traverses initially. There are five predefined chains (mapping to the five available Netfilter hooks, see figure 5), though a table may not have all chains.

Figure 6: netfilter’s hook

Predefined chains have a policy, for example DROP, which is applied to the packet if it reaches the end of the chain. The system administrator can create as many other chains as desired. These chains have no policy; if a packet reaches the end of the chain it is returned to the chain which called it. A chain may be empty.

  • PREROUTING: Packets will enter this chain before a routing decision is made (point 1 in Figure 6).

  • INPUT: Packet is going to be locally delivered. It does not have anything to do with processes having an opened socket; local delivery is controlled by the "local-delivery" routing table: ip route show table local (point 2 in Figure 6).

  • FORWARD: All packets that have been routed and were not for local delivery will traverse this chain (point 3 in Figure 6).

  • OUTPUT: Packets sent from the machine itself will visit this chain (point 5 in Figure 6).

  • POSTROUTING: Routing decision has been made. Packets enter this chain just before handing them off to the hardware (point 4 in Figure 6).

Each rule in a chain contains the specification of which packets it matches. It may also contain a target (used for extensions) or verdict (one of the built-in decisions). As a packet traverses a chain, each rule in turn is examined. If a rule does not match the packet, the packet is passed to the next rule. If a rule does match the packet, the rule takes the action indicated by the target/verdict, which may result in the packet being allowed to continue along the chain or it may not. Matches make up the large part of rulesets, as they contain the conditions packets are tested for. These can happen for about any layer in the OSI model, as with e.g. the --mac-source and -p tcp --dport parameters, and there are also protocol-independent matches, such as -m time.

The packet continues to traverse the chain until either

  • a rule matches the packet and decides the ultimate fate of the packet, for example by calling one of the ACCEPT or DROP verdicts, or a module returning such an ultimate fate; or

  • a rule calls the RETURN verdict, in which case processing returns to the calling chain; or

  • the end of the chain is reached; traversal either continues in the parent chain (as if RETURN was used), or the base chain policy, which is an ultimate fate, is used.

Targets also return a verdict like ACCEPT (NAT modules will do this) or DROP (e.g. the REJECT module), but may also imply CONTINUE (e.g. the LOG module; CONTINUE is an internal name) to continue with the next rule as if no target/verdict was specified at all.
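The jump, ultimate-fate and RETURN behaviour described above can be illustrated with a short rule set. This is a minimal sketch, not a recommended firewall: the chain name ssh_in and the source range 192.0.2.0/24 are placeholders, and the function only prints the commands so they can be reviewed before being run as root.

```shell
# ssh_chain_rules
# Print (do not run) rules that send inbound SSH through a user-defined chain.
# "ssh_in" and 192.0.2.0/24 are placeholders chosen for illustration.
ssh_chain_rules() {
    cat <<'EOF'
iptables -N ssh_in
iptables -A INPUT -p tcp --dport 22 -j ssh_in
iptables -A ssh_in -s 192.0.2.0/24 -j ACCEPT
iptables -A ssh_in -j RETURN
EOF
}

ssh_chain_rules    # review the output; pipe it to "sh" as root to apply
```

A packet for port 22 jumps from INPUT into ssh_in. If it comes from the listed range, ACCEPT decides its fate and traversal stops. Otherwise RETURN sends it back to INPUT, where the rule following the jump is examined next. The explicit RETURN is shown for clarity only, since reaching the end of a user-defined chain has the same effect.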

5.5. IMQ, Intermediate Queuing device

The Intermediate Queuing device is not a qdisc, but its usage is tightly bound to qdiscs. Within Linux, qdiscs are attached to network devices, and everything that is queued to the device is first queued to the qdisc and then to the driver queue. From this concept, two limitations arise:

  • Only egress shaping is possible (an ingress qdisc exists, but its possibilities are very limited compared to classful qdiscs, as seen before).

  • A qdisc can only see the traffic of one interface; global limitations can't be placed.

IMQ is there to help solve those two limitations. In short, you can put everything you choose in a qdisc. Specially marked packets get intercepted in netfilter NF_IP_PRE_ROUTING and NF_IP_POST_ROUTING hooks and pass through the qdisc attached to an imq device. An iptables target is used for marking the packets.

This enables you to do ingress shaping, as you can just mark packets coming in from somewhere, and/or treat interfaces as classes to set global limits. You can also do lots of other things, such as putting just your HTTP traffic in a qdisc, putting new connection requests in a qdisc, etc.

5.5.1. Sample configuration

The first thing that might come to mind is to use ingress shaping to give yourself a high guaranteed bandwidth. Configuration is just like with any other interface:

tc qdisc add dev imq0 root handle 1: htb default 20

tc class add dev imq0 parent 1: classid 1:1 htb rate 2mbit burst 15k

tc class add dev imq0 parent 1:1 classid 1:10 htb rate 1mbit
tc class add dev imq0 parent 1:1 classid 1:20 htb rate 1mbit

tc qdisc add dev imq0 parent 1:10 handle 10: pfifo
tc qdisc add dev imq0 parent 1:20 handle 20: sfq
tc filter add dev imq0 parent 10:0 protocol ip prio 1 u32 match \
        ip dst flowid 1:10

In this example, u32 is used for classification. Other classifiers should work as expected. Next, traffic has to be selected and marked to be enqueued to imq0.

iptables -t mangle -A PREROUTING -i eth0 -j IMQ --todev 0

ip link set imq0 up

The IMQ iptables target is valid in the PREROUTING and POSTROUTING chains of the mangle table. Its syntax is

IMQ [ --todev n ]	n : number of imq device

An ip6tables target is also provided.

Please note traffic is not enqueued when the target is hit but afterwards. The exact location where traffic enters the imq device depends on the direction of the traffic (in/out). These are the predefined netfilter hooks used by iptables:

enum nf_ip_hook_priorities {
            NF_IP_PRI_FIRST = INT_MIN,
            NF_IP_PRI_CONNTRACK = -200,
            NF_IP_PRI_MANGLE = -150,
            NF_IP_PRI_NAT_DST = -100,
            NF_IP_PRI_FILTER = 0,
            NF_IP_PRI_NAT_SRC = 100,
            NF_IP_PRI_LAST = INT_MAX,
};

For ingress traffic, imq registers itself with NF_IP_PRI_MANGLE + 1 priority which means packets enter the imq device directly after the mangle PREROUTING chain has been passed.

For egress, imq uses NF_IP_PRI_LAST, which honours the fact that packets dropped by the filter table won't occupy bandwidth.
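To make the ingress ordering concrete, the sketch below sorts a few hook priorities from the enum above together with imq's ingress priority (NF_IP_PRI_MANGLE + 1, that is, -149). Handlers registered on a hook are called in ascending priority order. The lower-case labels and the function name hook_order are only illustrative.

```shell
# hook_order
# Read "name priority" pairs on standard input and print them sorted by
# ascending priority, the order in which Netfilter calls the handlers.
hook_order() {
    sort -k2,2n
}

printf '%s\n' \
    'filter 0' \
    'imq -149' \
    'nat_dst -100' \
    'conntrack -200' \
    'mangle -150' | hook_order
```

The sorted list places imq directly after mangle and before nat_dst, which matches the statement that ingress packets enter the imq device right after the mangle PREROUTING chain.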

5.6. ethtool, Driver Queue

The ethtool command is used to control the driver queue size for Ethernet devices. ethtool also provides low level interface statistics as well as the ability to enable and disable IP stack and driver features.

The -g flag to ethtool displays the driver queue (ring) parameters (see Figure 1):

$ ethtool -g eth0

    Ring parameters for eth0:
    Pre-set maximums:
    RX:        16384
    RX Mini:    0
    RX Jumbo:    0
    TX:        16384
    Current hardware settings:
    RX:        512
    RX Mini:    0
    RX Jumbo:    0
    TX:        256

You can see from the above output that the driver for this NIC defaults to 256 descriptors in the transmission queue. It was often recommended to reduce the size of the driver queue in order to reduce latency. With the introduction of BQL (assuming your NIC driver supports it), there is no longer any reason to modify the driver queue size (see below for how to configure BQL).
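For scripting, the current transmit ring size can be pulled out of the ethtool -g output. The helper below is a minimal sketch that assumes the output layout shown above; the name current_tx_ring is hypothetical, and other ethtool versions or drivers may format the output differently.

```shell
# current_tx_ring
# Read "ethtool -g" output on standard input and print the TX value from
# the "Current hardware settings" block (not the pre-set maximum).
current_tx_ring() {
    awk '/Current hardware settings:/ { cur = 1 }
         cur && $1 == "TX:"           { print $2; exit }'
}

# Typical use (needs ethtool and a real interface):
#   ethtool -g eth0 | current_tx_ring
#
# Before BQL, the ring was sometimes shrunk with the -G flag, for example
# (128 is an arbitrary illustrative value):
#   ethtool -G eth0 tx 128
```

Fed the sample output above, the helper prints 256.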

ethtool also allows you to manage optimization features such as TSO, UFO and GSO. The -k flag displays the current offload settings and -K modifies them.

$ ethtool -k eth0

    Offload parameters for eth0:
    rx-checksumming: off
    tx-checksumming: off
    scatter-gather: off
    tcp-segmentation-offload: off
    udp-fragmentation-offload: off
    generic-segmentation-offload: off
    generic-receive-offload: on
    large-receive-offload: off
    rx-vlan-offload: off
    tx-vlan-offload: off
    ntuple-filters: off
    receive-hashing: off

Since TSO, GSO, UFO and GRO greatly increase the number of bytes which can be queued in the driver queue, you should disable these optimizations if you want to optimize for latency over throughput. It’s doubtful you will notice any CPU impact or throughput decrease when disabling these features unless the system is handling very high data rates.
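Before disabling anything, it helps to see which offloads are actually enabled. The helper below is a sketch that assumes the "name: on" layout shown in the ethtool -k output above; the name enabled_offloads is hypothetical, and newer ethtool versions may append markers such as "[fixed]" that this simple match ignores.

```shell
# enabled_offloads
# Read "ethtool -k" output on standard input and print the name of every
# feature whose value is exactly "on".
enabled_offloads() {
    awk -F': *' '$2 == "on" { sub(/^[ \t]+/, "", $1); print $1 }'
}

# Typical use (needs ethtool and a real interface):
#   ethtool -k eth0 | enabled_offloads
#
# Offloads are switched off with the -K flag, for example:
#   ethtool -K eth0 tso off gso off gro off
```

Fed the sample output above, the helper prints generic-receive-offload, the only feature reported as on.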

[8] The options listed in this example are taken from a 2.4.20 kernel source tree. The exact options may differ slightly from kernel release to kernel release depending on patches and new schedulers and classifiers.