4. What To Optimize

4.1. First, add more memory

Max out your memory. Having lots of free memory will improve your virtual-memory performance (and Unix takes advantage of extra memory more effectively than Windows does). Fortunately, with RAM as cheap as it is now, a gigabyte or three is unlikely to bust your budget even if you're economizing.

4.2. Bus and Disk speeds

Most people think of the processor as the most important choice in specifying any kind of personal-computer system. But for typical job loads under Linux, the processor type is nearly a red herring — it's far more important to specify a capable system bus and disk I/O subsystem. If you don't believe this, you may find it enlightening to keep top(1) running for a while as you use your machine. Notice how seldom the CPU idle percentage drops below 90%!

It's true that after people upgrade their motherboards they often do report a throughput increase. But this is often more due to other changes that go with the processor upgrade, such as improved cache memory or an increase in the clocking speed of the system's front-side bus (enabling data to get in and out of the processor faster).

If you're buying for Linux on a fixed budget, it makes sense to trade away some excess processor clocks to get a faster bus and disk subsystem. If you're building a monster hot-rod, go ahead and buy that fastest available processor — but once you've gotten past that gearhead desire for big numbers, pay careful attention to bus speeds and your disk subsystem, because that's where you'll get the serious performance wins. The gap between processor speed and I/O subsystem throughput has only widened in the last five years.

How does it translate into a recipe in 2010? Like this; if you're building a hot rod,

  • Do buy a machine with the fastest available "front-side" (e.g. processor-to-memory) bus.

  • Do get the fastest SATA disks you can get your hands on.

4.3. Optimizing your disk subsystem

For the fastest disks you can find, pay close attention to average seek and latency time. The former is an average time required to seek to any track; the latter is the maximum time required for any sector on a track to come under the heads, and is a function of the disk's rotation speed.

Of these, average seek time is much more important. When you're running Linux or any other virtual-memory operating system, a one millisecond faster seek time can make a really substantial difference in system throughput. Back when PC processors were slow enough for the comparison to be possible (and I was running System V Unix), it was easily worth as much as a 30MHz increment in processor speed. Today the corresponding figure would probably be as much as 300MHz!

The manufacturers themselves avoid running up seek time on the larger-capacity drives by stacking platters vertically rather than increasing the platter size. Thus, seek time (which is proportional to the platter radius and head-motion speed) tends to be constant across different capacities in the same product line. This is good because it means you don't have to worry about a capacity-vs.-speed tradeoff.

Average drive latency is inversely proportional to the disk's rotational speed. For years, most disks spun at 3600 rpm; most disks now spin at 7,200 or 10,000rpm, and high-end disks at 15,000 rpm. These fast-spin disks run extremely hot; cooling is becoming a critical constraint in drive design.

For years, your basic decision was SATA vs. SCSI (the older IDE and EIDE buses are long obsolete). Not in 2009; SATA 3 devices and controllers are good enough that the performance advantage of SCSI is marginal unless you are designing a super-high-end server box - slightly faster transfer speeds (320MB/s vs. 300MB/s) and slightly better susrained throughput.

The SCSI price premium entailed in an extra controller and more expensive disks are no longer worth it for the home builder, even from the point of view of grizzled old SCSI partisans like me. Accordingly, I've dropped most of the detailed SCSI information I used to carry here.

Final note: Solid-state drives loom on the horizon as replacements for SATA disks, but the price per megabyte is still high enough that as yet they're only being deployed in small capacities on netbooks. Watch this space.

4.4. Tuning Your I/O Subsystem

(This section comes to us courtesy of Perry The Cynic, <perry@sutr.cynic.org>; it was written in 1998. My own experience agrees pretty completely with his. I have revised the numbers in it since to reflect more recent developments.)

Building a good I/O subsystem boils down to two major points: pick matched components so you don't over-build any piece without benefit, and construct the whole pipe such that it can feed what your OS/application combo needs.

It's important to recognize that balance is with respect to not only a particular processor/memory subsystem, but also to a particular OS and application mix. A Unix server machine running the whole TCP/IP server suite has radically different I/O requirements than a video-editing workstation. For the big boys a good consultant will sample the I/O mix (by reading existing system performance logs or taking new measurements) and figure out how big the I/O system needs to be to satisfy that app mix. This is not something your typical Linux buyer will want to do; for one, the application mix is not static and will change over time. So what you'll do instead is design an I/O subsystem that is internally matched and provides maximum potential I/O performance for the money you're willing to spend. Then you look at the price points and compare them with those for the memory subsystem. That's the most important trade-off inside the box.

So the job now is to design and buy an I/O subsystem that is well matched to provide the best bang for your buck. The two major performance numbers for disk I/O are latency and bandwidth. Latency is how long a program has to wait to get a little piece of random data it asked for. Bandwidth is how much contiguous data can be sent to/from the disk once you've done the first piece. Latency is measured in milliseconds (ms); bandwidth in megabytes per second (MB/s). Obviously, a third number of interest is how big all of your disks are together (how much storage you've got), in Gigabytes (GB).

Within a rather big envelope, minimizing latency is the cat's meow. Every millisecond you shave off effective latency will make your system feel significantly faster. Bandwidth, on the other hand, only helps you if you suck a big chunk of contiguous data off the disk, which happens rarely to most programs. You have to keep bandwidth in mind to avoid mis-matching pieces, because (obviously) the lowest usable bandwidth in a pipe constrains everything else.

I'm going to ignore IDE. IDE is no good for multi-processing systems, period. You may use an IDE CD-ROM if you don't care about its performance, but if you care about your I/O performance, go SCSI. (Beware that if you mix an IDE CD-ROM with SCSI drives under Linux, you'll have to run a SCSI emulation layer that is a bit flaky.)

Let's look at the disks first. Whenever you seriously look at a disk, get its data sheet. Every reputable manufacturer has them on their website; just read off the product code and follow the bouncing lights. Beware of numbers (`<12ms fast!') you may see in ads; these folks often look for the lowest/highest numbers on the data sheet and stick them into the ad copy. Not dishonest (usually), but ignorant.

What you need to find out for a disk is:

  1. What kind of SCSI interface does it have? Look for "fast", "ultra", and "wide". Ignore disks that say "fiber" (this is a specialty physical layer not appropriate for the insides of small computers). Note that you'll often find the same disk with different interfaces.

  2. What is the "typical seek" time (ms)? Make sure you get "typical", not "track-to-track" or "maximum" or some other measure (these don't relate in obvious ways, due to things like head-settling time).

  3. What is the rotational speed? This is typically 4500, 5400, 7200, 10000, or 15000 rpm (rotations per minute). Also look for "rotational latency" (in ms). (In a pinch, average rotational latency is approx. 30000/rpm in milliseconds.)

  4. What is the ‘media transfer rate’ or speed (in MB/s)? Many disks will have a range of numbers (say, 7.2-10.8MB/s). Don't confuse this with the "interface transfer rate" which is always a round number (10 or 20 or 40MB/s) and is the speed of the SCSI bus itself.

These numbers will let you do apple-with-apples comparisons of disks. Beware that they will differ on different-size models of the same disk; typically, bigger disks have slower seek times.

Now what does it all mean? Bandwidth first: the `media transfer rate' is how much data you can, under ideal conditions, get off the disk per second. This is a function mostly of rotation speed; the faster the disk rotates, the more data passes under the heads per time unit. This constrains the sustained bandwidth of this disk.

More interestingly, your effective latency is the sum of typical seek time and rotational latency. So for a disk with 8.5ms seek time and 4ms rotational latency, you can expect to spend about 12.5ms between the moment the disk `wants' to read your data and the moment when it actually starts reading it. This is the one number you are trying to make small. Thus, you're looking for a disk with low seek times and high rotation (RPM) rates.

For comparison purposes, the first hard drive I ever bought was a 20MB drive with 65ms seek time and about 3000RPM rotation. A floppy drive has about 100-200ms seek time. A CD-ROM drive can be anywhere between 120ms (fast) and 400ms (slow). The best IDE harddrives have about 10-12ms and 5400 rpm. The best SCSI harddrive I know (the Fujitsu MAM) runs 3.9ms/15000rpm.

Fast, big drives are expensive. Really big drives are very expensive. Really fast drives are pretty expensive. On the other end, really slow, small drives are cheap but not cost effective, because it doesn't cost any less to make the cases, ship the drives, and sell them.

In between is a ‘sweet spot’ where moving in either direction (cheaper or more expensive) will cost you more than you get out of it. The sweet spot moves (towards better value) with time. If you can make the effort, go to your local computer superstore and write down a dozen or so drives they sell ‘naked’. (If they don't sell at least a dozen hard drives naked, find yourself a better store. Use the Web, Luke!) Plot cost against size, seek and rotational speed, and it will usually become pretty obvious which ones to get for your budget.

Do look for specials in stores; many superstores buy overstock from manufacturers. If this is near the sweet spot, it's often surprisingly cheaper than comparable drives. Just make sure you understand the warranty procedures.

Note that if you need a lot of capacity, you may be better off with two (or more) drives than a single, bigger one. Not only can it be cheaper but you end up with two separate head assemblies that move independently, which can cut down on latency quite a bit (see below).

If you find yourself at the high end of the bandwidth game, be aware that the theoretical maximum of the PCI bus itself is 132MB/s. That means that a dual ultra/wide SCSI controller (2x40MB/s) can fill more than half of the PCI bus's bandwidth, and it is not advised to add another fast controller to that mix. As it is, your device driver better be well written, or your entire system will melt down (figuratively speaking).

Incidentally, all of the numbers I used are ‘optimal’ bandwidth numbers. The real scoop is usually somewhere between 50-70% of nominal, but things tend to cancel out — the drives don't quite transfer as fast as they might, but the SCSI bus has overhead too, as does the controller card.

Whether you have a single disk or multiple ones, on one or several SCSI buses, you should give careful thought to their partition layout. Given a set of disks and controllers, this is the most crucial performance decision you'll make.

A partition is a contiguous group of sectors on the disk. Partitioning typically starts at the outside and proceeds inwards. All partitions on one disk share a single head assembly. That means that if you try to overlap I/O on the first and last partition of a disk, the heads must move full stroke back and forth over the disk, which can radically increase seek time delays. A partition that is in the middle of a partition stack is likely to have best seek performance, since at worst the heads only have to move half-way to get there (and they're likely to be around the area anyway).

Whenever possible, split partitions that compete onto different disks. For example, /usr and the swap should be on different disks if at all possible (unless you have outrageous amounts of RAM).

Another wrinkle is that most modern disks use `zone sectoring'. The upshot is that outside partitions will have higher bandwidth than inner ones (there is more data under the heads per revolution). So if you need a work area for data streaming (say, a CD-R pre-image to record), it should go on an outside (early numbered) partition of a fast-rotating disk. Conversely, it's a good convention to put rarely-used, performance-uncritical partitions on the inside (last).

Ah yes, caches. There are three major points where you could cache I/O buffers: the OS, the controller card or chip in your machine, and the on-disk controller. Intelligent OS caching is by far the biggest win, for many reasons. RAM caches on controller cards are pretty pointless these days; you shouldn't pay extra for them, and experiment with disabling them if you're into tinkering.

RAM caches on the drives themselves are a mixed bag. At moderate size (1-2MB), they are a potential big win for Windows 95/98, because Windows has stupid VM and I/O drivers. If you run a true multi-tasking OS like Linux, having unified RAM caches on the disk is a significant loss, since the overlapping I/O threads kick each other out of the cache, and the disk ends up performing work for nothing.

Most high-performance disks can be reconfigured (using mode page parameters, see above) to have `segmented' caches (sort of like a set-associative memory cache). With that configured properly, the RAM caches can be a moderate win, not because caching is so great on the disk (it's much better in the OS), but because it allows the disk controller more flexibility to reschedule its I/O request queue. You won't really notice it unless you routinely have >2 I/O requests pending at the SCSI level. The conventional wisdom (try it both ways) applies.

And finally I do have to make a disclaimer. Much of the stuff above is shameless simplification. In reality, high-performance disks are very complicated beasties. They run little mini-operating systems that are most comfortable if they have 10-20 I/O requests pending at the same time. Under those circumstances, the amortized global latencies are much reduced, though any single request may experience longer latencies than if it were the only one pending. The only really valid analysis are stochastic-process models, which we really don't want to get into here. :-)