3.6 Final Tuning with /proc

As mentioned earlier, the '/proc' file system is not really a file system at all, but a window on the running kernel. It contains handles that can be used to extract information from the kernel or, in some cases, change parameters deep inside the kernel. In this section, we discuss several of the most important parameters for Beowulfs. A multitude of Linux Web pages are dedicated to tuning the kernel and important daemons, with the goal of serving a few more Web pages per second. A good place to get started is linuxperf.nl.linux.org. Many Linux users take it as a personal challenge to tune the kernel sufficiently so their machine is faster than every other operating system in the world.

However, before diving in, some perspective is in order. Remember that in a properly configured Beowulf node, nearly all of the available CPU cycles and memory are devoted to the scientific application. As mentioned earlier, the Linux operating system will perform admirably with absolutely no changes. Trimming down the kernel and removing unneeded daemons and processes provides slightly more room for the host application. Tuning up the remaining very small kernel can further refine the results. Occasionally, a performance bottleneck can be dislodged with some simple kernel tuning. However, unless performance is awry, tinkering with parameters in '/proc' will more likely yield a little extra performance and a fascinating look at the interaction between Linux and the scientific application than incredible speed increases.

Now for a look at the Ethernet device:

% cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
    lo:363880104 559348 0 0 0 0 0 0 363880104 559348 0 0 0 0 0 0
  eth0:1709724751 195793854 0 0 357 0 0 0 4105118568 202431445 0 0 0 0 481 0
  brg0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The output is raw columnar data and a bit hard to read; '/sbin/ifconfig' presents the same counters in a more readable format. One set of important values is the total bytes and the total packets sent or received on an interface. Sometimes a little basic scientific observation and data gathering can go a long way. Are the numbers reasonable? Is application traffic using the correct interface? You may need to tune the default route to use a high-speed interface in favor of a 10BaseT Ethernet. Is something flooding your network? What is the size of the average packet? Another key set of values is the counts for collisions (colls), errs, drop, and frame. All of those values represent some degree of inefficiency in the Ethernet. Ideally, they will all be zero. A couple of dropped packets is usually nothing to fret about, but should those values grow at the rate of several per second, some serious problems are likely. The collisions count will naturally be nonzero if traffic goes through an Ethernet hub rather than an Ethernet switch. High collision rates for hubs are expected; that's why they are less expensive.
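
Questions such as "What is the size of the average packet?" and "Are the error counts nonzero?" can be answered with a few lines of awk. The following is a minimal sketch rather than a standard utility: the function name net_summary is our own invention, and it assumes the 16-counter layout shown in the example output.

```shell
#!/bin/sh
# Sketch: summarize per-interface counters from /proc/net/dev-style input.
# net_summary is an illustrative name, not an existing tool.
# Assumes 8 receive counters followed by 8 transmit counters per interface.
net_summary() {
    # Read the file named in $1, or /proc/net/dev if no argument is given.
    awk '
    NR > 2 {                          # skip the two header lines
        sub(/:/, " ")                 # "eth0:1709..." becomes "eth0 1709..."
        iface = $1
        rx_bytes = $2;  rx_pkts = $3;  rx_errs = $4;  rx_drop = $5
        tx_bytes = $10; tx_pkts = $11; tx_errs = $12; tx_drop = $13
        colls = $15
        pkts = rx_pkts + tx_pkts
        avg = 0
        if (pkts > 0) avg = (rx_bytes + tx_bytes) / pkts
        printf "%s avg_pkt_bytes=%.1f errs=%d drop=%d colls=%d\n",
               iface, avg, rx_errs + tx_errs, rx_drop + tx_drop, colls
    }' "${1:-/proc/net/dev}"
}
```

Calling net_summary with no argument reads '/proc/net/dev' directly. The counters are cumulative since boot, and on 32-bit kernels the byte counters can wrap, so comparing two snapshots taken a short interval apart is usually more trustworthy than a single reading.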

Tunable kernel parameters are in '/proc/sys'. Network parameters are generally in '/proc/sys/net'. Many parameters can be changed. Some administrators tweak a Beowulf kernel by modifying parameters such as tcp_sack, tcp_timestamps, tcp_window_scaling, rmem_default, rmem_max, wmem_default, or wmem_max. The exact changes and values depend on the kernel version and the networking configuration: for example, a private network protected from denial-of-service attacks, or a public network where each node must guard against SYN flooding and the like. You are encouraged to peruse the documentation available at www.linuxhq.com and other places where kernel documentation or source is freely distributed, to learn all the details pertaining to your system. Section 5.5 discusses some of these networking parameters in more detail.
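
Before changing any of these parameters, it is worth recording their current values so that a change can be undone. The sketch below prints the contents of named handles under a directory. The function name show_params is hypothetical, and the directories in the usage comment ('/proc/sys/net/ipv4' and '/proc/sys/net/core') are where these handles usually live; confirm the layout on your own kernel.

```shell
#!/bin/sh
# Sketch: print "name = value" for each named handle under a directory.
# show_params is an illustrative name, not a standard command.
show_params() {
    dir=$1
    shift
    for name in "$@"; do
        if [ -r "$dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            # Handles come and go between kernel versions.
            printf '%s = (not present)\n' "$name"
        fi
    done
}
# Example calls (assumed locations; check your kernel):
#   show_params /proc/sys/net/ipv4 tcp_sack tcp_timestamps tcp_window_scaling
#   show_params /proc/sys/net/core rmem_default rmem_max wmem_default wmem_max
```

Saving this output to a file before and after each experiment gives a simple record of exactly which settings produced which results.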

With regard to memory, the meminfo handle provides many useful data points:

% cat /proc/meminfo
MemTotal:      1032828 kB
MemFree:         24916 kB
Buffers:        114836 kB
Cached:         436588 kB
SwapCached:      58796 kB
Active:         720008 kB
Inactive:       210888 kB
HighTotal:      130496 kB
HighFree:         2016 kB
LowTotal:       902332 kB
LowFree:         22900 kB
SwapTotal:      530136 kB
SwapFree:       389816 kB
Dirty:              64 kB
Writeback:           0 kB
Mapped:         390116 kB
Slab:            57136 kB
Committed_AS:   761696 kB
PageTables:       7636 kB
ReverseMaps:    202527

In the example output, the system has 1 gigabyte of RAM, of which about 114 megabytes are allocated for buffers and about 25 megabytes are free. The handles in '/proc/sys/vm' can be used to tune the memory system, but their use depends on the kernel version, since the handles change frequently.
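
When only one or two of these values matter, a small helper can pull a single field out of the listing. This is a sketch under the assumption that each line has the form "Name: value kB", as in the example output; the function name mem_mb is our own.

```shell
#!/bin/sh
# Sketch: print one meminfo field, converted from kB to whole megabytes.
# mem_mb is an illustrative name. Usage: mem_mb FIELD [FILE]
mem_mb() {
    # Match the field name including its trailing colon, e.g. "MemFree:".
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1024 }' \
        "${2:-/proc/meminfo}"
}
# Example calls:
#   mem_mb MemFree
#   mem_mb Buffers
```

The result is truncated to a whole number of megabytes, which is enough precision for spotting a node that is short on memory.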

As with networking and virtual memory, there are many '/proc' handles for tuning or probing the file system. A node spawning many tasks can use many file handles. A standard ssh to a remote machine, where the connection is maintained rather than dropped, requires four file handles. The number of file handles permitted can be displayed with the command

% cat /proc/sys/fs/file-max
4096

The command for a quick look at the current system is

% cat /proc/sys/fs/file-nr
1157 728 4096

This shows the high-water mark (in this case, we have nothing to worry about), the current number of handles in use, and the max.
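
To judge how close a node is to the limit, the first field can be expressed as a percentage of the last. The sketch below follows the field order described above; the function name fh_usage is hypothetical, and the meaning of the middle field has varied between kernel versions, so it is deliberately ignored here.

```shell
#!/bin/sh
# Sketch: report the first file-nr field as a percentage of the maximum.
# fh_usage is an illustrative name. Usage: fh_usage [FILE]
fh_usage() {
    awk '{
        pct = 0
        if ($3 > 0) pct = 100 * $1 / $3
        printf "allocated=%d max=%d pct=%.1f\n", $1, $3, pct
    }' "${1:-/proc/sys/fs/file-nr}"
}
```

A percentage that creeps toward 100 during a large job is a hint that the limit deserves attention before tasks begin to fail.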

Once again, a simple echo command can increase the limit:

% echo 8192 > /proc/sys/fs/file-max

The utility '/sbin/hdparm' is especially handy at querying, testing, and even setting hard disk parameters:

% /sbin/hdparm -i /dev/hda

/dev/hda:

 Model=DW CDW01A0 A , FwRev=500.B550, SerialNo=DWW-AMC1211431 9
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
 BuffType=3(DualPortCache), BuffSize=2048kB, MaxMultSect=16, MultSect=8
 DblWordIO=no, maxPIO=2(fast), DMA=yes, maxDMA=0(slow)
 CurCHS=17475/15/63, CurSects=16513875, LBA=yes
 LBA CHS=512/511/63 Remapping, LBA=yes, LBAsects=19541088
 tDMA={min:120,rec:120}, DMA modes: mword0 mword1 mword2
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, PIO modes: mode3 mode4
 UDMA modes: mode0 mode1 *mode2 }

With a simple disk test,

% /sbin/hdparm -t /dev/hda1

/dev/hda1:
Timing buffered disk reads: 64 MB in 20.05 seconds = 3.19 MB/sec

you can understand whether your disk is performing as it should, and as you expect.
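
A single timing run can be misleading, so it helps to repeat the test and average the results. The following sketch averages the MB/sec figure over several runs; it assumes the "Timing buffered disk reads" line format shown above, and the function name hdparm_mean is our own.

```shell
#!/bin/sh
# Sketch: average the MB/sec figure from repeated "hdparm -t" runs.
# hdparm_mean is an illustrative name; it reads hdparm output on stdin
# or from any files named as arguments.
hdparm_mean() {
    awk '/Timing buffered disk reads/ {
             sum += $(NF - 1)     # the number just before "MB/sec"
             n++
         }
         END {
             if (n > 0) printf "runs=%d mean=%.2f MB/sec\n", n, sum / n
         }' "$@"
}
# Example call (the device name is taken from the text; run as root):
#   for i in 1 2 3; do /sbin/hdparm -t /dev/hda1; done | hdparm_mean
```

Running the test on an otherwise idle node keeps the individual measurements comparable from one run to the next.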

Finally, some basic parameters of the kernel can be displayed or modified. The directory '/proc/sys/kernel' contains the corresponding handles. For some message-passing codes, the key may be '/proc/sys/kernel/shmmax', which can be used to get or set the maximum size of shared-memory segments. For example,

% cat /proc/sys/kernel/shmmax
33554432

shows that the largest shared-memory segment available is 32 megabytes. Especially on an SMP, some messaging layers may use shared-memory segments to pass messages within a node, and for some systems and applications 32 megabytes may be too small.
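
A job script can check this limit before launching an application that needs large segments. The sketch below compares shmmax against a requirement given in megabytes. The function name shm_fits is hypothetical, and the requirement itself must come from your messaging layer's documentation.

```shell
#!/bin/sh
# Sketch: succeed if shmmax is at least NEED_MB megabytes.
# shm_fits is an illustrative name. Usage: shm_fits NEED_MB [FILE]
shm_fits() {
    awk -v need="$1" '
        NR == 1 { ok = ($1 >= need * 1048576) }   # 1 MB = 1048576 bytes
        END     { exit ok ? 0 : 1 }
    ' "${2:-/proc/sys/kernel/shmmax}"
}
# Example call:
#   shm_fits 64 || echo "shmmax is below 64 megabytes"
```

With the value shown above, a requirement of 32 megabytes passes and a requirement of 64 megabytes fails.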

All of these examples are merely quick forays into the world of '/proc'. Naturally, there are many, many more statistics and handles in '/proc' than can be viewed in this quick overview. You are encouraged to look on the Web for more complete documentation and to explore the Linux source—the definitive answer to the question "What will happen if I change this?" A caveat is warranted: You can make your Beowulf node perform worse as a result of tampering with kernel parameters. Good science demands data collection and repeatability. Both will go a long way toward ensuring that kernel performance increases, rather than decreases.




Part III: Managing Clusters