3.4 Scalable Services

Modern operating systems take network connectivity for granted and are almost always configured by default to rely on basic network services for everything from the correct time of day to DNS name resolution. On a large cluster, this reliance can cause performance bottlenecks. Consider a 1,024-node cluster that launches a job while configured to use the campus-wide DNS server for name resolution. Nodes are often configured to perform a reverse lookup whenever a TCP connection is made, so the launch could produce thousands of near-simultaneous requests to a server that may scale poorly. As mentioned earlier, NFS can also fall into this category, usually scaling only to about 64 nodes. NIS, the Network Information System, is another potential bottleneck. It is often used to provide network-shared configuration data, such as password files, and every time a user logs into a node, the computer consults the remote NIS server. It is therefore worth spending a few moments examining which remote services the operating system uses. Many Beowulf builders simply eliminate, wherever possible, the use of remote services such as NIS for synchronizing accounts.
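One way to carry out such an examination is sketched below. It scans the text of an nsswitch.conf-style name service configuration and reports which lookups depend on a source that needs the network. This is only an illustration under stated assumptions: the function name remote_lookups, the REMOTE_SOURCES list, and the sample configuration are hypothetical and are not taken from any particular cluster distribution.

```python
# Sources that normally require a network round trip.
# Assumption: an illustrative list, not an exhaustive one.
REMOTE_SOURCES = {"dns", "nis", "nisplus", "ldap"}


def remote_lookups(nsswitch_text):
    """Return {database: [remote sources]} for nsswitch.conf-style text."""
    findings = {}
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue  # blank or malformed line
        database, _, sources = line.partition(":")
        # Action items such as [NOTFOUND=return] never match a source name.
        remote = [s for s in sources.split() if s.lower() in REMOTE_SOURCES]
        if remote:
            findings[database.strip()] = remote
    return findings


# Hypothetical configuration for a compute node.
SAMPLE = """\
# name service switch configuration
passwd:   files nis
group:    files nis
hosts:    files dns
services: files
"""

if __name__ == "__main__":
    for database, sources in remote_lookups(SAMPLE).items():
        print(f"{database}: depends on remote source(s) {', '.join(sources)}")
```

On a real node, the same function could be given the contents of /etc/nsswitch.conf. Any database it reports is a candidate for being served from local files that are pushed out to the nodes instead.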




Part III: Managing Clusters