Large-Scale SysAdmin

People ask me a lot about how best to administer large sets of machines. I've worked on this problem before, so you would think that I would have lots of answers. Unfortunately, this is an unsolved problem area, and there are no great answers that I can give you. Hopefully that will change someday.

This page is meant to describe the state of the art (from my point of view) in the administration of large groups of machines. These machines may be clusters, independent workstations, labs, or completely heterogeneous machines that just fall under the same administrative domain. I don't do Windows, but some of the tools described below may help you with those beasts as well.

In my mind, at least three major functions are needed. First, you should be able to perform a new install on a system with a very small amount of human intervention. Second, you should be able to keep existing machines patched and deploy new software on them. Third, you should be able to perform large system upgrades on existing systems.

Disk Cloning

The most primitive, and perhaps oldest, technique is to create a prototype system by hand and then back up a copy of that system to be replicated on other systems. A lot of computer vendors do this to create their pre-installed systems. In many cases, the cloning is done by bulk copying the hard disk. Dump/restore or dd are useful for this.
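As a rough illustration of the bulk-copy step, here is a minimal Python sketch of what a dd-style clone does: copy the prototype image block by block and keep a checksum so the clone can be verified afterwards. The function names, chunk size, and file paths are assumptions made up for this example, not part of any tool mentioned here.

```python
import hashlib

CHUNK = 1024 * 1024  # copy 1 MiB at a time, much like dd with bs=1M


def clone_image(src_path, dst_path):
    """Copy src to dst block by block; return the SHA-256 of the data copied."""
    digest = hashlib.sha256()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(CHUNK)
            if not block:
                break
            dst.write(block)
            digest.update(block)
    return digest.hexdigest()


def file_digest(path):
    """Re-read a file and return its SHA-256, used to verify a finished clone."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()
```

Comparing the value returned by clone_image() with file_digest() of the destination catches a truncated or corrupted copy before the machine is put into service.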

After the system is cloned, you will usually need to give it some configuration (IP address, hostname, etc.), but you may be able to automate this with BOOTP or DHCP.
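One way to automate the per-machine settings is to keep a small inventory and generate the DHCP server configuration from it. The sketch below emits host entries in the style of ISC dhcpd; the inventory contents, MAC addresses, hostnames, and function names are all hypothetical and only illustrate the idea.

```python
# Hypothetical inventory: MAC address -> (hostname, IP address).
INVENTORY = {
    "00:16:3e:00:00:01": ("node01", "192.0.2.11"),
    "00:16:3e:00:00:02": ("node02", "192.0.2.12"),
}


def dhcpd_host_entry(mac, hostname, ip):
    """Render one ISC dhcpd-style host declaration for a cloned machine."""
    return (
        f"host {hostname} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f'  option host-name "{hostname}";\n'
        f"}}\n"
    )


def render(inventory):
    """Render declarations for every machine, sorted by MAC for stable output."""
    return "\n".join(
        dhcpd_host_entry(mac, hostname, ip)
        for mac, (hostname, ip) in sorted(inventory.items())
    )


if __name__ == "__main__":
    print(render(INVENTORY))
```

With entries like these, every clone can ship with an identical disk image and pick up its identity from the network at boot.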

JumpStart/KickStart

Sun created the JumpStart system for Solaris boxes. You boot a system off the net (which is trivial with the firmware found in workstations like Suns) and perform an automated install.

Red Hat later created a similar system for Linux machines. Since the average PC and NIC cannot boot off the net, there are tools to make your own CD or floppies to get the system onto the network.

It is my understanding that JumpStart and KickStart are designed for moderately homogeneous environments where the same software set is desired on all machines.

VA Linux has a system called VA System Imager that supports multiple Linux distributions.

CFengine

CFengine is a "software robot" that reads high-level descriptions of machines, or classes of machines, and performs the operations described. CFengine is a good framework, but it shifts the burden from performing actions on machines to documenting those actions in a description language.

Pros: Pretty popular, so it is easy to find documentation and staff who already have experience with it.

Cons: I find it not much more advanced than writing shell scripts to run on your machines. Various people have cobbled together systems for organizing and running groups of scripts on systems. This is just a fancy way to do that. Some people will likely disagree with this assessment.

Packages

Tools like JumpStart and KickStart avoid cloning raw disks by having all vendor-provided software contained in discrete packages that can be installed or not installed. Packages must contain some degree of dependency information so that a user does not install a package that depends on other packages that are not installed.
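To make the role of dependency information concrete, here is a minimal Python sketch that turns a dependency table into an install order and refuses to proceed when a dependency is unknown or circular. The package names and the install_order() function are hypothetical; real package managers are considerably more involved (versions, conflicts, alternatives).

```python
def install_order(wanted, depends):
    """Return packages ordered so that each one comes after its dependencies.

    depends maps a package name to the list of packages it requires.
    """
    order, visiting, done = [], set(), set()

    def visit(pkg):
        if pkg in done:
            return
        if pkg not in depends:
            raise KeyError(f"unknown package: {pkg}")
        if pkg in visiting:
            raise ValueError(f"dependency cycle at {pkg}")
        visiting.add(pkg)
        for dep in depends[pkg]:
            visit(dep)
        visiting.discard(pkg)
        done.add(pkg)
        order.append(pkg)

    for pkg in wanted:
        visit(pkg)
    return order


if __name__ == "__main__":
    # Hypothetical dependency table for illustration only.
    table = {"editor": ["libc", "ncurses"], "ncurses": ["libc"], "libc": []}
    print(install_order(["editor"], table))
```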

Packaging systems usually come with command-line tools for installing and removing packages. These tools can be used by scripts or programs like CFengine.

Debian FAI/dpkg/APT/debconf

The Debian project is primarily a community-developed GNU/Linux distribution (and my favorite for many reasons), but they also have GNU/Hurd distributions and may someday have a GNU/Win32 distribution. To manage distributions, they have a package format (analogous to Red Hat's RPM). Unlike Red Hat, there is a well-documented set of standards that all Debian packages must adhere to. These standards require that Debian packages use some Debian-specific tools to manage various aspects of system configuration and maintenance. The result is a beautifully modular system that functions very well.

In addition, they provide the Advanced Package Tool (APT) to automate the downloading and installation of packages. At installation time, individual packages determine their configurations by querying the debconf database and interfaces. Debconf can be configured to prompt the user (which is how most people use it), or it can be configured to not ask any questions and to get all configuration information out of a pre-defined database. This database file can be generated and distributed by administrators. Debconf is still relatively new and not all Debian packages use it yet, but by the 2.3 release it will hopefully be fully used.
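As an illustration of generating pre-defined answers centrally, the following Python sketch writes lines in the "owner question type value" layout that the debconf-set-selections tool reads on later Debian releases. Whether that layout matches the database file meant above is an assumption, and the package and question names are invented for the example.

```python
def preseed_line(owner, question, qtype, value):
    """Format one answer as 'owner question type value'."""
    return f"{owner} {question} {qtype} {value}"


def render_preseed(answers):
    """Join all answers into the text of a preseed file, one answer per line."""
    return "\n".join(preseed_line(*answer) for answer in answers) + "\n"


# Hypothetical answers; 'examplepkg' and its questions do not exist in Debian.
ANSWERS = [
    ("examplepkg", "examplepkg/server-name", "string", "db01.example.org"),
    ("examplepkg", "examplepkg/enable-daemon", "boolean", "true"),
]

if __name__ == "__main__":
    print(render_preseed(ANSWERS), end="")
```

A file produced this way can be distributed to every machine so that package installation runs without prompting.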

The Fully Automatic Installation project ties everything together to provide KickStart/JumpStart-like unattended installs, but in a flexible way that allows for heterogeneous configurations. It can be used in conjunction with CFengine to maintain systems throughout their lifecycle.

These tools could be ported to other operating systems, but that is a laborious task that has not been accomplished.

Package Depots

Most commercial Unices come with only minimal amounts of software. Most environments install large numbers of other commercial and open-source packages on top of these systems. Prior to the Linux distributions, this software was usually downloaded in tar format and compiled by each site. Often this software is put in a file system such as /usr/local and then NFS mounted by large numbers of systems.

Various sites have created tools to manage large /usr/local filesystems, frequently by installing packages in their own directory trees and then merging them into a /usr/local filesystem by linking or copying files. Some of these tools also provide mechanisms for managing a set of filesystems for multiple architectures.
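The linking approach can be sketched in a few lines of Python: walk a package's private directory tree and create a symlink for each file at the same relative path under the shared tree. This is a simplified illustration, not the method of any particular depot tool; the function name and the conflict policy (fail on any existing file) are assumptions.

```python
import os


def link_package(pkg_root, target_root):
    """Symlink every file under pkg_root into the same relative path under target_root.

    Returns the list of links created. Raises FileExistsError on a conflict
    so that two packages cannot silently overwrite each other's files.
    """
    linked = []
    for dirpath, _dirnames, filenames in os.walk(pkg_root):
        rel = os.path.relpath(dirpath, pkg_root)
        dest_dir = os.path.normpath(os.path.join(target_root, rel))
        os.makedirs(dest_dir, exist_ok=True)
        for name in filenames:
            src = os.path.abspath(os.path.join(dirpath, name))
            dest = os.path.join(dest_dir, name)
            if os.path.lexists(dest):
                raise FileExistsError(f"conflict: {dest}")
            os.symlink(src, dest)
            linked.append(dest)
    return linked
```

Keeping each package in its own tree means removal is just a matter of deleting the links that point into that tree.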

I wrote a tool called Packagelink that does just that, but I am no longer supporting it and you are probably better off looking at some of the following alternatives:

Gutinteg/Packagelink

After using Packagelink for managing a /usr/local filesystem on multiple architectures, I decided that the idea should be extended to managing whole machines. So, I wrote a system of tools collectively known as Gutinteg that lets you specify configurations in a high-level language. Systems can be booted from the network and their disks will be partitioned and formatted and the specified set of software will be installed. Configuration is handled by creating packages that mark their configuration files so they can be run through a preprocessor that uses configuration information from the high-level configuration files.
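The preprocessing idea can be illustrated with a small Python sketch that fills placeholders in a marked configuration file from per-host settings. The $name placeholder syntax, the setting names, and render_config() are assumptions chosen for the example; the marker syntax Gutinteg actually uses is not described here.

```python
from string import Template


def render_config(template_text, settings):
    """Fill $name placeholders from settings.

    Template.substitute() raises KeyError if a placeholder has no value,
    which avoids installing a half-configured file.
    """
    return Template(template_text).substitute(settings)


if __name__ == "__main__":
    # Hypothetical template and host settings for illustration only.
    template = "ServerName $hostname\nListen $port\n"
    print(render_config(template, {"hostname": "web01", "port": 8080}), end="")
```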

Running systems can be re-assimilated on the fly without taking them down. All files on the filesystem will be replaced by the version of that file in the package. This model of administration makes centralized changes easy, but means that changes made directly to machines, rather than to the packages or configuration database, will be lost. This is a hard paradigm shift for some administrators.

Cons: No garbage collection of files installed by packages that no longer exist.
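For what the missing garbage collection would involve, here is a minimal Python sketch, assuming each package has a manifest listing the files it installed: anything owned under the previous configuration but by no package in the new one is a candidate for removal. The manifest layout and function name are hypothetical and do not describe how Gutinteg stores its data.

```python
def stale_files(old_manifests, new_manifests):
    """Return files installed by the old package set that no current package owns.

    Each argument maps a package name to the set of paths it installs.
    """
    old_files = set().union(*old_manifests.values())
    new_files = set().union(*new_manifests.values())
    return sorted(old_files - new_files)


if __name__ == "__main__":
    # Hypothetical manifests: package "b" was dropped from the configuration.
    before = {"a": {"/usr/bin/a"}, "b": {"/usr/bin/b", "/etc/b.conf"}}
    after = {"a": {"/usr/bin/a"}}
    print(stale_files(before, after))
```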

So, what is the answer?

First, think about the following questions:
  1. Do you want to share software, or run a set of systems? If you just want software, stop here and use a package system like a Software Depot, RPM, or dpkg.
  2. Are all the systems going to be administered by the same group of people? If not, stop here and use a package system and share configuration scripts or CFengine files. The individual administrators will have to apply the changes themselves.
  3. Are all the systems going to be homogeneous? If so, then just set up a cloning mechanism and use something like CFengine to manage configuration.
  4. Can all of the systems run Debian? If so, then use FAI and debconf.
  5. If you've gotten this far in the list, then you're in trouble. This is where I was. The Gutinteg system handles this, but I don't have the resources to help you figure out how to use it. You may be able to assemble your own alternative using some off-the-shelf components:

If you have comments or suggestions, please let me know.