Section 12.5. Archive and Compression Utilities

When installing or upgrading software on Unix systems, the first things you need to be familiar with are the tools used for compressing and archiving files. Dozens of such utilities are available. Some of these (such as tar and compress) date back to the earliest days of Unix; others (such as gzip and the even newer bzip2) are relative newcomers. The main goal of these utilities is to archive files (that is, to pack many files together into a single file for easy transportation or backup) and to compress files (to reduce the amount of disk space required to store a particular file or set of files).

In this section, we're going to discuss the most common file formats and utilities you're likely to run into. For instance, a near-universal convention in the Unix world is to transport files or software as a tar archive, compressed using compress, gzip, or bzip2. In order to create or unpack these files yourself, you'll need to know the tools of the trade. The tools are most often used when installing new software or creating backups, the subject of the following two sections in this chapter. Packages coming from other worlds, such as the Windows or Java world, are often archived and compressed using the zip utility; you can unpack these with the unzip command, which should be available in most Linux installations.^[*]

^[*] Notice that despite the similarity in names, zip on the one hand and gzip and bzip2 on the other hand do not have much in common. zip is both a packaging and compression tool, whereas gzip/bzip2 are for compression only; they typically rely on tar for the actual packaging. Their formats are incompatible; you need to use the correct program for unpacking a certain package.

12.5.1. Using gzip and bzip2

gzip is a fast and efficient compression program distributed by the GNU project. The basic function of gzip is to take a file, compress it, save the compressed version as filename.gz, and remove the original, uncompressed file. The original file is removed only if gzip is successful; it is very difficult to accidentally delete a file in this manner. Of course, being GNU software, gzip has more options than you want to think about, and many aspects of its behavior can be modified using command-line options.

First, let's say that we have a large file named garbage.txt:

    rutabaga$ ls -l garbage.txt
    -rw-r--r--   1 mdw      hack       312996 Nov 17 21:44 garbage.txt

To compress this file using gzip, we simply use the command:

    gzip garbage.txt

This replaces garbage.txt with the compressed file garbage.txt.gz. What we end up with is the following:

    rutabaga$ gzip garbage.txt
    rutabaga$ ls -l garbage.txt.gz
    -rw-r--r--   1 mdw      hack       103441 Nov 17 21:44 garbage.txt.gz

Note that garbage.txt is removed when gzip completes.

You can give gzip a list of filenames; it compresses each file in the list, storing each with a .gz extension. (Unlike the zip program for Unix and MS-DOS systems, gzip will not, by default, compress several files into a single .gz archive. That's what tar is for; see the next section.)
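As a minimal sketch of the multiple-file case, the following commands can be tried in a scratch directory. The names a.txt and b.txt are made up for illustration, and the variable work is only a convenience for this example:

```shell
# Work in a throwaway directory so nothing real is touched.
work=$(mktemp -d)
cd "$work" || exit 1

# Two small sample files (hypothetical names).
printf 'first file\n'  > a.txt
printf 'second file\n' > b.txt

# gzip compresses each argument separately and removes the originals.
gzip a.txt b.txt

# The directory now holds one .gz file per input file.
ls
```

Each input ends up as its own compressed file; nothing is merged into a single archive.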

How efficiently a file is compressed depends on its format and contents. For example, many graphics file formats (such as PNG and JPEG) are already well compressed, and gzip will have little or no effect upon such files. Files that compress well usually include plain-text files and binary files, such as executables and libraries. You can get information on a gzipped file using gzip -l. For example:

    rutabaga$ gzip -l garbage.txt.gz
    compressed  uncompr. ratio uncompressed_name
       103115    312996  67.0% garbage.txt

To get our original file back from the compressed version, we use gunzip, as in the following:

    gunzip garbage.txt.gz

After doing this, we get:

    rutabaga$ gunzip garbage.txt.gz
    rutabaga$ ls -l garbage.txt
    -rw-r--r--   1 mdw      hack       312996 Nov 17 21:44 garbage.txt

which is identical to the original file. Note that when you gunzip a file, the compressed version is removed once the uncompression is complete. Instead of using gunzip, you can also use gzip -d (e.g., if gunzip happens not to be installed).
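If you want to convince yourself that the round trip is lossless, a sketch along the following lines may help. It assumes the seq and cmp utilities are available (they are on practically every Linux system); reference.txt is a hypothetical name for the comparison copy:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Sample data: the numbers 1 through 5000, one per line.
seq 1 5000 > garbage.txt
cp garbage.txt reference.txt      # keep a copy for comparison

gzip garbage.txt                  # leaves only garbage.txt.gz
gzip -d garbage.txt.gz            # same effect as gunzip garbage.txt.gz

# cmp is silent and returns success when the two files are identical.
cmp garbage.txt reference.txt && echo "round trip OK"
```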

gzip stores the name of the original, uncompressed file in the compressed version. This way, if the compressed filename (including the .gz extension) is too long for the filesystem type (say, you're compressing a file on an MS-DOS filesystem with 8.3 filenames), the original filename can be restored using gunzip even if the compressed file had a truncated name. To uncompress a file to its original filename, use the -N option with gunzip. To see the value of this option, consider the following sequence of commands:

    rutabaga$ gzip garbage.txt
    rutabaga$ mv garbage.txt.gz rubbish.txt.gz

If we were to gunzip rubbish.txt.gz at this point, the uncompressed file would be named rubbish.txt, after the new (compressed) filename. However, with the -N option, we get the following:

    rutabaga$ gunzip -N rubbish.txt.gz
    rutabaga$ ls -l garbage.txt
    -rw-r--r--   1 mdw      hack       312996 Nov 17 21:44 garbage.txt

gzip and gunzip can also compress or uncompress data from standard input and output. If gzip is given no filenames to compress, it attempts to compress data read from standard input. Likewise, if you use the -c option with gunzip, it writes uncompressed data to standard output. For example, you could pipe the output of a command to gzip to compress the output stream and save it to a file in one step:

    rutabaga$ ls -laR $HOME | gzip > filelist.gz

This will produce a recursive directory listing of your home directory and save it in the compressed file filelist.gz. You can display the contents of this file with the command:

    rutabaga$ gunzip -c filelist.gz | more

This will uncompress filelist.gz and pipe the output to the more command. When you use gunzip -c, the file on disk remains compressed.

The zcat command is identical to gunzip -c. You can think of this as a version of cat for compressed files. Linux even has a version of the pager less for compressed files, called zless.
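The following sketch puts the standard input and output behavior together. The file name numbers.gz and the shell variables are chosen only for this example:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Compress a command's output on the fly; no uncompressed file is written.
seq 1 1000 | gzip > numbers.gz

# Read it back without touching the file on disk.
lines=$(gunzip -c numbers.gz | wc -l)
echo "lines: $lines"

# gzip -dc is an equivalent spelling of gunzip -c.
last=$(gzip -dc numbers.gz | tail -n 1)
echo "last line: $last"
```

After these commands, numbers.gz is still present and still compressed; only the data flowing through the pipes was uncompressed.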

When compressing files, you can use one of the options -1 through -9 to specify the speed and quality of the compression used. -1 (also --fast) specifies the fastest method, which compresses the files less compactly, and -9 (also --best) uses the slowest, but best compression method. If you don't specify one of these options, the default is -6. None of these options has any bearing on how you use gunzip; gunzip will be able to uncompress the file no matter what speed option you use.
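To get a feeling for the trade-off on your own data, you might compare the two extremes as in the sketch below. The sample file is generated with seq purely for illustration, and the sizes you see will depend on the data and on your version of gzip:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

seq 1 50000 > sample.txt

# -c writes to standard output, so sample.txt itself is left in place.
gzip -1 -c sample.txt > fast.gz
gzip -9 -c sample.txt > best.gz

# Byte counts of the original and the two compressed versions.
wc -c sample.txt fast.gz best.gz

# Either file uncompresses to the same data.
gunzip -c fast.gz | cmp - sample.txt && echo "fast.gz OK"
gunzip -c best.gz | cmp - sample.txt && echo "best.gz OK"
```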

Measured against the more than three decades of Unix history, gzip is a relative newcomer. The compression programs traditionally used on most Unix systems are compress and uncompress, which were included in the original Berkeley versions of Unix. compress and uncompress are very much like gzip and gunzip, respectively; compress saves compressed files as filename.Z as opposed to filename.gz, and uses a slightly less efficient compression algorithm.

However, the free software community has been moving to gzip for several reasons. First of all, gzip works better. Second, there has been a patent dispute over the compression algorithm used by compress, the results of which could prevent third parties from implementing the compress algorithm on their own. Because of this, the Free Software Foundation urged a move to gzip, which at least the Linux community has embraced. gzip has been ported to many architectures, and many others are following suit. Happily, gunzip is able to uncompress the .Z format files produced by compress.

Another compression/decompression program has also emerged to take the lead from gzip. bzip2 is the newest kid on the block and sports even better compression (on the average about 10% to 20% better than gzip), at the expense of longer compression times. You cannot use bunzip2 to uncompress files compressed with gzip and vice versa, and because you cannot expect everybody to have bunzip2 installed on their machine, you might want to confine yourself to gzip for the time being if you want to send the compressed file to somebody else. However, it pays to have bzip2 installed because more and more FTP servers now provide bzip2-compressed packages in order to conserve disk space and bandwidth. It is not unlikely that in a few years from now, gzip will be as uncommon in the Linux world as compress is today. You can recognize bzip2-compressed files by their .bz2 filename extension.

Although the command-line options of bzip2 are not exactly the same as those of gzip, those that have been described in this section are. For more information, see the bzip2(1) manual page.
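As a sketch of how closely the two programs mirror each other, the following repeats the earlier gzip steps with bzip2. It assumes bzip2 and bunzip2 are installed and simply skips the demonstration otherwise; the file names are again made up:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

seq 1 5000 > garbage.txt
cp garbage.txt reference.txt

if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -9 garbage.txt                     # leaves garbage.txt.bz2
    bzip2 -dc garbage.txt.bz2 | tail -n 1    # print the last line; file stays compressed
    bunzip2 garbage.txt.bz2                  # restores garbage.txt
    cmp garbage.txt reference.txt && echo "round trip OK"
else
    echo "bzip2 is not installed; skipping"
fi
```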

The bottom line is that you should use gzip/gunzip or bzip2/bunzip2 for your compression needs. If you encounter a file with the extension .Z, it was probably produced by compress, and gunzip can uncompress it for you.

Earlier versions of gzip used .z (lowercase) instead of .gz as the compressed-filename extension. Because of the potential confusion with .Z, this was changed. At any rate, gunzip retains backward compatibility with a number of filename extensions and file types.

12.5.2. Using tar

tar is a general-purpose archiving utility capable of packing many files into a single archive file, while retaining information needed to restore the files fully, such as file permissions and ownership. The name tar stands for tape archive because the tool was originally used to archive files as backups on tape. However, use of tar is not at all restricted to making tape backups, as we'll see.

The format of the tar command is:

    tar functionoptions files...

where function is a single letter indicating the operation to perform, options is a list of (single-letter) options to that function, and files is the list of files to pack or unpack in an archive. (Note that function is not separated from options by any space.)

function can be one of the following:

c: To create a new archive
x: To extract files from an archive
t: To list the contents of an archive
r: To append files to the end of an archive
u: To update files that are newer than those in the archive
d: To compare files in the archive to those in the filesystem

You'll rarely use most of these functions; the more commonly used are c, x, and t.

The most common options are

k: To keep any existing files when extracting; that is, to not overwrite any existing files that are contained within the tar file.
f filename: To specify that the tar file to be read or written is filename.
z: To specify that the data to be written to the tar file should be compressed or that the data in the tar file is compressed with gzip.
j: Like z, but uses bzip2 instead of gzip; works only with newer versions of tar. Some intermediate versions of tar used I instead; older ones don't support bzip2 at all.
v: To make tar show the files it is archiving or restoring. It is good practice to use this so that you can see what actually happens (unless, of course, you are writing shell scripts).

There are others, which we cover later in this section.

Although the tar syntax might appear complex at first, in practice it's quite simple. For example, say we have a directory named mt, containing these files:

    rutabaga$ ls -l mt
    total 37
    -rw-r--r--   1 root     root           24 Sep 21  2004 Makefile
    -rw-r--r--   1 root     root          847 Sep 21  2004 README
    -rwxr-xr-x   1 root     root         9220 Nov 16 19:03 mt
    -rw-r--r--   1 root     root         2775 Aug  7  2004 mt.1
    -rw-r--r--   1 root     root         6421 Aug  7  2004 mt.c
    -rw-r--r--   1 root     root         3948 Nov 16 19:02 mt.o
    -rw-r--r--   1 root     root        11204 Sep  5  2004 st_info.txt

We wish to pack the contents of this directory into a single tar archive. To do this, we use the command:

    tar cf mt.tar mt

The first argument to tar is the function (here, c, for create) followed by any options. Here, we use the option f mt.tar to specify that the resulting tar archive be named mt.tar. The last argument is the name of the file or files to archive; in this case, we give the name of a directory, so tar packs all files in that directory into the archive.

Note that the first argument to tar must be the function letter and options. Because of this, there's no reason to use a hyphen (-) to precede the options as many Unix commands require. tar allows you to use a hyphen, as in:

    tar -cf mt.tar mt

but it's really not necessary. In some versions of tar, the first letter must be the function, as in c, t, or x. In other versions, the order of letters does not matter.

The function letters as described here follow the so-called "old option style." There is also a newer "short option style" in which you precede the function options with a hyphen, and a "long option style" in which you use long option names with two hyphens. See the Info page for tar for more details if you are interested.

Be careful to remember the filename if you use the cf function letters. Otherwise tar will overwrite the first file in your list of files to pack because it will mistake that for the filename!

It is often a good idea to use the v option with tar; this lists each file as it is archived. For example:

    rutabaga$ tar cvf mt.tar mt
    mt/
    mt/st_info.txt
    mt/README
    mt/mt.1
    mt/Makefile
    mt/mt.c
    mt/mt.o
    mt/mt

If you use v multiple times, additional information will be printed:

    rutabaga$ tar cvvf mt.tar mt
    drwxr-xr-x root/root         0 Nov 16 19:03 2004 mt/
    -rw-r--r-- root/root     11204 Sep  5 13:10 2004 mt/st_info.txt
    -rw-r--r-- root/root       847 Sep 21 16:37 2004 mt/README
    -rw-r--r-- root/root      2775 Aug  7 09:50 2004 mt/mt.1
    -rw-r--r-- root/root        24 Sep 21 16:03 2004 mt/Makefile
    -rw-r--r-- root/root      6421 Aug  7 09:50 2004 mt/mt.c
    -rw-r--r-- root/root      3948 Nov 16 19:02 2004 mt/mt.o
    -rwxr-xr-x root/root      9220 Nov 16 19:03 2004 mt/mt

This is especially useful because it lets you verify that tar is doing the right thing.

In some versions of tar, f must be the last letter in the list of options. This is because tar expects the f option to be followed by a filename: the name of the tar file to read from or write to. If you don't specify f filename at all, tar assumes for historical reasons that it should use the device /dev/rmt0 (that is, the first tape drive). In "Making Backups," in Chapter 27, we talk about using tar in conjunction with a tape drive to make backups.

Now, we can give the file mt.tar to other people, and they can extract it on their own system. To do this, they would use the following command:

    tar xvf mt.tar

This creates the subdirectory mt and places all the original files into it, with the same permissions as found on the original system. The new files will be owned by the user running tar xvf (you) unless you are running as root, in which case the original owner is preserved. The x function letter stands for "extract." The v option is used again here to list each file as it is extracted. This produces:

    courgette% tar xvf mt.tar
    mt/
    mt/st_info.txt
    mt/README
    mt/mt.1
    mt/Makefile
    mt/mt.c
    mt/mt.o
    mt/mt

We can see that tar saves the pathname of each file relative to the location where the tar file was originally created. That is, when we created the archive using tar cf mt.tar mt, the only input filename we specified was mt, the name of the directory containing the files. Therefore, tar stores the directory itself and all the files below that directory in the tar file. When we extract the tar file, the directory mt is created and the files placed into it, which is the exact inverse of what was done to create the archive.
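The whole cycle can be tried safely in a scratch directory. The sketch below uses a small stand-in for the mt directory (two made-up files) and a hypothetical directory named unpacked as the extraction target:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# A stand-in for the mt directory, with two small files.
mkdir mt
printf 'all: mt\n'       > mt/Makefile
printf 'read me first\n' > mt/README

tar cf mt.tar mt          # pack the directory

# Unpack somewhere else; the mt/ prefix is recreated there.
mkdir unpacked
cd unpacked || exit 1
tar xvf ../mt.tar

# diff -r reports nothing when the two trees match.
diff -r ../mt mt && echo "trees match"
```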

By default, tar extracts all tar files relative to the current directory where you execute tar. For example, if you were to pack up the contents of your /bin directory with the command:

    tar cvf bin.tar /bin

tar would give the warning:

    tar: Removing leading / from absolute pathnames in the archive.

What this means is that the files are stored in the archive within the subdirectory bin. When this tar file is extracted, the directory bin is created in the working directory of tar, not as /bin on the system where the extraction is being done. This is very important and is meant to prevent terrible mistakes when extracting tar files. Otherwise, extracting a tar file packed as, say, /bin would trash the contents of your /bin directory when you extracted it.^[*] If you really wanted to extract such a tar file into /bin, you would extract it from the root directory, /. You can override this behavior using the P option when packing tar files, but it's not recommended you do so.

^[*] Some (older) implementations of Unix (e.g., Sinix and Solaris) do just that.
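You can observe the stripping of the leading slash without going anywhere near /bin. The sketch below archives a scratch file by its absolute name and then lists what was actually stored. It assumes a tar that strips leading slashes by default, as GNU tar does; the exact wording of the warning differs between versions:

```shell
work=$(mktemp -d)          # an absolute path such as /tmp/tmp.XXXXXX
printf 'hello\n' > "$work/file.txt"

# Archive the file by its absolute name. tar may print a note on
# standard error about removing the leading slash.
tar cf "$work/abs.tar" "$work/file.txt"

# List the member names as stored in the archive.
stored=$(tar tf "$work/abs.tar")
echo "$stored"
```

The name printed by the last command should be the same path without its leading slash, which is why extraction lands below the current directory.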

Another way to create the tar file mt.tar would have been to cd into the mt directory itself, and use a command such as:

    tar cvf mt.tar *

This way the mt subdirectory would not be stored in the tar file; when extracted, the files would be placed directly in your current working directory. One fine point of tar etiquette is to always pack tar files so that they have a subdirectory at the top level, as we did in the first example with tar cvf mt.tar mt. Therefore, when the archive is extracted, the subdirectory is also created and any files placed there. This way you can ensure that the files won't be placed directly in your current working directory; they will be tucked out of the way and prevent confusion. This also saves the person doing the extraction the trouble of having to create a separate directory (should they wish to do so) to unpack the tar file. Of course, there are plenty of situations where you wouldn't want to do this. So much for etiquette.

When creating archives, you can, of course, give tar a list of files or directories to pack into the archive. In the first example, we have given tar the single directory mt, but in the previous paragraph we used the wildcard *, which the shell expands into the list of filenames in the current directory.

Before extracting a tar file, it's usually a good idea to take a look at its table of contents to determine how it was packed. This way you can determine whether you do need to create a subdirectory yourself where you can unpack the archive. A command such as:

    tar tvf tarfile

lists the table of contents for the named tarfile. Note that when using the t function, only one v is required to get the long file listing, as in this example:

    courgette% tar tvf mt.tar
    drwxr-xr-x root/root         0 Nov 16 19:03 2004 mt/
    -rw-r--r-- root/root     11204 Sep  5 13:10 2004 mt/st_info.txt
    -rw-r--r-- root/root       847 Sep 21 16:37 2004 mt/README
    -rw-r--r-- root/root      2775 Aug  7 09:50 2004 mt/mt.1
    -rw-r--r-- root/root        24 Sep 21 16:03 2004 mt/Makefile
    -rw-r--r-- root/root      6421 Aug  7 09:50 2004 mt/mt.c
    -rw-r--r-- root/root      3948 Nov 16 19:02 2004 mt/mt.o
    -rwxr-xr-x root/root      9220 Nov 16 19:03 2004 mt/mt

No extraction is being done here; we're just displaying the archive's table of contents. We can see from the filenames that this file was packed with all files in the subdirectory mt, so that when we extract the tar file, the directory mt will be created and the files placed there.
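When an archive is large, scanning the full listing for its top-level layout gets tedious. One possible shortcut, sketched here with two made-up archives named tidy.tar and flat.tar, is to reduce the listing to its distinct first path components:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

mkdir mt
printf 'a\n' > mt/README
printf 'b\n' > mt/mt.c

tar cf tidy.tar mt                    # packed with a top-level directory
(cd mt && tar cf ../flat.tar *)       # packed from inside the directory

# Show the distinct first path components of each archive.
tidy_top=$(tar tf tidy.tar | cut -d/ -f1 | sort -u)
flat_top=$(tar tf flat.tar | cut -d/ -f1 | sort -u)

echo "tidy.tar top level: $tidy_top"
echo "flat.tar top level:" $flat_top
```

A single name in the result means the archive unpacks into one subdirectory; several names mean you probably want to create a directory yourself before extracting.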

You can also extract individual files from a tar archive. To do this, use the command:

    tar xvf tarfile files

where files is the list of files to extract. As we've seen, if you don't specify any files, tar extracts the entire archive.

When specifying individual files to extract, you must give the full pathname as it is stored in the tar file. For example, if we wanted to grab just the file mt.c from the previous archive mt.tar, we'd use the command:

    tar xvf mt.tar mt/mt.c

This would create the subdirectory mt and place the file mt.c within it.
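Here is a small sketch of extracting a single member, again using a stand-in mt directory and a hypothetical target directory named unpacked:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

mkdir mt
printf 'int main(void) { return 0; }\n' > mt/mt.c
printf 'read me first\n'                > mt/README
tar cf mt.tar mt

mkdir unpacked
cd unpacked || exit 1

# Name the member exactly as tar tvf shows it, including the mt/ prefix.
tar xvf ../mt.tar mt/mt.c

# Only mt.c is present; README was left in the archive.
ls mt
```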

tar has many more options than those mentioned here. These are the features that you're likely to use most of the time, but GNU tar, in particular, has extensions that make it ideal for creating backups and the like. See the tar manual page and the following section for more information.

12.5.3. Using tar with gzip and bzip2

tar does not compress the data stored in its archives in any way. If you are creating a tar file from three 200 K files, you'll end up with an archive of about 600 K. It is common practice to compress tar archives with gzip (or the older compress program). You could create a gzipped tar file using the commands:

    tar cvf tarfile files...
    gzip -9 tarfile

But that's so cumbersome, and requires you to have enough space to store the uncompressed tar file before you gzip it.

A much trickier way to accomplish the same task is to use an interesting feature of tar that allows you to write an archive to standard output. If you specify - as the tar file to read or write, the data will be read from or written to standard input or output. For example, we can create a gzipped tar file using the command:

    tar cvf - files... | gzip -9 > tarfile.tar.gz

Here, tar creates an archive from the named files and writes it to standard output; next, gzip reads the data from standard input, compresses it, and writes the result to its own standard output; finally, we redirect the gzipped tar file to tarfile.tar.gz.

We could extract such a tar file using the command:

    gunzip -c tarfile.tar.gz | tar xvf -

gunzip uncompresses the named archive file and writes the result to standard output, which is read by tar on standard input and extracted. Isn't Unix fun?
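The two pipelines can be tried end to end as follows. This is only a sketch; the directory and file names are invented for the example, and the extraction runs in a subshell so that the current directory does not change:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

mkdir mt
seq 1 2000 > mt/numbers.txt

# tar writes the archive to standard output; gzip compresses the stream.
tar cf - mt | gzip -9 > mt.tar.gz

# Reverse the pipeline in a separate directory.
mkdir unpacked
gunzip -c mt.tar.gz | (cd unpacked && tar xf -)

cmp mt/numbers.txt unpacked/mt/numbers.txt && echo "pipeline round trip OK"
```

At no point does an uncompressed mt.tar exist on disk.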

Of course, both commands are rather cumbersome to type. Luckily, the GNU version of tar provides the z option, which automatically creates or extracts gzipped archives. (We saved the discussion of this option until now, so you'd truly appreciate its convenience.) For example, we could use the commands:

    tar cvzf tarfile.tar.gz files...

and

    tar xvzf tarfile.tar.gz

to create and extract gzipped tar files. Note that you should name the files created in this way with the .tar.gz filename extensions (or the equally often used .tgz, which also works on systems with limited filename capabilities) to make their format obvious. The z option works just as well with other tar functions, such as t.

Only the GNU version of tar supports the z option; if you are using tar on another Unix system, you may have to use one of the longer commands to accomplish the same tasks. Nearly all Linux systems use GNU tar.
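The sketch below runs through the z option with the three common functions and then shows that the result is an ordinary gzip stream, so the longer pipeline form can still read it. It assumes a tar that supports z; the names are made up:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

mkdir mt
seq 1 2000 > mt/numbers.txt

tar czf mt.tar.gz mt        # create a gzipped archive in one step
tar tzf mt.tar.gz           # list its contents

mkdir unpacked
cd unpacked || exit 1
tar xzf ../mt.tar.gz        # extract it

# The file is a plain gzip stream, so gunzip -c can read it as well.
gunzip -c ../mt.tar.gz | tar tf -
```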

When you want to use tar in conjunction with bzip2, you need to tell tar about your compression program preferences, like this:

    tar cvf tarfile.tar.bz2 --use-compress-program=bzip2 files...

Or, shorter:

    tar cvjf tarfile.tar.bz2 files

The last version works only with newer versions of GNU tar that support the j option.

Keeping this in mind, you could write short shell scripts or aliases to handle cookbook tar file creation and extraction for you. Under bash, you could include the following functions in your .bashrc:

    tarc () { tar czvf "$1.tar.gz" "$1"; }
    tarx () { tar xzvf "$1"; }
    tart () { tar tzvf "$1"; }

With these functions, to create a gzipped tar file from a single directory, you could use the command:

    tarc directory

The resulting archive file would be named directory.tar.gz. (Be sure that there's no trailing slash on the directory name; otherwise, the archive will be created as .tar.gz within the given directory.) To list the table of contents of a gzipped tar file, just use

    tart file.tar.gz

Or, to extract such an archive, use:

    tarx file.tar.gz
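If the trailing-slash caveat worries you, one possible variant of these helpers strips a single trailing slash before building the archive name. This is a sketch rather than the only way to do it; the directory name project is made up for the demonstration:

```shell
# The parameter expansion ${1%/} drops one trailing slash, if present,
# before the name is used.
tarc () { tar czvf "${1%/}.tar.gz" "${1%/}"; }
tarx () { tar xzvf "$1"; }
tart () { tar tzvf "$1"; }

# Try them in a scratch directory.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir project
printf 'notes\n' > project/notes.txt

tarc project/               # creates project.tar.gz despite the slash
tart project.tar.gz
```

Quoting "$1" also keeps the functions working for names that contain spaces.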

As a final note, we would like to mention that files created with gzip and/or tar can be unpacked with the well-known WinZip utility on Windows systems. WinZip doesn't have support for bzip2 yet, though. If you, on the other hand, get a file in .zip format, you can unpack it on your Linux system using the unzip command.

12.5.4. tar Tricks

Because tar saves the ownership and permissions of files in the archive and retains the full directory structure, as well as symbolic and hard links, using tar is an excellent way to copy or move an entire directory tree from one place to another on the same system (or even between different systems, as we'll see). Using the - syntax described earlier, you can write a tar file to standard output, which is read and extracted on standard input elsewhere.

For example, say that we have a directory containing two subdirectories: from-stuff and to-stuff. from-stuff contains an entire tree of files, symbolic links, and so forth, something that is difficult to mirror precisely using a recursive cp. To mirror the entire tree beneath from-stuff to to-stuff, we could use the commands:

    cd from-stuff
    tar cf - . | (cd ../to-stuff; tar xvf -)

Simple and elegant, right? We start in the directory from-stuff and create a tar file of the current directory, which is written to standard output. This archive is read by a subshell (the commands contained within parentheses); the subshell does a cd to the target directory, ../to-stuff (relative to from-stuff, that is), and then runs tar xvf, reading from standard input. No tar file is ever written to disk; the data is sent entirely via pipe from one tar process to another. The second tar process has the v option that prints each file as it's extracted; in this way, we can verify that the command is working as expected.
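To see that links survive the trip, you could set up a tiny tree and mirror it, as in the following sketch. The file and link names are invented for the example, and the subshell uses && rather than ; so that the second tar does not run at all if the cd fails:

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# A source tree with a regular file and a symbolic link to it.
mkdir from-stuff to-stuff
printf 'data\n' > from-stuff/file.txt
ln -s file.txt from-stuff/link.txt

cd from-stuff || exit 1
tar cf - . | (cd ../to-stuff && tar xvf -)
cd ..

# The link should arrive as a link, not as a second copy of the file.
ls -l to-stuff
```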

In fact, you could transfer directory trees from one machine to another (via the network) using this trick: just include an appropriate rsh (or ssh) command within the subshell on the right side of the pipe. The remote shell would execute tar to read the archive on its standard input. (Actually, GNU tar has facilities to read or write tar files automatically from other machines over the network; see the tar(1) manual page for details.)