Text Utilities

The GNU text utility package includes a large number of utilities for manipulating the contents of text files. These utilities are patterned after the UNIX commands of the same names. The GNU versions of the programs usually have additional options and are optimized for speed. In general, the GNU utilities do not have the arbitrary limitations of their UNIX counterparts.

Table 8-4 briefly describes each text utility. The best way to learn these utilities is to try each one out. Before trying out a program, type info progname or man progname (where progname is the name of the program) to view the online help information.

Cross Ref

A few selected text utilities are described in the following sections. You can find reference pages for many of these utilities in Appendix A.

Table 8-4: GNU Text Utilities
Program	Description
cat	Concatenates files and writes them to standard output
cksum	Prints the cyclic redundancy check (CRC) checksums and byte counts of files (used to verify that files have not been corrupted in transmission, by comparing the cksum output for the received files with the cksum output for the original files)
comm	Compares two sorted files line by line. (The sort command sorts files.)
csplit	Splits a file into sections determined by text patterns in the file and places each section in a separate file named xx00, xx01, and so on
cut	Extracts selected sections (bytes, characters, or fields) from each line of files and writes them to standard output
expand	Converts tabs in each file to spaces and writes the result to standard output
fmt	Fills and joins lines, making each line roughly the same length, and writes the formatted lines to standard output
fold	Breaks lines in a file so that each line is no wider than a specified width, and writes the lines to standard output
head	Prints the first part of files
join	Joins corresponding lines of two files using a common field and writes each line to standard output
md5sum	Computes and checks the MD5 message digest (a 128-bit checksum using the MD5 algorithm)
nl	Numbers each line in a file and writes the lines to standard output
od	Writes the contents of files to standard output in octal and other formats. (This is used to view the contents of binary files.)
paste	Merges corresponding lines of one or more files into vertical columns separated by tabs, and writes each line to standard output
pr	Formats text files for printing
ptx	Produces a permuted index of file contents
sort	Sorts lines of text files
split	Splits a file into pieces
sum	Computes and prints a 16-bit checksum for each file and counts the number of 1,024-byte blocks in the file
tac	Writes each file to standard output, last line first
tail	Prints the last part of files
tr	Translates or deletes characters in files
tsort	Performs a topological sort (used to organize a library for efficient handling by the ar and ld commands)
unexpand	Converts spaces into tabs
uniq	Removes duplicate lines from a sorted file
wc	Prints the number of bytes, words, and lines in files
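
As a quick illustration of how the utilities in Table 8-4 combine through pipes, the following sketch counts how many accounts use each login shell. The sample data and the file name accounts.txt are made up for illustration; on a real system you could point cut at /etc/passwd instead.

```shell
# Create a small passwd-style sample file (hypothetical data).
workdir=$(mktemp -d)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' \
  'alice:x:500:500::/home/alice:/bin/bash' > "$workdir/accounts.txt"

# cut extracts field 7 (the login shell), sort groups identical lines,
# and uniq -c collapses each group into one line prefixed with a count.
shell_counts=$(cut -d: -f7 "$workdir/accounts.txt" | sort | uniq -c)
echo "$shell_counts"

rm -r "$workdir"
```

Sorting before uniq matters here because uniq only collapses duplicate lines that are adjacent.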

Counting Words and Lines in a Text File

Suppose that you want to display the line, word, and character counts of a text file. The wc command does exactly that. Try the following:

wc /etc/inittab
     54     236    1698 /etc/inittab

This causes wc to display the number of lines (54), words (236), and bytes (1698) in the /etc/inittab file; for a plain ASCII file such as this one, the byte count equals the character count. If you simply want to see the number of lines in a file, use the -l option:

wc -l /etc/inittab
     54 /etc/inittab

As you can see, in this case, wc simply displays the line count.
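
The other counts can be selected individually as well. The following sketch uses a throwaway file (sample.txt is a hypothetical name) to show the -l, -w, and -c options; redirecting the file into wc keeps the filename out of the output.

```shell
workdir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$workdir/sample.txt"

# Reading from standard input makes wc print only the number.
lines=$(wc -l < "$workdir/sample.txt")   # number of newline characters
words=$(wc -w < "$workdir/sample.txt")   # number of words
bytes=$(wc -c < "$workdir/sample.txt")   # number of bytes
echo "lines=$lines words=$words bytes=$bytes"

rm -r "$workdir"
```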

If you don’t specify a filename, the wc command reads from the standard input, so you can use the shell’s pipe feature to feed it the output of another command. Suppose that you want a rough count of the processes running on your system. You can list all processes with the ps ax command; instead of counting the lines manually, pipe the output of ps to wc, as follows:

ps ax | wc -l
     65

That means that the ps command has produced 65 lines of output. Because the first line simply shows the headings for the tabular columns, you can estimate that about 64 processes are running on your system. (Of course, this count probably includes the processes used to run the ps and wc commands as well, but who’s counting?)
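
If you want the count without the heading line, one option is to drop the first line before counting. The sketch below wraps that idea in a small helper function; count_body_lines is a name made up for this example.

```shell
# Count every input line except the first (the column headings).
count_body_lines() {
    tail -n +2 | wc -l
}

# A fixed three-line sample: one heading line and two data lines.
sample_count=$(printf 'PID TTY STAT\n1 ? S\n2 ? S\n' | count_body_lines)
echo "$sample_count"

# The same helper applied to the live process list.
ps ax | count_body_lines
```

Here tail -n +2 passes along everything from the second line onward, so wc never sees the headings.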

Sorting Text Files

You can sort the lines in a text file by using the sort command. To see how the sort command works, first type more /etc/passwd to see the current contents of the /etc/passwd file. Now, type sort /etc/passwd to see the lines sorted alphabetically. If you want to sort a file and save the sorted version in another file, you have to use the Bash shell’s output redirection feature, as follows:

sort /etc/passwd > ~/sorted.text

This command sorts the lines in the /etc/passwd file and saves the output in a file named sorted.text in your home directory.
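
The sort command can also sort on a particular field rather than on the whole line. As a sketch, the following sorts passwd-style lines numerically by the third colon-separated field (the user ID); the sample data and the file name accounts.txt are invented for illustration.

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'carol:x:500:500::/home/carol:/bin/bash' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' > "$workdir/accounts.txt"

# -t: sets the field separator, -k3,3 restricts the sort key to field 3,
# and the n suffix compares that field as a number rather than as text.
sorted_names=$(sort -t: -k3,3n "$workdir/accounts.txt" | cut -d: -f1)
echo "$sorted_names"

rm -r "$workdir"
```

Without the n suffix, the user IDs would be compared as text, so a value such as 10 would sort before 9.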

Substituting or Deleting Characters from a File

Another interesting command is tr, which substitutes one group of characters for another (or deletes selected characters) throughout its input. The tr command is useful when you want to move a text file from one operating system to another, because different operating systems use different special characters to mark the end of a line of text.

Suppose that you occasionally have to use MS-DOS text files on your Linux system. Although you might expect to use a text file on any system without any problems, there is one catch: DOS uses a carriage return followed by a line feed to mark the end of each line, whereas Linux (like other UNIX systems) uses only a line feed. Therefore, if you use the vi editor with the -b option to open a DOS text file (for example, type vi -b filename to open the file), you see ^M at the end of each line. That ^M stands for Ctrl-M, which is the carriage-return character.

On your Linux system, you can easily rid the DOS text file of the extra carriage returns by using the tr command with the -d option. Essentially, to convert the DOS text file filename.dos to a Linux text file named filename.linux, type the following:

tr -d '\015' < filename.dos > filename.linux

In this command, '\015' denotes the ASCII code in octal notation for the carriage-return character.
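
To confirm that the conversion worked, you can count the carriage returns before and after. This sketch builds a tiny DOS-style file with printf (the file contents are placeholders) and uses tr -cd to keep only the carriage returns so that wc -c can count them.

```shell
workdir=$(mktemp -d)
# Two lines that end in carriage return + line feed, as DOS writes them.
printf 'first line\r\nsecond line\r\n' > "$workdir/filename.dos"

tr -d '\015' < "$workdir/filename.dos" > "$workdir/filename.linux"

# tr -cd '\015' deletes every character except the carriage return.
cr_before=$(tr -cd '\015' < "$workdir/filename.dos" | wc -c)
cr_after=$(tr -cd '\015' < "$workdir/filename.linux" | wc -c)
echo "before=$cr_before after=$cr_after"

rm -r "$workdir"
```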

The tr command translates or deletes characters in its input. With the -d option, it deletes all occurrences of the specified characters from the input data; the characters to be deleted follow the -d option. Like many UNIX utilities, tr reads the standard input and writes to the standard output. As the sample command shows, you must therefore use input and output redirection to delete all occurrences of a character in a file and save the result in another file.
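
Without the -d option, tr translates characters: each character in the first set is replaced by the character at the same position in the second set. A minimal sketch that converts lowercase letters to uppercase (the input text is arbitrary):

```shell
# Each lowercase letter maps to the uppercase letter in the same position.
shouted=$(echo 'hello, linux' | tr 'a-z' 'A-Z')
echo "$shouted"
```

Characters that are not in the first set, such as the comma and the space, pass through unchanged.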

Splitting a File into Several Smaller Files

The split command is handy when you want to copy a file to a floppy disk but the file is too large to fit on a single floppy. You can then use the split command to break up the file into smaller files, each of which can fit on a floppy.

By default, split puts 1,000 lines into each file and names the files xaa, xab, xac, and so on (the prefix x followed by a group of letters). You can specify a different prefix for the filenames. For example, to split a large file called hugefile.tar into smaller files that fit onto several high-density 3.5-inch floppy disks, use split as follows:

split -b 1440k hugefile.tar part.

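This command splits the hugefile.tar file into 1,440K chunks, the nominal capacity of a high-density floppy disk. The command creates files named part.aa, part.ab, part.ac, and so on. Keep in mind that a floppy formatted with a file system holds slightly less than its raw 1,440K capacity, so you may need a somewhat smaller chunk size if you copy the pieces as ordinary files.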
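
The same naming scheme applies when you split by line count instead of by size. As a small sketch (the file name numbers.txt and the chunk. prefix are arbitrary), the following splits a five-line file into pieces of at most two lines each:

```shell
workdir=$(mktemp -d)
printf '1\n2\n3\n4\n5\n' > "$workdir/numbers.txt"

# -l 2 puts at most two lines in each piece; "chunk." is the prefix.
(cd "$workdir" && split -l 2 numbers.txt chunk.)

# List the pieces that split created, one name per line.
pieces=$(cd "$workdir" && ls chunk.*)
echo "$pieces"

rm -r "$workdir"
```

The last piece holds whatever is left over, which in this case is a single line.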

To combine the split files back into a single file, use the cat command as follows: