Text Utilities


The GNU text utility package includes a large number of utilities to manipulate the contents of text files. These utilities are patterned after the UNIX commands of the same name. The GNU versions of the programs usually have additional options and are optimized for speed. In general, the GNU utilities avoid the arbitrary limitations of their UNIX counterparts, such as fixed limits on the length of a line.

Table 8-4 briefly describes each text utility. The best way to learn these utilities is to try each one out. Before trying out a program, type info progname or man progname (where progname is the name of the program) to view the online help information.

A few selected text utilities are described in the following sections. You can find reference pages for many of these utilities in Appendix A.

Table 8-4: GNU Text Utilities

Program   Description

cat       Concatenates files and writes them to standard output
cksum     Prints the cyclic redundancy check (CRC) checksums and byte counts of files (used to verify that files have not been corrupted in transmission, by comparing the cksum output for the received files with the cksum output for the original files)
comm      Compares two sorted files line by line. (The sort command sorts files.)
csplit    Splits a file into sections determined by text patterns in the file and places each section in a separate file named xx00, xx01, and so on
cut       Removes sections from each line of files and writes them to standard output
expand    Converts tabs in each file to spaces and writes the result to standard output
fmt       Fills and joins lines, making each line roughly the same length, and writes the formatted lines to standard output
fold      Breaks lines in a file so that each line is no wider than a specified width, and writes the lines to standard output
head      Prints the first part of files
join      Joins corresponding lines of two files using a common field and writes each line to standard output
md5sum    Computes and checks the MD5 message digest (a 128-bit checksum using the MD5 algorithm)
nl        Numbers each line in a file and writes the lines to standard output
od        Writes the contents of files to standard output in octal and other formats. (This is used to view the contents of binary files.)
paste     Merges corresponding lines of one or more files into vertical columns separated by tabs, and writes each line to standard output
pr        Formats text files for printing
ptx       Produces a permuted index of file contents
sort      Sorts lines of text files
split     Splits a file into pieces
sum       Computes and prints a 16-bit checksum for each file and counts the number of 1,024-byte blocks in the file
tac       Writes each file to standard output, last line first
tail      Prints the last part of files
tr        Translates or deletes characters in files
tsort     Performs a topological sort (used to organize a library for efficient handling by the ar and ld commands)
unexpand  Converts spaces into tabs
uniq      Removes duplicate lines from a sorted file
wc        Prints the number of bytes, words, and lines in files
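Many of these utilities read standard input and write standard output, so they combine well in pipelines. The following is a minimal sketch that counts how often each login shell appears in a colon-separated file. The sample data is hypothetical (it only mimics the layout of /etc/passwd), and the variable names are chosen for illustration.

```shell
# Build a small colon-separated sample file (hypothetical data in /etc/passwd layout).
sample=$(mktemp)
printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' \
  'alice:x:1000:1000:Alice:/home/alice:/bin/bash' \
  'bob:x:1001:1001:Bob:/home/bob:/bin/sh' > "$sample"

# cut extracts field 7 (the login shell), sort places identical lines next to
# each other, and uniq -c collapses each run into one line prefixed by its count.
shell_counts=$(cut -d: -f7 "$sample" | sort | uniq -c)
printf '%s\n' "$shell_counts"

rm -f "$sample"
```

The sort step matters because uniq only collapses duplicate lines that are adjacent.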

Counting Words and Lines in a Text File

The wc command displays the line, word, and byte counts of a text file. For example, try the following:

wc /etc/inittab
     54     236    1698 /etc/inittab

This causes wc to display the number of lines (54), words (236), and bytes (1698) in the /etc/inittab file. For a plain ASCII file, the byte count is the same as the character count. If you simply want to see the number of lines in a file, use the -l option:

wc -l /etc/inittab
     54 /etc/inittab

As you can see, in this case, wc simply displays the line count.

If you don’t specify a filename, the wc command expects input from the standard input. You can use the pipe feature of the shell to feed the output of another command to wc. This can be handy sometimes. Suppose that you want a rough count of the processes running on your system. You can get a list of all processes with the ps ax command, but instead of manually counting the lines, just pipe the output of ps to wc, and you can get a rough count, as follows:

ps ax | wc -l
     65

That means that the ps command has produced 65 lines of output. Because the first line simply shows the headings for the tabular columns, you can estimate that about 64 processes are running on your system. (Of course, this count probably includes the processes used to run the ps and wc commands as well, but who’s counting?)
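The same counting can be tried on a file whose contents you know in advance. The sketch below is illustrative only: it creates a hypothetical three-line sample file, then counts lines, words, and bytes separately with the -l, -w, and -c options.

```shell
# Create a three-line sample file (hypothetical content) in a temporary location.
sample=$(mktemp)
printf 'one two\nthree four five\nsix\n' > "$sample"

# When wc reads standard input, it prints only the number, with no filename.
lines=$(wc -l < "$sample")
words=$(wc -w < "$sample")
bytes=$(wc -c < "$sample")
echo "lines=$lines words=$words bytes=$bytes"

# The output of any command can be counted the same way through a pipe.
piped=$(printf 'a\nb\n' | wc -l)
echo "piped lines=$piped"

rm -f "$sample"
```

Redirecting the file into wc, rather than naming it as an argument, is a convenient way to get a bare number that a script can store in a variable.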

Sorting Text Files

You can sort the lines in a text file by using the sort command. To see how the sort command works, first type more /etc/passwd to see the current contents of the /etc/passwd file. Now, type sort /etc/passwd to see the lines sorted alphabetically. If you want to sort a file and save the sorted version in another file, you can use the Bash shell’s output redirection feature, as follows:

sort /etc/passwd > ~/sorted.text

This command sorts the lines in the /etc/passwd file and saves the output in a file named sorted.text in your home directory.
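The sort command also accepts options that control which part of each line is compared and where the result is written. The following sketch, which uses a hypothetical sample file in the same colon-separated layout as /etc/passwd, sorts by the numeric user ID in the third field and writes the result with the -o option.

```shell
# Create a small sample file (hypothetical entries in /etc/passwd layout).
sample=$(mktemp)
sorted=$(mktemp)
printf '%s\n' \
  'alice:x:1000:1000::/home/alice:/bin/bash' \
  'root:x:0:0:root:/root:/bin/bash' \
  'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin' > "$sample"

# -t: sets the field separator to a colon, -k3,3 limits the sort key to
# field 3 (the user ID), -n compares that key as a number, and -o names
# the file that receives the sorted output.
sort -t: -k3,3 -n -o "$sorted" "$sample"

first_user=$(head -n 1 "$sorted" | cut -d: -f1)
last_user=$(tail -n 1 "$sorted" | cut -d: -f1)
echo "lowest UID: $first_user, highest UID: $last_user"

rm -f "$sample" "$sorted"
```

Without -n, the keys would be compared as text, so 1000 would sort before 2.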

Substituting or Deleting Characters from a File

Another interesting command is tr, which substitutes one set of characters for another (or deletes selected characters) in the text it reads from standard input. Because tr does not accept filenames as arguments, you use the shell’s input redirection to feed it a file. The tr command is useful when you want to convert a text file from one operating system’s format to another’s, because different operating systems use different special characters to mark the end of a line of text.
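As an illustration of the line-ending case, the sketch below removes the carriage returns from a file that uses DOS-style line endings (a carriage return followed by a line feed), and then shows a simple substitution. The filenames and sample text are hypothetical.

```shell
dos_file=$(mktemp)
unix_file=$(mktemp)
# Hypothetical two-line file with DOS-style CR+LF line endings.
printf 'first line\r\nsecond line\r\n' > "$dos_file"

# tr reads standard input only, so the file is supplied with < redirection.
# -d deletes every carriage return, leaving UNIX-style LF line endings.
tr -d '\r' < "$dos_file" > "$unix_file"

# Count the carriage returns before and after: -cd deletes everything
# except the listed character, and wc -c counts what remains.
cr_before=$(tr -cd '\r' < "$dos_file" | wc -c)
cr_after=$(tr -cd '\r' < "$unix_file" | wc -c)
echo "carriage returns before: $cr_before, after: $cr_after"

# Substituting one set of characters for another: lowercase to uppercase.
upper=$(echo 'hello' | tr 'a-z' 'A-Z')
echo "$upper"

rm -f "$dos_file" "$unix_file"
```

Write the converted text to a different file, as shown here; redirecting the output to the same file that tr is reading would truncate it before tr could read it.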

Splitting a File into Several Smaller Files

The split command is handy when you want to copy a file to a floppy disk but the file is too large to fit on a single floppy. You can then use the split command to break up the file into smaller files, each of which can fit on a floppy.

By default, split puts 1,000 lines into each file. The files are named with the default prefix x followed by groups of letters: xaa, xab, xac, and so on. You can specify a different prefix for the filenames. For example, to split a large file called hugefile.tar into smaller files that fit onto several high-density 3.5-inch floppy disks, use split as follows:

split -b 1440k hugefile.tar part.

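This command splits the hugefile.tar file into 1,440K chunks (1,474,560 bytes each), which matches the raw capacity of a high-density floppy disk. A floppy formatted with a file system holds slightly less, so choose a smaller size, such as 1400k, if you plan to copy the pieces as ordinary files. The command creates files named part.aa, part.ab, part.ac, and so on.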

To combine the split files back into a single file, use the cat command as follows:

cat part.?? > hugefile.tar
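To confirm that the pieces reassemble into an identical file, you can compare cksum output for the original and the restored copy. The following is a small-scale sketch of that round trip. It uses a hypothetical 5,000-byte stand-in for hugefile.tar and 2,000-byte pieces so that it runs quickly; restored.tar is a name chosen for illustration.

```shell
workdir=$(mktemp -d)

# Create a 5,000-byte sample file (a hypothetical stand-in for hugefile.tar).
head -c 5000 /dev/zero | tr '\0' 'x' > "$workdir/hugefile.tar"
original_sum=$(cksum < "$workdir/hugefile.tar")

# Split into 2,000-byte pieces named part.aa, part.ab, and part.ac.
split -b 2000 "$workdir/hugefile.tar" "$workdir/part."
piece_count=$(ls "$workdir"/part.?? | wc -l)

# The shell expands part.?? in sorted order, so cat joins the pieces in sequence.
cat "$workdir"/part.?? > "$workdir/restored.tar"
restored_sum=$(cksum < "$workdir/restored.tar")

if [ "$original_sum" = "$restored_sum" ]; then
  echo "restored file matches the original ($piece_count pieces)"
fi

rm -rf "$workdir"
```

Because cksum reads standard input here, its output contains only the checksum and byte count, which makes the two results directly comparable.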