Text Processing Utilities

Solaris has many user commands available to perform tasks ranging from text processing, to file manipulation, to terminal management. In this section, we will look at some standard UNIX utilities that are the core of using a shell in Solaris. However, readers are urged to obtain an up-to-date list of the utilities supplied with Solaris by typing the command:

$ man intro

The cat command displays the contents of a file to standard output, without any kind of pagination or screen control. It is most useful for viewing small files, or for passing the contents of a text file through another filter or utility (for example, the grep command, which searches for strings). To examine the contents of the groups database, for example, we would use this command:

# cat /etc/group
root::0:root
other::1:
bin::2:root,bin,daemon
sys::3:root,bin,sys,adm
adm::4:root,adm,daemon
uucp::5:root,uucp
mail::6:root
tty::7:root,tty,adm
lp::8:root,lp,adm
nuucp::9:root,nuucp
staff::10:
postgres::100:
daemon::12:root,daemon
sysadmin::14:
nobody::60001:
noaccess::60002:
nogroup::65534:

The cat command is not well suited to examining specific sections of a file. For example, if you need to examine the first few lines of a web server's log file, cat would display them, but they would quickly scroll off the screen. Instead, you can use the head command, which displays only the first few lines of a file (ten by default). In this example, we display the first ten lines of the log file of the Inprise application server:

bart:/usr/local/inprise/ias41/logs/bart/webpageservice > head access_log
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:53 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:53 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:53 +1000] "GET /index.jsp HTTP/1.0" 200 24077

Alternatively, if you just want to examine the last few lines of a file, you could use the cat command to display the entire file, ending with the last few lines, or you could use the tail command to display only those lines. If the file is large (for example, an Inprise application server log file of 2MB), displaying the whole file with cat would waste system resources, whereas tail is very efficient. Here's an example of using tail to display the last few lines of a file:

bart:/usr/local/inprise/ias41/logs/bart/webpageservice > tail access_log
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:52 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:53 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:53 +1000] "GET /index.jsp HTTP/1.0" 200 24077
203.16.206.43 - - [31/Aug/2000:14:32:53 +1000] "GET /index.jsp HTTP/1.0" 200 24077
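Both head and tail display ten lines by default, and both accept the -n option to request a different number. The following is a minimal sketch rather than a transcript from the server above: the scratch directory, the file name sample_log, and its contents are made up for illustration, and the mktemp command is assumed to be available on your system.

```shell
# Build a hypothetical 20-line sample file in a scratch directory.
workdir=`mktemp -d`
i=1
while [ "$i" -le 20 ]; do
    echo "entry $i"
    i=`expr $i + 1`
done > "$workdir/sample_log"

# Capture the first three lines and the last three lines of the file.
first_three=`head -n 3 "$workdir/sample_log"`
last_three=`tail -n 3 "$workdir/sample_log"`

echo "$first_three"
echo "$last_three"
```

Asking for only as many lines as you need keeps the output on a single screen, which is usually the point of reaching for head or tail instead of cat.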

Now, imagine that you were searching for a particular string within the access_log file, such as a 404 error code, which indicates that a page has been requested that does not exist. Webmasters regularly check log files for this error code to create a list of links that need to be checked. To view this list, we can use the grep command to search the file for a specific string (in this case, "404"), and the more command can be used to display the results page by page:

bart:/usr/local/inprise/ias41/logs/bart/webpageservice > grep 404 access_log | more
203.16.206.56 - - [31/Aug/2000:15:42:54 +1000] "GET /servlet/LibraryCatalog?command=mainmenu HTTP/1.1" 200 21404
203.16.206.56 - - [01/Sep/2000:08:32:12 +1000] "GET /servlet/LibraryCatalog?command=searchbyname HTTP/1.1" 200 14041
203.16.206.237 - - [01/Sep/2000:09:20:35 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.236 - - [01/Sep/2000:10:10:35 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.236 - - [01/Sep/2000:10:10:40 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.236 - - [01/Sep/2000:10:10:47 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.236 - - [01/Sep/2000:10:11:09 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.236 - - [01/Sep/2000:10:11:40 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.236 - - [01/Sep/2000:10:11:44 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.236 - - [01/Sep/2000:10:12:03 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.41 - - [01/Sep/2000:12:04:22 +1000] "GET /data/books/576586955.pdf HTTP/1.0" 404 1204
--More--

These log files contain a line for each access to the web server, with entries for the source IP address, the date and time of access, the HTTP request string sent (including the protocol used), the success/error code, and the size of the response in bytes. Note that grep matches the string anywhere on a line, so the first two lines of output are successful requests (code 200) whose sizes, 21404 and 14041, happen to contain "404". When you see the --More-- prompt, press the SPACEBAR to advance to the next screen, or press ENTER to advance by a single line. As you have probably guessed, the pipeline operator | was used to pass the results of the grep command through to the more command.
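Because a plain search for 404 also matches other fields that happen to contain those digits, it can help to include some of the surrounding text in the search pattern. The sketch below assumes the log layout shown in this section, in which the status code comes immediately after the closing quote of the request string. The scratch directory is hypothetical, and the three sample lines are taken from the output above.

```shell
# Write three sample log lines to a scratch file (hypothetical location).
workdir=`mktemp -d`
cat > "$workdir/access_log" << 'EOF'
203.16.206.56 - - [31/Aug/2000:15:42:54 +1000] "GET /servlet/LibraryCatalog?command=mainmenu HTTP/1.1" 200 21404
203.16.206.237 - - [01/Sep/2000:09:20:35 +1000] "GET /images/LINE.gif HTTP/1.1" 404 1204
203.16.206.41 - - [01/Sep/2000:12:04:22 +1000] "GET /data/books/576586955.pdf HTTP/1.0" 404 1204
EOF

# Loose search: counts every line containing "404" anywhere,
# including the successful request whose size is 21404.
loose=`grep -c 404 "$workdir/access_log"`

# Stricter search: closing quote, space, 404, space, which in this
# layout only occurs where 404 is the status code.
strict=`grep -c '" 404 ' "$workdir/access_log"`

echo "loose=$loose strict=$strict"
```

The -c option makes grep print a count of matching lines instead of the lines themselves; drop it and pipe the result through more to page through the matches as before.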

In addition to the pipeline, there are four other operators that can be used on the command line to redirect where a command sends its standard output or where it reads its standard input. Although that sounds convoluted, it can be very useful when working with files to direct the output of a command into a new file (or append it to an existing file), or to supply a command's input from a file or from text typed inline rather than from the keyboard. These operations are performed by the following operators:

  • > Redirect standard output to a file.

  • >> Append standard output to a file.

  • < Redirect file contents to standard input.

  • << Read standard input from a here-document: the lines that follow the command, up to a delimiter word.
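The four operators can be tried together in a short session. This is only an illustrative sketch: the scratch directory, the file names colors.txt and more_colors.txt, and their contents are all hypothetical, and mktemp is assumed to be available.

```shell
# Scratch directory and file names are hypothetical.
workdir=`mktemp -d`
colors="$workdir/colors.txt"

echo "red" > "$colors"       # > creates the file (or overwrites it)
echo "green" >> "$colors"    # >> appends a second line

# < makes sort read the file on its standard input.
sorted=`sort < "$colors"`
echo "$sorted"

# << supplies the lines up to the delimiter (EOF here) as standard input;
# the sorted result is redirected into a second file with >.
sort << EOF > "$workdir/more_colors.txt"
yellow
blue
EOF

more_sorted=`cat "$workdir/more_colors.txt"`
echo "$more_sorted"
```

The delimiter word after << is arbitrary; EOF is simply a common choice. The here-document ends at the first line consisting only of that word.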

Bash also has comparison operators, including the 'less than' (-lt) operator, which is used with the test facility to make numerical comparisons between two operands. Other commonly used operators include

  • a -eq b: a equals b.

  • a -ne b: a not equal to b.

  • a -gt b: a greater than b.

  • a -ge b: a greater than or equal to b.

  • a -le b: a less than or equal to b.
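These operators are normally used inside the square-bracket form of test, as part of an if or while statement. The following sketch uses two made-up values, a and b, purely for illustration.

```shell
# Two hypothetical integer values to compare.
a=5
b=12

# -lt: pick the smaller of the two values.
if [ "$a" -lt "$b" ]; then
    smaller=$a
else
    smaller=$b
fi
echo "smaller value: $smaller"

# -le and -ge: count the integers from 1 to b that are at least a.
count=0
n=1
while [ "$n" -le "$b" ]; do
    if [ "$n" -ge "$a" ]; then
        count=`expr $count + 1`
    fi
    n=`expr $n + 1`
done
echo "values from $a to $b: $count"
```

Quoting the variables inside the brackets avoids a syntax error from test if one of them happens to be empty.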

Let's look at an example with the cat command, which displays the contents of files, and the echo command, which echoes the contents of a string or an environment variable that has been previously specified. For example, imagine if we wanted to maintain a database of endangered species in a text file called animals.txt. If we wanted to add the first animal "zebra" to an empty file, we could use this command:

# echo "zebra" > animals.txt

We could then check the contents of the file animals.txt with the following command:

# cat animals.txt
zebra

Thus, the insertion was successful. Now, imagine that we want to add a second entry (the animal 'emu') to the animals.txt file. We could try using the command

# echo "emu" > animals.txt

but the result may not be what we expected:

# cat animals.txt
emu

This is because the > operator always overwrites the contents of an existing file, while the >> operator always appends to the contents of an existing file. Let's run that command again with the correct operators:

# echo "zebra" > animals.txt
# echo "emu" >> animals.txt

Luckily, the output is just what we expected:

# cat animals.txt
zebra
emu

Once we have a file containing a list of all the animals, we would probably want to sort it alphabetically, making searching for specific entries easy. To do this, we can use the sort command:

# sort animals.txt
emu
zebra

The sorted entries are then displayed on the screen in alphabetical order. It is also possible to redirect the sorted list into another file (called sorted_animals.txt) by using this command:

# sort animals.txt > sorted_animals.txt

If you wanted to check that the sorting process actually worked, you could compare the contents of the animals.txt file line by line with the sorted_animals.txt file by using the diff command:

# diff animals.txt sorted_animals.txt
1d0
< zebra
2a2
> zebra

This result indicates that the first and second lines of the animals.txt and sorted_animals.txt files are different, as expected: zebra has moved from the first line to the second. If sort had left the order unchanged (as it would if animals.txt were already in alphabetical order), the two files would have been identical, and no differences would have been reported by diff.
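In a script, it is often more convenient to test the exit status of diff than to read its output: diff exits with status 0 when the files are identical and 1 when they differ. The sketch below recreates the two animal files in a hypothetical scratch directory and also uses the -c option of sort, which checks whether a file is already in sorted order without writing any sorted output.

```shell
# Recreate the example files in a scratch directory (hypothetical path).
workdir=`mktemp -d`
echo "zebra" > "$workdir/animals.txt"
echo "emu" >> "$workdir/animals.txt"
sort "$workdir/animals.txt" > "$workdir/sorted_animals.txt"

# diff exits 0 if the files match and 1 if they differ;
# its normal output is discarded here.
if diff "$workdir/animals.txt" "$workdir/sorted_animals.txt" > /dev/null; then
    status="order unchanged"
else
    status="order changed"
fi
echo "$status"

# sort -c exits 0 only if the file is already sorted.
if sort -c "$workdir/sorted_animals.txt" 2> /dev/null; then
    check="sorted"
else
    check="unsorted"
fi
echo "$check"
```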

A related facility is the basename command, which strips any leading directory path from a filename and, optionally, a trailing suffix such as a file extension. This is commonly used when converting files with one extension to another extension. For example, let's imagine that we had a graphic file conversion program that took as its first argument the name of a source JPEG file and as its second argument the name of a target bitmap file. Somehow, we'd need to convert a filename of the form filename.jpg to a filename of the form filename.bmp. We can do this with the basename command. In order to strip a file extension from an argument, we need to pass the filename and the extension as separate arguments to basename. The same approach works for any extension; for example, the command

# basename maya.gif .gif

will produce the following output:

maya

If we want the .gif extension to be replaced by a .bmp extension, we could use

# echo `basename maya.gif .gif`.bmp

which will produce the following output:

maya.bmp

Of course, we are not limited to extensions like .gif and .bmp. Also, keep in mind that the basename technique is entirely general, and since Solaris does not have mandatory filename extensions, it can be used for other purposes, such as generating a set of strings based on filenames.
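The same technique extends naturally to a loop over several files. The sketch below only prints the names that a conversion program would be given; the filenames are hypothetical, and no real conversion program is invoked.

```shell
# Hypothetical source files; the first includes a directory path.
targets=""
for source in /tmp/images/maya.gif logo.gif; do
    # basename removes the directory part and the .gif suffix,
    # and the .bmp extension is then appended to the result.
    target=`basename "$source" .gif`.bmp
    targets="$targets $target"
    echo "$source would be converted to $target"
done
```

Note that basename also removes the /tmp/images/ directory from the first name, so the target would be created in the current directory unless a path is added back.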


