Few Mac users know of the curl utility, shipped with every Mac running OS X 10.2, or of the easily installed wget. Both allow you to download from the command line, and with a little magic to boot.
There are hundreds of ways to download files located on the Net: FTP, HTTP, NNTP, Gnutella, Hotline, Carracho; the list of possible options goes on and on. There is, however, an odd man out among these protocols, and that's HTTP. Most web browsers are designed to view web pages (as you'd expect); they're not designed to download massive numbers of files from a public web directory. This often leaves users with a few meager choices: manually and slowly download each file themselves, or go out and find some software that can do it for them.
With OS X, your answer comes in the form of free software that lets you download from the command line [Hack #48]: one utility installed by default, and one obtainable through Fink (http://fink.sf.net/) [Hack #58]. Investigating the preinstalled utility makes it sound innocent enough:
curl is a client to get documents/files from or send documents to a server, using any of the supported protocols (HTTP, HTTPS, FTP, GOPHER, DICT, TELNET, LDAP or FILE). The command is designed to work without user interaction or any kind of interactivity.
Further reading through its manual (accessible by entering man curl as a shell command or a slightly longer version with curl --manual) shows a wide range of features, including the ability to get SSL documents, manipulate authentication credentials, change the user agent, set cookies, and prefill form values with either GET or POST. Sadly, curl has some shortcomings, and they all revolve around downloading files that don't have similar names.
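For instance, the following sketch shows a few of those features in action; the URL and the form field names are placeholders for illustration, not examples from the manual:

```shell
# Pose as a specific browser when requesting a page (-A sets the user agent)
curl -A "Mozilla/4.0 (compatible)" http://www.example.com/

# Send a cookie along with the request (-b supplies name=value pairs)
curl -b "session=abc123" http://www.example.com/

# Prefill a form and submit it via POST (-d sends the data in the body)
curl -d "name=joe&color=blue" http://www.example.com/form
```

Each of these is a single, unattended request, in keeping with curl's no-interaction design.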
Almost immediately, the manual introduces you to curl's range feature, which lets you download a list of sequentially numbered files with a simple command:
% curl -LO http://www.example.com/file[0-100].txt
The -L flag tells curl to follow any redirects that may be issued, and the -O flag saves each downloaded file locally under its remote name (./file0.txt, ./file1.txt, and so on). The limitations of the range feature show up all too clearly with date-based filenames. Say I want to download a list of files whose names take the form yymmdd.txt. I could use this innocent-looking command:
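If you'd rather not keep the remote names, curl can also substitute the current range value into a custom local filename with #1; a sketch, again using a placeholder URL:

```shell
# Save file0.txt ... file100.txt locally as archive_0.txt ... archive_100.txt;
# #1 expands to the current value of the first [ ] range in the URL
curl -L -o "archive_#1.txt" "http://www.example.com/file[0-100].txt"
```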
% curl -LO http://www.example.com/text/[1996-2002]/[000001-999999].txt
If you are patient enough, this will work fine. The downside is that curl will literally try to grab nearly a million files per year, for each of the seven years from 1996 through 2002. While a patient downloader may not care, that creates an insane amount of wasted bandwidth, as well as a potentially angry web host. We could split the previous command into two:
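A quick back-of-the-envelope count shows just how wasteful the naive range is:

```shell
# [000001-999999] tries 999,999 candidate names per year,
# and 1996 through 2002 spans seven years
years=7
per_year=999999
echo $(( years * per_year ))   # → 6999993, just shy of seven million requests
```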
% curl -LO http://www.example.com/text/[1996-1999]/[96-99][01-12][01-31].txt
% curl -LO http://www.example.com/text/[2000-2002]/[00-02][01-12][01-31].txt
These will also work correctly, at the expense of being lengthy (technically, we could combine the two curl commands into one with two URLs) and of still causing a large number of "file not found" errors for the web host (albeit not as many as the first attempt).
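For the curious, the combined form would look something like this: curl accepts several URLs in one invocation, as long as each gets its own -O. A sketch, with the same placeholder host:

```shell
# -L applies to the whole run; each -O saves its URL under the remote filename
curl -L \
  -O "http://www.example.com/text/[1996-1999]/[96-99][01-12][01-31].txt" \
  -O "http://www.example.com/text/[2000-2002]/[00-02][01-12][01-31].txt"
```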
Solving this sort of problem can be done easily with a freely available utility called wget, which used to ship with earlier versions of OS X (Apple replaced it with curl). You can install it again quite easily with Fink [Hack #58]. With wget, we simply enter the following:
% wget -m -A txt -np http://www.example.com/text/
We start off in mirror mode (-m), which allows us to run the command again at a later date and grab only content that has changed since our previous download. We accept (-A) only files that end in .txt, and we don't want anything from the parent directory (-np, for no parent); this stops wget from following links that lead out of the text directory. Both wget and curl display a running progress report as they download files. More information about wget is available by typing man wget on the command line.
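If you're worried about hammering the web host, wget can also pause between retrievals; a sketch using its standard --wait flag (the URL is a placeholder):

```shell
# Mirror the text directory as before, but wait one second between
# requests to be kinder to the server
wget -m -A txt -np --wait=1 http://www.example.com/text/
```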