10.6 Compressed Files

Although storage space and transmission bandwidth are increasingly cheap and abundant, in many cases you can save such resources, at the expense of some computational effort, by using compression. Since computational power grows cheaper and more abundant even faster than other resources, such as bandwidth, compression's popularity keeps growing. Python makes it easy for your programs to support compression by supplying dedicated modules for compression as part of every Python distribution.

10.6.1 The gzip Module

The gzip module lets you read and write files compatible with those handled by the powerful GNU compression programs gzip and gunzip. The GNU programs support several compression formats, but module gzip supports only the highly effective native gzip format, normally denoted by appending the extension .gz to a filename. Module gzip supplies the GzipFile class and an open factory function.

GzipFile

class GzipFile(filename=None,mode=None,compresslevel=9,
               fileobj=None)

Creates and returns a file-like object f that wraps the file or file-like object fileobj. f supplies all methods of built-in file objects except seek and tell. Thus, f is not seekable: you can only access f sequentially, whether for reading or writing. When fileobj is None, filename must be a string that names a file: GzipFile opens that file with the given mode (by default, 'rb'), and f wraps the resulting file object. mode should be one of 'ab', 'rb', 'wb', or None. If mode is None, f uses the mode of fileobj if it is able to find out the mode; otherwise it uses 'rb'. If filename is None, f uses the filename of fileobj if able to find out the name; otherwise it uses ''. compresslevel is an integer between 1 and 9: 1 requests modest compression but fast operation, and 9 requests the best compression feasible, even if that requires more computation.

File-like object f generally delegates all methods to the underlying file-like object fileobj, transparently accounting for compression as needed. However, f does not allow non-sequential access, so f does not supply methods seek and tell. Moreover, calling f.close does not close fileobj when f was created with an argument fileobj that is not None. This behavior of f.close is very important when fileobj is an instance of StringIO.StringIO, since it means you can call fileobj.getvalue after f.close to get the compressed data as a string. This behavior also means that you have to call fileobj.close explicitly after calling f.close.
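The in-memory pattern just described can be sketched as follows. This is only an illustration, written for a current Python 3 interpreter rather than the Python 2.2 covered here: io.BytesIO stands in for StringIO.StringIO, the data are byte strings, and the helper name gzip_to_bytes is hypothetical.

```python
import gzip
import io

def gzip_to_bytes(data, compresslevel=9):
    """Compress byte string `data` into gzip format entirely in memory."""
    buffer = io.BytesIO()            # plays the role of StringIO.StringIO
    f = gzip.GzipFile(fileobj=buffer, mode='wb', compresslevel=compresslevel)
    try:
        f.write(data)
    finally:
        f.close()                    # finishes the gzip stream; buffer stays open
    compressed = buffer.getvalue()   # still legal, since f.close() left buffer open
    buffer.close()                   # closing the underlying object is our job
    return compressed

payload = b'four score and seven years ago\n' * 50
packed = gzip_to_bytes(payload)

# Read the compressed bytes back through a second wrapper to check them.
unpacked = gzip.GzipFile(fileobj=io.BytesIO(packed), mode='rb').read()
```

Note the order of operations: getvalue is called after f.close, because only then is the compressed stream complete.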

open

open(filename,mode='rb',compresslevel=9)

Like GzipFile(filename,mode,compresslevel), but filename is mandatory and there is no provision for passing an already opened fileobj.
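When you just need a compressed file on disk, the factory function is usually enough. The sketch below assumes a current Python 3 interpreter and byte-string data; the function name gzip_roundtrip and the use of a temporary file are choices made for this illustration only.

```python
import gzip
import os
import tempfile

def gzip_roundtrip(data):
    """Write bytes to a temporary .gz file with gzip.open, then read them back."""
    fd, path = tempfile.mkstemp(suffix='.gz')
    os.close(fd)
    try:
        out = gzip.open(path, 'wb', 1)   # compresslevel 1: fastest, least compact
        try:
            out.write(data)
        finally:
            out.close()
        inp = gzip.open(path)            # mode defaults to 'rb'
        try:
            return inp.read()
        finally:
            inp.close()
    finally:
        os.unlink(path)                  # remove the temporary file

result = gzip_roundtrip(b'hello, gzip\n')
```

Since gzip.open opens the underlying file itself, a single close call per object suffices here.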

Say that you have some function f(x) that writes data to a text file object x, typically by calling x.write and/or x.writelines. Getting f to write data to a gzip-compressed text file instead is easy:

import gzip
underlying_file = open('x.txt.gz', 'wb')
compressing_wrapper = gzip.GzipFile(fileobj=underlying_file, mode='wt')
f(compressing_wrapper)
compressing_wrapper.close()
underlying_file.close()

This example opens the underlying binary file x.txt.gz and explicitly wraps it with gzip.GzipFile, and thus, at the end, we need to close each object separately. This is necessary because we want to use two different modes: the underlying file must be opened in binary mode (any translation of line endings would produce an invalid compressed file), but the compressing wrapper must be opened in text mode because we want the implicit translation of os.linesep to \n. Reading back a compressed text file, for example to display it on standard output, is similar:

import gzip, xreadlines
underlying_file = open('x.txt.gz', 'rb')
uncompressing_wrapper = gzip.GzipFile(fileobj=underlying_file, mode='rt')
for line in xreadlines.xreadlines(uncompressing_wrapper):
    print line,
uncompressing_wrapper.close()
underlying_file.close()

This example uses module xreadlines, covered earlier in this chapter, because GzipFile objects (at least up to Python 2.2) are not iterable like true file objects, nor do they supply an xreadlines method. GzipFile objects do supply a readlines method that closely emulates that of true file objects, and therefore module xreadlines is able to produce a lazy sequence that wraps a GzipFile object and lets us iterate on the GzipFile object's lines.

10.6.2 The zipfile Module

The zipfile module lets you read and write ZIP files (i.e., archive files compatible with those handled by popular compression programs zip and unzip, pkzip and pkunzip, WinZip, and so on). Detailed information on the formats and capabilities of ZIP files can be found at http://www.pkware.com/appnote.html and http://www.info-zip.org/pub/infozip/. You need to study this detailed information in order to perform advanced ZIP file handling with module zipfile.

Module zipfile can't handle ZIP files with appended comments, multidisk ZIP files, or .zip archive members using compression types besides the usual ones, known as stored (when a file is copied to the archive without compression) and deflated (when a file is compressed using the ZIP format's default algorithm). For invalid .zip file errors, functions of module zipfile raise exceptions that are instances of exception class zipfile.error. Module zipfile supplies the following classes and functions.

is_zipfile

is_zipfile(filename)

Returns True if the file named by string filename appears to be a valid ZIP file, judging by the end-of-archive record that a ZIP file carries at its end; otherwise returns False.
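A quick way to see is_zipfile and zipfile.error side by side is to probe one real archive and one ordinary file. This is a sketch that assumes a current Python 3 interpreter (where zipfile.error remains available as an alias); the function probe and all filenames are hypothetical.

```python
import os
import shutil
import tempfile
import zipfile

def probe(path):
    """Return (is_zipfile result, whether ZipFile managed to open the file)."""
    looks_ok = zipfile.is_zipfile(path)
    try:
        z = zipfile.ZipFile(path)        # mode defaults to 'r'
    except zipfile.error:                # raised for an invalid ZIP file
        return looks_ok, False
    z.close()
    return looks_ok, True

workdir = tempfile.mkdtemp()
try:
    zip_path = os.path.join(workdir, 'real.zip')
    txt_path = os.path.join(workdir, 'plain.txt')

    # A genuine, if tiny, archive with a single member.
    z = zipfile.ZipFile(zip_path, 'w')
    try:
        z.writestr(zipfile.ZipInfo('a.txt', (2001, 1, 1, 0, 0, 0)), b'data')
    finally:
        z.close()

    # An ordinary file that is not an archive at all.
    f = open(txt_path, 'wb')
    f.write(b'just some ordinary text, not an archive\n')
    f.close()

    real_result = probe(zip_path)
    text_result = probe(txt_path)
finally:
    shutil.rmtree(workdir)
```

Calling is_zipfile first lets a program reject unsuitable files without needing a try/except around the ZipFile call.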

ZipInfo

class ZipInfo(filename='NoName',date_time=(1980,1,1,0,0,0))

Methods getinfo and infolist of ZipFile instances return instances of ZipInfo to supply information about members of the archive. The most useful attributes supplied by a ZipInfo instance z are:

comment: A string that is a comment on the archive member
compress_size: Size in bytes of the compressed data for the archive member
compress_type: An integer code recording the type of compression of the archive member
date_time: A tuple with 6 integers recording the time of last modification to the file: the items are year, month, day (1 and up), hour, minute, second (0 and up)
file_size: Size in bytes of the uncompressed data for the archive member
filename: Name of the file in the archive
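To make these attributes concrete, the following sketch writes one member and then reads its metadata back through infolist. It assumes a current Python 3 interpreter; the member name, the timestamp, and the temporary archive are arbitrary choices for the illustration. The seconds value is even because ZIP timestamps are stored with two-second resolution.

```python
import os
import tempfile
import zipfile

fd, path = tempfile.mkstemp(suffix='.zip')
os.close(fd)
try:
    data = b'four score and seven years ago\n' * 20

    # Describe the member explicitly, then add it to a new archive.
    zinfo = zipfile.ZipInfo('notes.txt', (2002, 3, 14, 9, 26, 54))
    zinfo.compress_type = zipfile.ZIP_DEFLATED
    z = zipfile.ZipFile(path, 'w')
    try:
        z.writestr(zinfo, data)
    finally:
        z.close()

    # Reopen the archive and collect the attributes of each member.
    z = zipfile.ZipFile(path)
    try:
        report = [(i.filename, i.date_time, i.file_size,
                   i.compress_size, i.compress_type)
                  for i in z.infolist()]
    finally:
        z.close()
finally:
    os.unlink(path)
```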

ZipFile

class ZipFile(filename,mode='r',compression=zipfile.ZIP_STORED)

Opens a ZIP file named by string filename. mode can be 'r', to read an existing ZIP file; 'w', to write a new ZIP file or truncate and rewrite an existing one; or 'a', to append to an existing file.

When mode is 'a', filename can name either an existing ZIP file (in which case new members are added to the existing archive) or an existing non-ZIP file. In the latter case, a new ZIP file-like archive is created and appended to the existing file. The main purpose of this latter case is to let you build a self-unpacking .exe file (i.e., a Windows executable file that unpacks itself when run). The existing file must then be a fresh copy of an unpacking .exe prefix, as supplied by www.info-zip.org or by other purveyors of ZIP file compression tools.

compression is an integer code that can be either of two attributes of module zipfile. zipfile.ZIP_STORED requests that the archive use no compression, and zipfile.ZIP_DEFLATED requests that the archive use the deflation mode of compression (i.e., the most usual and effective compression approach used in .zip files).
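The effect of the compression argument is easy to observe by archiving the same file twice. The sketch below assumes a current Python 3 interpreter; the helper archive_size and the filenames are made up for this example, and the repetitive sample text is chosen because it compresses well.

```python
import os
import shutil
import tempfile
import zipfile

def archive_size(source, compression):
    """Build a one-member archive from file `source`; return its size in bytes."""
    target = os.path.join(os.path.dirname(source), 'out-%d.zip' % compression)
    z = zipfile.ZipFile(target, 'w', compression)
    try:
        z.write(source, 'sample.txt')    # uses the archive's compression mode
    finally:
        z.close()
    return os.path.getsize(target)

workdir = tempfile.mkdtemp()
try:
    source = os.path.join(workdir, 'sample.txt')
    f = open(source, 'wb')
    f.write(b'four score and seven years ago\n' * 200)
    f.close()

    stored_size = archive_size(source, zipfile.ZIP_STORED)
    deflated_size = archive_size(source, zipfile.ZIP_DEFLATED)
finally:
    shutil.rmtree(workdir)
```

With ZIP_STORED the archive comes out slightly larger than the original file, since the ZIP headers are added to an uncompressed copy of the data.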

A ZipFile instance z supplies the following methods.

close

z.close()

Closes archive file z. Make sure the close method is called, or else an incomplete and unusable ZIP file might be left on disk. Such mandatory finalization is generally best performed with a try/finally statement, as covered in Chapter 6.
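A minimal sketch of that try/finally idiom follows, assuming a current Python 3 interpreter. The function write_archive and the sample member names are hypothetical.

```python
import os
import tempfile
import time
import zipfile

def write_archive(path, members):
    """Write (name, data) pairs to a new archive, always finalizing it."""
    z = zipfile.ZipFile(path, 'w', zipfile.ZIP_DEFLATED)
    try:
        for name, data in members:
            zinfo = zipfile.ZipInfo(name, time.localtime()[:6])
            z.writestr(zinfo, data)
    finally:
        z.close()            # runs even if one of the writestr calls raises

fd, path = tempfile.mkstemp(suffix='.zip')
os.close(fd)
try:
    write_archive(path, [('a.txt', b'alpha\n'), ('b.txt', b'beta\n')])
    z = zipfile.ZipFile(path)
    try:
        names = z.namelist()
    finally:
        z.close()
finally:
    os.unlink(path)
```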

getinfo

z.getinfo(name)

Returns a ZipInfo instance that supplies information about the archive member named by string name.

infolist

z.infolist()

Returns a list of ZipInfo instances, one for each member in archive z, in the same order as the entries in the archive itself.

namelist

z.namelist()

Returns a list of strings, the names of each member in archive z, in the same order as the entries in the archive itself.

printdir

z.printdir()

Outputs a textual directory of the archive z to file sys.stdout.

read

z.read(name)

Returns a string containing the uncompressed bytes of the file named by string name in archive z. z must be opened for 'r' or 'a'. When the archive does not contain a file named name, read raises an exception.

testzip

z.testzip()

Reads and checks the files in archive z. Returns a string with the name of the first archive member that is damaged, or None when the archive is intact.
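Both outcomes of testzip can be seen by checking an archive before and after deliberately damaging it. This sketch assumes a current Python 3 interpreter; the helper first_damaged is hypothetical, and the damage is simulated by overwriting one byte of a member stored without compression, so that the payload can be located directly in the archive file.

```python
import os
import tempfile
import zipfile

def first_damaged(path):
    """Return the name of the first damaged member of the archive, or None."""
    z = zipfile.ZipFile(path)
    try:
        return z.testzip()
    finally:
        z.close()

fd, path = tempfile.mkstemp(suffix='.zip')
os.close(fd)
try:
    payload = b'four score and seven years ago\n'
    z = zipfile.ZipFile(path, 'w')       # ZIP_STORED: payload is copied as is
    try:
        z.writestr(zipfile.ZipInfo('a.txt', (2001, 1, 1, 0, 0, 0)), payload)
    finally:
        z.close()
    healthy = first_damaged(path)

    # Simulate damage: overwrite the first byte of the stored payload.
    f = open(path, 'rb')
    raw = f.read()
    f.close()
    f = open(path, 'r+b')
    f.seek(raw.find(payload))
    f.write(b'F')
    f.close()
    damaged = first_damaged(path)
finally:
    os.unlink(path)
```

The damaged member is detected because its data no longer match the checksum recorded in the archive.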

write

z.write(filename,arcname=None,compress_type=None)

Writes the file named by string filename to archive z, with archive member name arcname. When arcname is None, write uses filename as the archive member name. When compress_type is None, write uses z's compression type; otherwise, compress_type is zipfile.ZIP_STORED or zipfile.ZIP_DEFLATED, and specifies how to compress the file. z must be opened for 'w' or 'a'.
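The following sketch adds one disk file to an archive twice, under two different member names and with two different compression types, and then inspects the result with getinfo. It assumes a current Python 3 interpreter; all file and member names are invented for the example.

```python
import os
import shutil
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
try:
    source = os.path.join(workdir, 'report.txt')
    f = open(source, 'wb')
    f.write(b'four score and seven years ago\n' * 100)
    f.close()

    archive = os.path.join(workdir, 'out.zip')
    z = zipfile.ZipFile(archive, 'w')    # archive default: ZIP_STORED
    try:
        # arcname decouples the member name from the path on disk.
        z.write(source, 'plain/report.txt')
        # compress_type overrides the archive's default for this member only.
        z.write(source, 'packed/report.txt', zipfile.ZIP_DEFLATED)
    finally:
        z.close()

    z = zipfile.ZipFile(archive)
    try:
        plain = z.getinfo('plain/report.txt')
        packed = z.getinfo('packed/report.txt')
        types = (plain.compress_type, packed.compress_type)
        sizes = (plain.compress_size, packed.compress_size)
        same_data = z.read('plain/report.txt') == z.read('packed/report.txt')
    finally:
        z.close()
finally:
    shutil.rmtree(workdir)
```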

writestr

z.writestr(zinfo,bytes)

zinfo must be a ZipInfo instance specifying at least filename and date_time. bytes is a string of bytes. writestr adds a member to archive z, using the metadata specified by zinfo and the data in bytes. z must be opened for 'w' or 'a'. When you have data in memory and need to write the data to the ZIP file archive z, it's simpler and faster to use z.writestr rather than z.write. The latter approach would require you to write the data to disk first, and later remove the useless disk file. The following example shows both approaches, each encapsulated into a function, polymorphic to each other:

import zipfile

def data_to_zip_direct(z, data, name):
    # Build the member's metadata in memory and write data straight to z
    import time
    zinfo = zipfile.ZipInfo(name, time.localtime()[:6])
    z.writestr(zinfo, data)

def data_to_zip_indirect(z, data, name):
    # Write data to a disk file, add that file to z, then remove the file
    import os
    flob = open(name, 'wb')
    flob.write(data)
    flob.close()
    z.write(name)
    os.unlink(name)

zz = zipfile.ZipFile('z.zip', 'w', zipfile.ZIP_DEFLATED)
data = 'four score\nand seven\nyears ago\n'
data_to_zip_direct(zz, data, 'direct.txt')
data_to_zip_indirect(zz, data, 'indirect.txt')
zz.close()

Besides being faster and more concise, data_to_zip_direct is handier because, by working in memory, it doesn't need to have the current working directory be writable, as data_to_zip_indirect does. Of course, method write also has its uses, but that's mostly when you already have the data in a file on disk, and just want to add the file to the archive. Here's how you can print a list of all files contained in the ZIP file archive created by the previous example, followed by each file's name and contents:

import zipfile
zz = zipfile.ZipFile('z.zip')
zz.printdir()
for name in zz.namelist():
    print '%s: %r' % (name, zz.read(name))
zz.close()

10.6.3 The zlib Module

The zlib module lets Python programs use the free InfoZip zlib compression library (see http://www.info-zip.org/pub/infozip/zlib/), Version 1.1.3 or later. Module zlib is used by modules gzip and zipfile, but the module is also available directly for any special compression needs. This section documents the most commonly used functions supplied by module zlib.

Module zlib also supplies functions to compute Cyclic-Redundancy Check (CRC) checksums, in order to detect possible damage in compressed data. It also provides objects that can compress and decompress data incrementally, and thus enable you to work with data streams that are too large to fit in memory at once. For such advanced functionality, consult the Python library's online reference.
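As a brief taste of that functionality, the sketch below compresses a sequence of chunks incrementally and computes a running CRC over the same chunks. It assumes a current Python 3 interpreter and uses zlib.compressobj, zlib.decompressobj, and zlib.crc32; the helper names compress_chunks and running_crc are hypothetical.

```python
import zlib

def compress_chunks(chunks, level=6):
    """Compress an iterable of byte strings without joining them first."""
    compressor = zlib.compressobj(level)
    pieces = [compressor.compress(chunk) for chunk in chunks]
    pieces.append(compressor.flush())    # emit whatever is still buffered
    return b''.join(pieces)

def running_crc(chunks):
    """Compute the CRC of the concatenated chunks, one chunk at a time."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)     # feed the previous value back in
    return crc

chunks = [b'four score\n', b'and seven\n', b'years ago\n'] * 100
whole = b''.join(chunks)

packed = compress_chunks(chunks)

# Decompress incrementally as well, here in a single step for brevity.
decompressor = zlib.decompressobj()
restored = decompressor.decompress(packed) + decompressor.flush()
```

In a real program the chunks would come from a file or a network connection, so that the whole data stream never needs to sit in memory at once.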

Note that files containing data compressed with zlib are not automatically interchangeable with other programs, with the exception of files that use the zipfile module and therefore respect the standard format of ZIP file archives. You could write a custom program, with any language able to use InfoZip's free zlib compression library, in order to read files produced by Python programs using the zlib module. However, if you do need to interchange compressed data with programs coded in other languages, I suggest you use modules gzip or zipfile instead. Module zlib may be useful when you want to compress some parts of data files that are in some proprietary format of your own, and need not be interchanged with any other program except those that make up your own application.

compress

compress(str,level=6)

Compresses string str and returns the string of compressed data. level is an integer between 1 and 9: 1 requests modest compression but fast operation, and 9 requests compression as good as feasible, thus requiring more computation.
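The trade-off controlled by level can be explored with a few lines such as these, a sketch that assumes a current Python 3 interpreter and byte-string data. The sample text is deliberately repetitive; how much each level gains on real data depends on the data.

```python
import zlib

data = b'four score and seven years ago\n' * 500

fast = zlib.compress(data, 1)       # modest compression, least work
default = zlib.compress(data)       # the module's default level
best = zlib.compress(data, 9)       # most thorough, most work

# Collect the sizes for comparison: original first, then each level.
sizes = (len(data), len(fast), len(default), len(best))

# Whatever the level, decompressing gives back the original data.
roundtrips = [zlib.decompress(c) == data for c in (fast, default, best)]
```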

decompress