17.4 Optimization

"First make it work. Then make it right. Then make it fast." This quotation, often with slight variations, is widely known as the golden rule of programming. As far as I've been able to ascertain, the quotation is attributed to Kent Beck, who credits his father with it. Being widely known makes the principle no less important, particularly because it's more honored in the breach than in the observance. A negative form, slightly exaggerated for emphasis, is in a quotation by Don Knuth: "Premature optimization is the root of all evil in programming."

Optimization is premature if your code is not working yet. First make it work. Optimization is also premature if your code is working but you are not satisfied with the overall architecture and design. Remedy structural flaws before worrying about optimization: first make it work, then make it right. These first two steps are not optionalworking, well-architected code is always a must.

In contrast, you don't always need to make it fast. Benchmarks may show that your code's performance is already acceptable after the first two steps. When performance is not acceptable, profiling often shows that all performance issues are in a small subset, perhaps 10% to 20% of the code where your program spends 80% or 90% of the time. Such performance-crucial regions of your code are also known as its bottlenecks, or hot spots. It's a waste of effort to optimize large portions of code that account for, say, 10% of your program's running time. Even if you made that part run 10 times as fast (a rare feat), your program's overall runtime would only decrease by 9%, a speedup no user will even notice. If optimization is needed, focus your efforts where they'll matter, on bottlenecks. You can optimize bottlenecks while keeping your code 100% pure Python. In some cases, you can resort to recoding some computational bottlenecks as Python extensions, potentially gaining even better performance.

17.4.1 Developing a Fast-Enough Python Application

Start by designing, coding, and testing your application in Python, often using some already available extension modules. This takes much less time than it would take with a classic compiled language. Then benchmark the application to find out if the resulting code is fast enough. Often it is, and you're donecongratulations!

Since much of Python itself is coded in highly optimized C, as are many of its standard and extension modules, your application may even turn out to be already faster than typical C code. However, if the application is too slow, you need to re-examine your algorithms and data structures. Check for bottlenecks due to application architecture, network traffic, database access, and operating system interactions. For typical applications, each of these factors is more likely than language choice to cause slowdowns. Tinkering with large-scale architectural aspects can often speed up an application dramatically, and Python is an excellent medium for such experimentation.

If your program is still too slow, you should profile it to find out where the time is going. Applications often exhibit computational bottleneckssmall areas of the source code, generally between 10% and 20%, which account for 80% or more of the running time. You can now optimize the bottlenecks, applying the techniques suggested in the rest of this chapter.

If normal Python-level optimizations still leave some outstanding computational bottlenecks, you can recode them as Python extension modules, as covered in Chapter 24. In the end, your application will run at roughly the same speed as if you had coded it all in C, C++, or Fortranor faster, when large-scale experimentation has let you find a better architecture. Your overall programming productivity with this process is not much less than if you coded everything in Python. Future changes and maintenance are easy, since you use Python to express the overall structure of the program, and lower-level, harder-to-maintain languages only for a few specific computational bottlenecks.

As you produce applications in a given area according to this process, you will accumulate a library of reusable Python extension modules for that area. You therefore become more and more productive at developing other fast-running Python applications in the same field.

Even if external constraints should eventually force you to recode the whole application in a lower-level language, you're still better off for having started in Python. Rapid prototyping has long been acknowledged as the best way to get a software architecture just right. A working prototype lets you check that you have identified the right problems and taken the best path to their solution. A prototype affords the kind of large-scale architectural experimentation that can make a real difference to performance. Starting your prototype with Python allows a gradual migration to other languages by way of extension modules. The application remains in a fully functional and testable state at each stage. This ensures against the risk of compromising a design's architectural integrity in the coding stage. The resulting software is likely to be faster and more robust than if all of the coding had been lower-level from the start, and your productivity, while not quite as good as with a pure Python or mostly Python application, is still better than if you had been coding at a lower level throughout.

17.4.2 Benchmarking

Benchmarking is similar to system testing: both activities are like running the program as it's meant to be run for production purposes. In both cases, you need to have at least some subset of the program's intended functionality working, and you need to use known, reproducible inputs. In the case of benchmarking, you don't need to capture and check your program's output: since you make it work and make it right before you make it fast, you are already confident about your program's correctness by the time you benchmark it. You do need inputs that are representative of typical system operations, particularly those that may be most challenging for your program's performance. If your program performs several kinds of operations, make sure you run one or two benchmarks for each different kind of operation.

Elapsed time as measured by your wristwatch is probably precise enough to benchmark most programs. Programs with hard real-time constraints are obviously another matter, but they have needs very different from those of normal programs in most respects. A 5% or 10% difference in performance, except for programs with very peculiar constraints, makes no practical difference to a program's real-life usability.

When you benchmark "toy" programs in order to help you choose an algorithm or data structure, you may need more precision. In that case, you may want to set up an artificial environment, with a machine as quiescent as possible, no network activity, and accurate timekeeping. Python time operations are covered in Chapter 12. The benchmarking discussed in this section is a different kind of issue: an approximation of real-life program operation, for the sole purpose of checking whether the program's performance at each task is acceptable, before embarking on profiling and other optimization activities. For such system benchmarking, a situation that approximates the program's normal operating conditions is best, and accuracy in timing is not particularly important.

17.4.3 Large-Scale Optimization

The aspects of your program that are most important for performance are large-scale ones: choice of algorithms, overall architecture, and choice of data structures.

The performance issues that you must almost always take into account are those connected with the traditional big-O notation of computer science. Informally, if you call N the input size of an algorithm, big-O notation expresses algorithm performance, for large values of N, as proportional to a function of N (in precise computer science lingo, this should actually be called big-Theta, but in practical use programmers in the field call this big-O). An O(N) algorithm is one where, for large enough N, handling twice as much data takes about twice as much time, three times as much data three times as much time, and so on, growing linearly with N. An O(N²) algorithm is one where, for large enough N, handling twice as much data takes about four times as much time, three times as much data nine times as much time, and so on, growing with N squared.

You will find more information on big-O notation, as well as other issues about algorithms and their complexity, in any good book about algorithms and data structures. Unfortunately, at the time of this writing, there aren't yet any such books using Python. However, if you are at least moderately familiar with C, I can recommend Mastering Algorithms with C, by Kyle Loudon (O'Reilly).

To understand the practical importance of big-O considerations in your programs, consider two different ways to accept all items from an input iterator and accumulate them into a list in reverse order:

def slow(it):
    result = [  ]
    for item in it: result.insert(0, item)
    return result

def fast(it):
    result = [  ]
    for item in it: result.append(item)
    result.reverse(  )
    return result

We could express each of these functions more concisely, but the key difference is best appreciated by presenting them in these elementary terms. Function slow builds the result list by inserting each input item before all previously received ones. Function fast appends each input item after all previously received ones, then reverses the result list just before returning it. Intuitively, one might think that the final reversing represents extra work, and therefore slow should be faster than fast. But that's not the way things work out.

Each call to result.append takes roughly the same amount of time, independent of how many items are already in list result, since there is always a free slot for an extra item at the end of the list. The for loop in function fast executes N times to receive N items. Since each iteration of the loop takes a constant time, overall loop time is O(N). result.reverse also takes time O(N), as it is directly proportional to the total number of items. Thus, the total running time of fast is also O(N). (If you don't understand why a sum of two quantities, each O(N), is also O(N), consider that the sum of two linear functions of N is also a linear function of N).

In contrast, each call to result.insert must make space at slot 0 for the new item to insert, by moving all items that are already in list result forward one slot. That takes a time proportional to the number of items that are already in the list. The overall amount of time to receive N items is therefore proportional to 1+2+3+...N-1, a sum whose value is O(N²). Therefore, the total running time of slow is also O(N²).

It's almost always worth replacing an O(N²) solution with an O(N) one, unless you can somehow assign rigorous limits to the input size N. If N can grow without bounds, the O(N²) solution will inevitably turn out to be disastrously slower than the O(N) one for large enough values of N, no matter what the proportionality constants in each case may be (and no matter what profiling tells you). Unless you have other O(N²) or even worse bottlenecks elsewhere that you cannot eliminate, a part of the program that is O(N²) will inevitably turn into the program's bottleneck and dominate runtime for large enough values of N. Do yourself a favor and watch out for the big O: all other performance issues, in comparison, are insignificant.

Incidentally, function fast can be made substantially faster by expressing it in more idiomatic Python. Just replace the first two lines with the single statement:

result = list(it)

This change does not affect fast's big-O character (fast is still O(N) after the change), but does speed things up by a constant factor. Often, in Python, the simplest, clearest, most idiomatic way to express something is also the fastest.

Choosing algorithms with good big-O characteristics is roughly the same task in Python as in any other language. You just need a few indications about the big-O performance of Python's elementary building blocks.

17.4.3.1 List operations

Python lists are internally implemented with vectors (also known as arrays), not with linked lists. This fundamental implementation choice determines just about all performance characteristics of Python lists, in big-O terms.

Chaining two lists of length N1 and N2 is O(N1+N2). Multiplying a list of length N by the number M is O(N*M). Accessing or rebinding any list item is O(1) (also known as constant time, meaning that the time taken does not depend on how many items are in the list). len( ) on a list is also O(1). Accessing any slice of length M is O(M). Rebinding a slice of length M with one of identical length is also O(M). Rebinding a slice of length M1 with one of different length M2 is O(M1+M2+N1), where N1 is the number of items after the slice in the target list.

Most list methods, as shown way back in Table 4-3, are equivalent to slice rebindings and have the same big-O performance. Methods count, index, remove, and reverse, and operator in, are O(N). Method sort is generally O(N*log(N)), but has optimizations that let it be O(N) in some important special cases, like when the list is already sorted, reverse sorted, or sorted except for a few items at the end (in Python 2.3, sort will also be O(N) in a few more important special cases). range(a,b,c) is O((b-a)/c). xrange(a,b,c) is O(1), but looping on xrange's result is O((b-a)/c).

17.4.3.2 String operations

Most methods on a string of length N (be it plain or Unicode) are O(N). len(astring) is O(1). The fastest way to produce a copy of a string with transliterations and/or removal of specified characters is the string's method translate. The most practically important big-O consideration involving strings is covered in Section 17.4.5 later in this chapter.

17.4.3.3 Dictionary operations

Python dictionaries are internally implemented with hash tables. This fundamental implementation choice determines just about all performance characteristics of Python dictionaries, in big-O terms.

Accessing, rebinding, adding, or removing a dictionary item is generally O(1), as are methods has_key, get, setdefault, and popitem, and operator in. d1.update(d2) is O(len(d2)). len(adict) is O(1). Methods keys, items, and values are O(N). Methods iterkeys, iteritems, and itervalues are O(1), but looping on the iterators that those methods return is O(N). When the keys in a dictionary are instances of classes that define _ _hash_ _ and equality comparison methods, dictionary performance is of course affected by those methods. The indications presented in this paragraph are valid only if both hashing and equality comparison are O(1).

17.4.4 Profiling

Most programs have hot spots (i.e., regions of source code that account for most of the time elapsed during a program run). Don't try to guess where your program's hot spots are; programmers' intuition is notoriously unreliable in this field. Use module profile to collect profile data over one or more runs of your program, with known inputs. Then, use module pstats to collate, interpret, and display that profile data. To gain accuracy, you can calibrate the Python profiler for your machine (i.e., determine what overhead profiling incurs on your machine). Module profile can then subtract this overhead from the times it measures so that the profile data you collect is closer to reality.

17.4.4.1 The profile module

The profile module supplies one function you will often use.

run

run(code,filename=None)

code is a string such as you could use with statement exec, normally a call to the main function of the program you're profiling. filename is the path of a file that run creates or rewrites with profile data. Usually you call run a few times, specifying different filenames, and possibly different arguments to your program's main function, in order to exercise various program parts proportionately. Then, you use module pstats to display collated results.

You may call run without a filename to obtain a summary report, similar to the one module pstats could give you, directly on standard output. However, this approach gives no control at all over output format, nor does it offer any way to consolidate several runs into one report. In practice, you rarely use this feature: collecting profile data into files is generally preferable.

Module profile also supplies class Profile, mentioned in the next section. By instantiating Profile directly, you can access advanced functionality, such as the ability to run a command in specified local and global dictionaries. I do not cover such advanced functionality of class profile.Profile further in this book.

17.4.4.2 Calibration

To calibrate profile for your machine, you need to use class Profile, which module profile supplies and internally uses in function run. An instance p of Profile supplies one method you use for calibration.

calibrate

p.calibrate(N)

Loops N times, then returns a number that is the profiling overhead per call on your machine. N must be large if your machine is fast. Call p.calibrate(10000) a few times and check that the various numbers it returns are very close to each other, then pick the smallest one of them. If the numbers exhibit substantial variation, try again with larger values of N.

The calibration procedure can be time consuming. However, you need to perform it only once, repeating it only when you make changes that could alter your machine's characteristics, such as applying patches to your operating system, adding memory, or changing Python version. Once you know your machine's overhead, you can tell profile about it each time you import it, right before using profile.run. The simplest way to do this is as follows:

import profile
profile.Profile.bias = ...the overhead you measured...
profile.run('main(  )', 'somefile')

17.4.4.3 The pstats module

The pstats module supplies a single class, Stats, that you use to analyze, consolidate, and report on the profile data contained in one or more files written by function profile.run.

Stats

class Stats(filename,*filenames)

Instantiates Stats with one or more filenames of files of profile data written by function profile.run.

An instance s of class Stats provides methods to add profile data and sort and output results. Each method returns s, so you can chain several calls in the same expression. s's main methods are as follows.

add

s.add(filename)

Adds another file of profile data to the set that s is holding for analysis.

print_callees, print_callers

s


.print_callees(*restrictions)

Outputs the list of functions in s's profile data, sorted according to the latest call to s.sort_stats, and subject to the given restrictions, if any. You can call each printing method with zero or more restrictions, which are applied one after the other, in order, to reduce the number of output lines. A restriction that is an integer n limits the output to the first n lines. A restriction that is a floating-point value f between 0.0 and 1.0 limits the output to a fraction f of the lines. A restriction that is a string is compiled as a regular expression (as covered in Chapter 9); only lines satisfying a search method call on the regular expressions are output. Restrictions are cumulative. For example, s.print_calls(10,0.5) outputs the first 5 lines (half of 10). Output restrictions apply only after the summary and header lines: summary and header are output unconditionally.

Each function f that is output is accompanied by the list of f's callers (the functions that called f) or f's callees (the functions that f called) according to the name of the method.

print_stats

s.print_stats(*restrictions)

Outputs statistics about s's profile data, sorted according to the latest call to s.sort_stats, and subject to the given restrictions, if any, as covered in print_callees. After a few summary lines (date and time on which profile data was collected, number of function calls, and sort criteria used), the output, absent restrictions, is one line per function, with six fields per line, labeled in a header line. For each function f, print_stats outputs six fields:

Total number of calls to function f
Total time spent in function f, exclusive of other functions that f called
Total time per call (i.e., field 2 divided by field 1)
Cumulative time spent in function f, and in all functions directly or indirectly called from f
Cumulative time per call (i.e., field 4 divided by field 1)
The name of function f

sort_stats

s.sort_stats(key, *keys)

Gives one or more keys (primary first, if more than one) on which to sort future output. Each key is a string. The sort is descending for keys indicating times or numbers, alphabetical (ascending) for key 'nfl'. The most frequently used keys when calling sort_stats are:

'calls': Number of calls to the function (like field 1 covered in print_stats)
'cumulative': Cumulative time spent in the function and all functions it called (like field 4 i covered in print_stats)
'nfl': Name of the function, its module, line number of the function in its file (like field 6 covered in print_stats)
'time': Total time spent in the function itself, exclusive of functions it called (like field 2 covered in print_stats)

strip_dirs

s.strip_dirs(  )

Alters s by stripping directory names from all the module names that s holds, to make future output more compact. s is unsorted after s.strip_dirs( ), and therefore you normally call s.sort_stats with the arguments you desire right after calling s.strip_dirs.

17.4.5 Small-Scale Optimization

Fine tuning of program operations is rarely important. Such tuning may make a small but meaningful difference in some particularly hot spot, but hardly ever is it a decisive factor. And yet, such fine tuning, in the pursuit of mostly irrelevant microefficiencies, is where a programmer's instincts are likely to lead. It is in good part because of this that most optimization is premature and best avoided. The most that can be said in favor of fine tuning is that, if one idiom is always speedier than another when the difference is measurable, it's worth getting into the habit of always using the former and not the latter.

Most often, in Python, if you do what comes naturally and choose simplicity and elegance, you end up with code that has good performance as well as clarity and maintainability. In a few cases, an approach that may not be intuitive offers performance advantages, as discussed in the rest of this section.

The simplest possible optimization is to run your Python programs using python -O or -OO. -OO makes little direct difference to performance compared to -O, but -OO may save memory, since it removes docstrings from the bytecode, and memory availability is sometimes (indirectly) a performance bottleneck. The optimizer is not very powerful in current releases of Python, but it may still gain you performance advantages on the order of 10%, sometimes as large as 20% (potentially even larger, if you make heavy use of assert statements and if _ _debug_ _: guards as suggested in Chapter 6). The best aspect of -O is that it costs nothingas long as your optimization isn't premature, of course. -O does impede use of debuggers, such as pdb, and may thus make debugging somewhat harder if your program isn't fully tested and working correctly. So, don't use -O on a program you're still developing.

17.4.5.1 Building up a string from pieces

The single Python anti-idiom that's likeliest to kill your program's performance, to the point that you should never use it, is to build up a large string from pieces by looping on string concatenation statements such as big_string+=piece. Since Python strings are immutable, such a concatenation makes Python free the M bytes previously allocated for big_string, and allocate and fill M+K bytes for the new version. Doing this repeatedly in a loop, you end up with roughly O(N²) performance, where N is the total number of characters. More often than not, O(N²) performance where O(N) is available is a performance disaster. On some platforms, things may be even bleaker due to memory fragmentation effects caused by freeing many memory areas, all of different sizes, and allocating progressively larger ones.

To achieve O(N) performance, accumulate intermediate pieces in a list rather than building up the string piece by piece. Lists, unlike strings, are mutable, so appending to a list has O(1) performance (amortized). Change each occurrence of big_string+=piece into temp_list.append(piece). Then, when you're done accumulating, use the following to build your desired string result in O(N) time:

big_string = ''.join(temp_list)

Other O(N) ways to build up big strings are to concatenate the pieces to an instance of array.array('c'), or to write the pieces to an instance of cStringIO.StringIO.

In the special case where you want to output the resulting string, you may gain a further small slice of performance by using writelines on temp_list (never building big_string in memory). When feasible (i.e., when you have the output file object open and available in the loop), it's at least as effective to perform a write call for each piece, without any accumulation.

Although not nearly as crucial as += on a big string in a loop, another case where removing string concatenation may give a slight performance improvement is when you're concatenating several values in an expression:

oneway = str(x)+' eggs and '+str(y)+' slices of '+k+' ham'
another = '%s eggs and %s slices of %s ham' % (x, y, k)

Using operator % for string formatting is often a good performance choice.

17.4.5.2 Searching and sorting

Operator in, the most natural tool for searching, is O(1) when the right-hand side operand is a dictionary, but O(N) when the right-hand side operand is a list. If you need to perform many searches on a container, you're generally much better off using a dictionary, rather than a list, as the container. Python dictionaries are highly optimized for searching and fetching items by key.

Method sort of Python lists is also a highly optimized and sophisticated tool. You can rely on sort's performance. Performance dramatically degrades, however, if you pass sort a custom callable to perform comparisons in order to sort a list based on anything but built-in comparisons. To satisfy such needs, consider using the decorate-sort-undecorate (DSU) idiom instead. This idiom has the following steps:

decorate: Build an auxiliary list A where each item is a tuple made up of the sort keys, ending with the item of the original list L or with the item's index
sort: Call A.sort( ) without arguments
undecorate: Extract the items in order from the now-sorted A

The decorate and undecorate steps are most often handily performed with list comprehensions. If you need the sort to be in-place, assign the final sorted list to L[:]. Otherwise, DSU provides a sorted copy, without disturbing the original list L.

For example, say we have in L a large list of strings, each of at least two words, and we want to sort L in-place by the second word of each string:

A = [ (s.split(  )[1], s) for s in L ]
A.sort(  )
L[:] = [ t[1] for t in A ]

This is much faster than passing to L.sort a function that compares two strings by their second words, as in:

def cmp2ndword(a, b): return cmp(a.split(  )[1], b.split(  )[1])
L.sort(cmp2ndword)

On a series of benchmarks with Python 2.2 on lists of 10,000 strings, I measured the DSU version as 7 to 10 times faster than the non-DSU one.

17.4.5.3 Avoiding exec and from ... import *

If a function contains an exec statement without explicit dictionaries, the whole function slows down substantially. The presence of such an exec statement forces the Python compiler to avoid the modest but precious optimizations it normally performs because such an exec might cause any alteration at all to the function's namespace. A from statement of the form:

from MyModule import *

causes similar performance loss, since it, too, can alter a function's namespace unpredictably.

exec itself is also quite slow, particularly if you apply it to a string of source code rather than to a code object. By far the best approach, for performance, for correctness, and for clarity, is to avoid exec altogether. It's most often possible to find better (faster, more solid, and clearer) solutions. If you must use exec, always use it with explicit dictionaries. If you need to exec a dynamically obtained string more than once, compile the string one time and repeatedly exec the resulting code object.

eval works on expressions, not on statements; therefore, although it's still slow, at least it avoids some of the worst performance impacts of exec. With eval, too, you're best advised to use explicit dictionaries, and, if you need repeated evaluation of the same dynamically obtained string, compile the string just once, then repeatedly eval the resulting code object.

17.4.5.4 Optimizing loops

Most of your program's bottlenecks will be in loops, particularly nested loops, because loop bodies often execute repeatedly. Python does not implicitly perform any code hoisting: if you have any code inside a loop that might be executed just once by hoisting it out of the loop, and the loop is a performance bottleneck, hoist the code out yourself. Sometimes the presence of code to hoist may not be immediately obvious:

def slower(anobject, ahugenumber):
    for i in xrange(ahugenumber): anobject.amethod(i)
def faster(anobject, ahugenumber):
    themethod = anobject.amethod
    for i in xrange(ahugenumber): themethod(i)

In this case, the code that faster hoists out of the for loop is the attribute lookup anobject.amethod. slower repeats the lookup each and every time, while faster performs it just once. The two functions are not 100% equivalent: it is (just barely) conceivable that executing amethod might cause such changes on anobject that the next lookup for the same named attribute fetches a different method object. This is part of why Python doesn't perform such optimizations itself. In practice, such subtle, obscure, and tricky cases happen far less than one time in ten thousand. So you're quite safe performing such optimizations yourself, when you're trying to squeeze the last drop of performance out of some crucial bottleneck.

It's faster for Python to use local variables than global ones. So, if one of your loops is repeatedly accessing a global variable whose value does not change between iterations of the loop, put the value in a local variable and have the loop access the local variable instead. This also applies to built-in functions:

def slightly_slower(asequence, adict):
    for x in asequence: adict[x] = hex(x)
def slightly_faster(asequence, adict):
    myhex = hex
    for x in asequence: adict[x] = myhex(x)

Here, the speedup is very modest, on the order of 5% or so.

Do not cache None. None is currently an ordinary built-in identifier, but it is scheduled to become a keyword in Python 2.3 or 2.4, so no further optimization will be needed.

List comprehensions can be faster than loops, and so can map and filter. For optimization purposes, try changing loops into list comprehensions or map and filter calls where feasible. However, the performance advantage of map and filter is nullified if you have to use a lambda or an extra level of function call. Only when you pass to map or filter a built-in function, or a function you'd have to call anyway even from an explicit loop, do you stand to gain.

The loops that you can replace most naturally with list comprehensions, or map and filter calls, are loops that build up a list by repeatedly calling append on the list. In such cases, if you know in advance the length of the resulting list, a further optimization is available. Predefine the result list to the right length (e.g., with result=[None]*N), introduce an explicit index i that starts at 0 and grows by one at each iteration of the loop, and change each call to result.append(x) into result[i]=x. The following example shows this optimization in the context of a typical microperformance benchmark script:

import time

def slow(asequence):
    result = [  ]
    for x in asequence: result.append(-x)
    return result

def middling(asequence):
    return [ -x for x in asequence ]

def fast(asequence):
    result = [None]*len(asequence)
    for i in xrange(len(asequence)): result[i] = -asequence[i]
    return result

biggie = xrange(500*1000)
tentimes = [None]*10
def timit(afunc):
    lobi = biggie
    start = time.clock(  )
    for x in tentimes: afunc(lobi)
    stend = time.clock(  )
    return "%-10s: %.2f" % (afunc._ _name_ _, stend-start)

for afunc in slow, middling, fast, fast, middling, slow:
    print timit(afunc)

Running this example with python -O (on a PC with a 1.2 GHz Athlon CPU, Python 2.2.1) shows fast taking 4.30 seconds, middling 4.81 to 4.84 seconds, and slow 6.50 to 7.02 seconds, on Windows 98. The time ranges on Linux are 4.19- 4.20, 5.15-5.20, and 6.91-7.00, respectively. With the current alpha version of Python 2.3 on Linux, the time ranges are 3.35-3.37 for fast, 4.61-4.64 for middling, and 6.43-6.44 for slow. In summary, on this machine, slow is 35%-40% slower than middling, and middling is about 15%-25% slower than fast (and Python 2.2 is 10%-25% slower than the current alpha of Python 2.3).

17.4.5.5 Optimizing I/O

If your program does substantial amounts of I/O, it's likely that performance bottlenecks are due to I/O, not to computation. Such programs are said to be I/O bound, rather than CPU bound. Your operating system tries to optimize I/O performance, but you can help it in a couple of ways. One such way is to perform your I/O in chunks of a size that is optimal for performance, rather than simply being convenient for your program's operations. Another way is to use threading.

From the point of view of a program's convenience and simplicity, the ideal amount of data to read or write at a time is generally small (one character or one line) or very large (an entire file at a time). That's often okay, because Python and your operating system work behind the scenes to let your program use convenient logical chunks for I/O, while arranging physical I/O operations with chunk sizes that are more attuned to performance. Reading and writing a whole file at a time is quite likely to be okay for performance as long as the file is not inordinately large. Specifically, file-at-a-time I/O is fine as long as the file's data fits in your machine's physical memory, leaving enough physical memory available for your program and operating system to perform whatever other tasks they need to perform at the same time. The hard problems of I/O-bound program performance tend to come with huge files.

If performance is an issue, don't use a file object's readline method, which is limited in the amount of chunking and buffering it can perform. Using writeline, on the other hand, gives no performance problem when that method is the one most convenient for your program. Loop directly on the file object (in Python 2.2) to get one line at a time with the best performance. If the file isn't too huge, time two versions of your program, one that loops directly on the file object and one that calls method readlines, which reads the whole file into memory. Either solution may prove faster. In Python 2.1, you can't loop directly on the file object. Instead, use method xreadlines in a for loop. xreadlines will be deprecated in Python 2.3, but if you need top performance in this specific case and need to support Python 2.1, there is no alternative.

For binary files, specifically large binary files of whose contents you need just a part on each run of your program, module mmap, covered in Chapter 14, can often give you both good performance and program simplicity.

Making an I/O-bound program multithreaded may sometimes afford substantial performance gains if you can arrange your program's architecture accordingly. Start a few worker threads devoted exclusively to I/O, have the computational threads request I/O operations from the I/O threads via Queue instances, and try to post the request for each input operation as soon as you know you'll eventually need that data. Performance will increase only if there are other tasks your computational threads can perform while an I/O thread is blocked waiting for data. Basically, you get better performance this way if you can manage to overlap computation and waiting for data, by having different threads do the computing and the waiting. See Chapter 14 for detailed coverage of Python threading and a suggested architecture.