5.1 Envisioning an Object

The Rebase project provides a set of files that specify restriction enzymes, their cut sites, and a great deal more information. Consider the problem of designing an object-oriented version of code that uses this data. What will be the objects and the methods?

Each restriction enzyme has a name; associated with its name are the definition of its recognition site (which I'll translate into a Perl regular expression), information about the chemistry of the restriction enzyme, vendors of the enzyme, and other annotation. This information is all part of the Rebase database.
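The translation of a recognition site into a Perl regular expression can be sketched as follows. This is only an illustration under stated assumptions: the subroutine name `site_to_regex` is hypothetical, and the sketch assumes a site consists solely of IUPAC nucleotide codes plus an optional `^` marking the cut position. Real Rebase entries can carry other notation that this does not handle.

```perl
use strict;
use warnings;

# IUPAC nucleotide ambiguity codes mapped to regular expression fragments.
my %iupac = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[AG]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[CG]',  W => '[AT]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Hypothetical helper (not from the chapter): translate a recognition
# site into a regular expression string.
sub site_to_regex {
    my ($site) = @_;
    $site =~ s/\^//g;    # assumption: '^' only marks the cut position
    my $regex = '';
    foreach my $base (split //, uc $site) {
        die "Unknown IUPAC code '$base'\n" unless exists $iupac{$base};
        $regex .= $iupac{$base};
    }
    return $regex;
}

print site_to_regex('G^AATTC'), "\n";    # EcoRI site, prints GAATTC
print site_to_regex('GTY^RAC'), "\n";    # ambiguity codes, prints GT[CT][AG]AC
```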

Perhaps I should consider each restriction enzyme as a suitable candidate for my basic object. I can then read in the Rebase database, creating an object for each restriction enzyme that includes such attributes as the recognition site, the translation of the recognition site into a Perl regular expression, and whatever additional annotation I find useful.

With such objects, I can associate methods that take as their arguments sequence data and return the list of locations in which that particular enzyme has a recognition site in the sequence. Sounds good, let's start coding!
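A minimal sketch of this one enzyme/one object idea might look like the following. The class name `Enzyme`, its attribute names, and the `locations` method are all hypothetical, chosen only for illustration. The sketch also assumes the regular expression has already been worked out for the enzyme.

```perl
use strict;
use warnings;

package Enzyme;    # hypothetical class name

sub new {
    my ($class, %args) = @_;
    my $self = {
        _name  => $args{name},
        _site  => $args{site},
        _regex => $args{regex},
    };
    return bless $self, $class;
}

sub name { $_[0]{_name} }

# Return the 1-based start positions of every recognition site
# for this enzyme in $sequence.
sub locations {
    my ($self, $sequence) = @_;
    my $regex = $self->{_regex};
    my @positions;
    # The zero-width lookahead means overlapping sites are also reported.
    while ($sequence =~ /(?=$regex)/ig) {
        push @positions, pos($sequence) + 1;
    }
    return @positions;
}

package main;

my $ecori = Enzyme->new(name => 'EcoRI', site => 'G^AATTC', regex => 'GAATTC');
my @found = $ecori->locations('CCGAATTCGGGAATTCAA');
print "@found\n";    # prints 3 11
```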

But wait. What happens if, as is often the case, you want to find the sites of multiple restriction enzymes in a sequence and display the resulting map? With my design, you'd have to find the object associated with each restriction enzyme, pass the sequence to it, collect the locations, and then combine the individual lists of locations in order to display the map. This can be slow (finding the right objects, one for each restriction enzyme) and inconvenient (combining the output of the various methods from the various objects).

You recognize this questioning as an essential step in program design: thinking about the problem and considering alternative ways to write code that solves it. I reprise the idea here because, so far, I've simply been presenting and discussing solutions. Although that's neat and tidy, it isn't really the way programming works. Programming often involves thinking of alternative program strategies, comparing them, coding the most promising alternatives as prototypes and testing them (i.e., benchmarking), and finally deciding on an approach to implement.

So, in that spirit, what alternatives come to mind to the one enzyme/one object approach just described? The Rebase database is essentially a key/value lookup database, in which the key is the enzyme name. The value is the recognition site or annotation (the database actually provides several datafiles). But I'm most interested in getting the recognition site, translating it to a Perl regular expression, and reporting on the locations in some sequence data. A nice interface to display some of the annotation of the restriction enzyme would also be useful.

Any key/value type of data immediately brings the hash data structure to the mind of the Perl programmer. As you know from my introduction to object-oriented programming, the hash data structure is also the most useful way to implement an object.

So, perhaps instead of many objects, one for each restriction enzyme, you may want to consider one object that provides the fast lookup of a value (the recognition site and regular expression) for each key (the name of the restriction enzyme). Clearly, this can be implemented as a hash. Other attributes can hold the sequence and the map as an array of the positions in the sequence in which the recognition sites exist. Methods for the object could extract the site, the regular expression, and perhaps some annotation, for each enzyme. A method can also locate the recognition sites for an enzyme in the sequence.
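To make the single-object alternative concrete, here is a rough sketch. It is not the class developed later in the chapter: the name `RebaseSketch` and the methods `get_site` and `get_regex` are hypothetical, and for brevity the enzyme data is passed in directly rather than read from the Rebase datafiles.

```perl
use strict;
use warnings;

package RebaseSketch;    # hypothetical name, for illustration only

# %data maps each enzyme name to { site => ..., regex => ... }.
sub new {
    my ($class, %data) = @_;
    my %enzymes = %data;    # copy so the object owns its lookup table
    my $self = { _enzymes => \%enzymes };
    return bless $self, $class;
}

# Look up the recognition site for an enzyme name; undef if unknown.
sub get_site {
    my ($self, $name) = @_;
    return undef unless exists $self->{_enzymes}{$name};
    return $self->{_enzymes}{$name}{site};
}

# Look up the regular expression for an enzyme name; undef if unknown.
sub get_regex {
    my ($self, $name) = @_;
    return undef unless exists $self->{_enzymes}{$name};
    return $self->{_enzymes}{$name}{regex};
}

package main;

my $rebase = RebaseSketch->new(
    EcoRI   => { site => 'G^AATTC', regex => 'GAATTC' },
    HindIII => { site => 'A^AGCTT', regex => 'AAGCTT' },
);

print $rebase->get_site('EcoRI'),    "\n";    # prints G^AATTC
print $rebase->get_regex('HindIII'), "\n";    # prints AAGCTT
```

One object now answers for every enzyme, so finding the data for a given name is a single hash lookup rather than a search through many objects.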

If we go that way, how will we manage the actual restriction maps that are made? A restriction map has as input some sequence and a list of restriction enzymes, and has as output a list of the locations where the enzymes have recognition sites in the sequence. Should there be another kind of object, a Restriction object, that has attributes of sequence, enzyme names, and locations of recognition sites?

Perhaps we can use the SeqFileIO class from Chapter 4 as a base class for a new derived class that adds attributes for restriction maps on the sequence.

That might be possible, but it combines file manipulations with restriction mapping and seems, at best, a shotgun wedding.

So, after careful reflection, consultation with colleagues, a lab meeting, pressure from the PI, an opinion from an outside expert, and some quick and dirty Perl scripts to see some alternatives in action, a decision is reached. We'll make a big Rebase object to hold the enzyme/recognition site data, plus a new Restriction object that holds the sequence and the locations of the recognition sites (the "map"). The Restriction class will provide the methods needed to calculate a restriction map.
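The division of labor just decided on can be sketched roughly as below. Everything here is an assumption made for illustration: the class name `RestrictionSketch`, its attributes, and the `get_map` method are hypothetical, and a plain hash of enzyme names and regular expressions stands in for the Rebase object.

```perl
use strict;
use warnings;

package RestrictionSketch;    # hypothetical name, for illustration only

sub new {
    my ($class, %args) = @_;
    my $self = bless {
        _rebase   => $args{rebase},      # hash ref: enzyme name => regex
        _sequence => $args{sequence},
        _enzymes  => $args{enzymes},     # array ref of enzyme names
        _map      => {},                 # enzyme name => array ref of positions
    }, $class;
    $self->_make_map;
    return $self;
}

# Fill in the map: for each requested enzyme, record the 1-based
# start positions of its recognition sites in the sequence.
sub _make_map {
    my ($self) = @_;
    my $sequence = $self->{_sequence};
    foreach my $name (@{ $self->{_enzymes} }) {
        my $regex = $self->{_rebase}{$name};
        next unless defined $regex;      # skip enzymes with no known regex
        my @positions;
        while ($sequence =~ /(?=$regex)/ig) {
            push @positions, pos($sequence) + 1;
        }
        $self->{_map}{$name} = \@positions;
    }
    return;
}

sub get_map { $_[0]{_map} }

package main;

my %regex_for = (EcoRI => 'GAATTC', HindIII => 'AAGCTT');

my $restrict = RestrictionSketch->new(
    rebase   => \%regex_for,
    sequence => 'CCGAATTCGGGAATTCAAGCTT',
    enzymes  => ['EcoRI', 'HindIII'],
);

my $map = $restrict->get_map;
foreach my $name (sort keys %$map) {
    print "$name: @{ $map->{$name} }\n";
}
# prints:
# EcoRI: 3 11
# HindIII: 17
```

Because the sequence and the map live in the same object, the locations for several enzymes are already collected in one place when it comes time to display them.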

One of the considerations that led to this decision was that, at some point, it will be necessary to graphically display the restriction map; an object that contains the sequence and the map (the locations of the recognition sites in the sequence) will be well suited for adding some graphics capabilities.

Finally, in this chapter we'll use the Restriction class as a base class to develop a Restrictionmap class that does have some graphics capabilities.

For more discussion of how to design the component parts of this software development project, see the exercises at the end of the chapter.