6.1 One Perl, Many Databases

There comes a time when disk files or the simple DBM hash database (that you've seen in previous chapters) just won't manage the data of a medium- or large-size project, and you must turn to relational databases. Although they take quite a bit more effort to set up and to program, they offer a standard and reliable way to store data and to ask questions about it.

There are two things that make relational databases standard. For one thing, they all follow a certain model of data structures, the relational model. These data structures have become a fixture in the computing world; they combine a level of constraint and flexibility that has proved its usefulness in many areas, including bioinformatics.

Almost all relational databases are programmed with a programming language called the Structured Query Language, or SQL. This is a fairly simple language that creates, populates, queries, and manages the kind of data structures relational databases provide. The combination of a standard data structure with a standard programming language is another reason relational databases have become so successful.

One thing that's not standard is the proliferation of relational database companies and their penchant for doing things their own way. This may sometimes be a marketing decision, but it's more often the natural process of evolution?of different sets of programmers having different ideas and making different implementations.^[1]

^[1] When I first released software that used Perl for bioinformatics, I received a letter arguing that because C was available everywhere and Perl wasn't, Perl for bioinformatics was therefore a Bad Idea and I should use C instead. Of course, Perl is available everywhere now, including on the VMS systems that my correspondent was complaining about, and bioinformatics software is written in a variety of languages. He made the classic mistake of wanting to standardize a field long before it had settled down.

This is important when you have some working database application that uses a particular DBMS such as Oracle, and you find that you have to port the application to work on another DBMS such as MySQL. Perhaps another database system has become significantly faster or cheaper, or your computer is replaced with a new one that supports a different database, or your computer center or CIO decrees that some new DBMS is now the mandatory standard. If your database application makes extensive use of a feature that is available only on your old DBMS, you'll have a lot of work ahead of you rewriting your software to make it work on the new DBMS.

Luckily, thanks to some expert Perl programming, there is a way to get around this proliferation of different DBMS with their special ways of doing things and their special extensions of SQL. In this chapter, I'll use the Perl DBI (DataBase Independent) module that provides a common interface to different relational database systems; it makes it possible to write SQL that will run on many different relational database systems with little or no change.

Still, unless you are subject to a decree, the problem that the Perl bioinformatics programmer faces at the beginning of a project is, "Which relational database system should I use?" It depends on the computer you're on and what DBMS is already in use, available, paid for, or known locally. There are very expensive systems, and there are free ones: we'll take a quick look at some of the alternatives and use one of the most popular free ones for the following examples.

The beginning programmer should be aware that relational databases are a large field of endeavor. Stop at any local bookstore with a good computer book section and you'll see an impressive number of books dedicated to relational databases and SQL in general, and especially dedicated to working with specific relational database management systems such as Oracle, SQL Server (the database, not the language), Sybase, MySQL, etc. There are books devoted to specific tools for designing a database, managing a database system, and programming a database system. In the workplace, there are job titles and positions for people who specialize in these three areas, and more.

So, don't expect this one chapter to reveal all. Do expect it to explain the basic concepts, give you the lay of the land, and demonstrate a practical example you can use as a template as you begin to develop your own code. At the very least, you will want access to the documentation for the particular database system you will use in your own work.

Having given the obligatory warning, I'll also add that, unless you are tackling a fairly large project, relational databases aren't all that difficult to use. If they were, they wouldn't be so popular. You may well spend a lot of your programming time in the future dealing with databases, or you may just spend a little. If you expect it will be a lot, I recommend you do some further reading.

Relational databases are an important topic. Because they have their own software systems, language, and concepts, they are a bit of an additional challenge. Most bioinformaticians need to know the basics of how to use them; some specialize in them. This chapter will get you started.