Chapter 6. Loading the Warehouse

Loading data should be the easiest part of data warehousing, right? Well, it's not. In fact, 90% of the problems in data warehouses for which I've been the production support DBA have been with the nightly batch cycles that load the data. In other words, when the beeper goes off at 2 a.m. four nights a week, it's usually because some data load batch job missed its "must start by" or "must complete by" time and therefore royally screwed up the remaining jobs for that cycle. The nightly paging was so bad on one project that my wife asked me to leave a perfectly good job because she needed her sleep (as it was affecting her job performance). Now that's serious!

So why is this true? Every developer who's ever worked with Oracle has had to load data at some point. And, loading data is not rocket science. But remember, we're talking about a data warehouse, where size does matter, and in a big way (no pun intended). A typical, non-aggregate fact table may require the loading of tens to hundreds of millions of rows per day. You cannot write inefficient data loading programs when dealing with that much raw data. But, the typical developer has not had to deal with such staggering sizing issues in his or her primarily OLTP-based experience. Furthermore, most Oracle developers tend to write record-oriented code (i.e., using cursors), which does not make effective use of multi-CPU machines. Hence, even very good developers generally produce inefficient data loading programs at first. Thus, the production support DBA often must educate and inspire them regarding the techniques in this chapter. The best method that I've found to date is to call the responsible developers every time you get paged at night, which seems to make the point both quickly and convincingly. Plus, it just feels darn good to share the pain.

Even after you have created efficient data loading programs, there is still one more reason that data loading will represent the majority of your production support problems: concurrent job mixture and dependencies. Often, finding just the right execution order for dependent jobs and job sets on your existing hardware for your permitted time schedule is like finding a needle in a haystack. It's not uncommon to hold regular team meetings just to review and modify batch job schedules based on the most recent execution experiences. And as your data warehouse adds new data load jobs over time, these meetings also provide an excellent forum and foundation during which to request hardware upgrades.