Fact Table Options

Fact tables are huge. They are so huge that it should come as no surprise that much of your success will depend on how you implement your facts. Thus, I'm offering as general DBA advice my best Joe Friday from Dragnet imitation: "Just the facts …," meaning if you get just the facts done right, you're well on the road to success.

But to do this right, you must look beyond just size. At any given time, fact tables will fall into one of three states:

  • Being queried (covered in Chapter 5)

  • Being loaded (covered in Chapters 6 and 7)

  • Being managed (covered in Chapter 8)

The key point is that to be successful, fact table implementations must accommodate the different requirements of all three states. Far too often, DBAs concentrate on just one or two of these states, and end up with a sub-optimal solution. Or worse yet, they apply misguided "common sense" rather than empirical evidence in making their selections.

Let's start by identifying our basic options for implementing fact tables:

  • Non-partitioned, heap-organized

  • Range-partitioned, heap-organized

  • List-partitioned, heap-organized

  • Hash-partitioned, heap-organized

  • Composite range, hash-partitioned, heap-organized

  • Composite range, list-partitioned, heap-organized

  • Non-partitioned, index-organized

  • Range-partitioned, index-organized

For each of these table implementations, you must also choose whether the table's indexes are:

  • Non-partitioned

  • Partitioned locally

  • Partitioned globally

Finally, for both the tables and the indexes, you must choose between:

  • Logging versus no logging

  • Parallel versus non-parallel

  • Compressed versus non-compressed

That makes for one heck of a lot of choices. Do you really know which is best by merely picking from these lists? I sure don't. But if we reexamine these choices with the proper importance weightings from our three states, the process becomes much easier.
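To make these combinations concrete, here is a sketch of one point in that decision space: a range-partitioned, heap-organized fact table with a locally partitioned index, combining the NOLOGGING, PARALLEL, and COMPRESS options. The table, column, and partition names are invented purely for illustration.

```sql
-- Hypothetical fact table: range-partitioned by period, heap-organized,
-- with the NOLOGGING, PARALLEL, and COMPRESS options all selected.
CREATE TABLE pos_fact (
  period_id   NUMBER NOT NULL,
  product_id  NUMBER NOT NULL,
  location_id NUMBER NOT NULL,
  sale_qty    NUMBER,
  sale_amt    NUMBER
)
NOLOGGING
PARALLEL
COMPRESS
PARTITION BY RANGE (period_id) (
  PARTITION p200401 VALUES LESS THAN (200402),
  PARTITION p200402 VALUES LESS THAN (200403),
  PARTITION p200403 VALUES LESS THAN (200404)
);

-- A locally partitioned index inherits the table's partitioning scheme,
-- so each index partition covers exactly one table partition.
CREATE INDEX pos_fact_period_idx ON pos_fact (period_id)
  LOCAL NOLOGGING PARALLEL COMPRESS;
```

Every list above maps onto a clause in DDL like this, which is why the number of permutations grows so quickly.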

So which of the three states would you label most important? This is actually the most critical distinction in making the proper implementation selection. More often than not, the naturally proposed answer is the query state. But for the moment (or at least until you read Chapter 5), trust me when I say that I can get ultra-fast queries with near identical runtimes utilizing any of these options, as long as I get the right explain plan. So now which state becomes most important?

The second most often proposed answer is the loading state. The idea is that unless you partition, you cannot drop indexes, load data, and recreate indexes in a timely fashion. But this is really a data management answer to what sounds like a loading question: how quickly you can drop and recreate indexes depends chiefly on partition size, which is a data management decision. For example, I could choose to partition my tables and indexes but create only two partitions per object. That would merely cut the index creation time in half, which is not a great result. So this argument really has little to do with the data loading architecture itself.
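As a sketch of why partition granularity matters here: with local indexes, you can disable and rebuild index maintenance for just the partition being loaded rather than the entire index. The table and partition names below are hypothetical.

```sql
-- Disable index maintenance for only the partition about to be loaded.
ALTER TABLE pos_fact MODIFY PARTITION p200403
  UNUSABLE LOCAL INDEXES;

-- ... perform the direct-path load into partition p200403 here ...

-- Rebuild just that partition's local index entries afterward.
ALTER TABLE pos_fact MODIFY PARTITION p200403
  REBUILD UNUSABLE LOCAL INDEXES;
```

The smaller the current partition, the cheaper that rebuild step, which is precisely why this is a partition-sizing (data management) decision rather than a loading-architecture one.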

If you have not guessed by now, it's the management state that should drive your fact table implementation selection. First, most companies do not have an infinite budget for their data warehouses, so at some point, data must be archived. For example, at 7-Eleven, we kept 60 months of data online, with the plan to archive by month once we had reached our limit.

Now, if you choose the simple, non-partitioned, heap-organized table, how would you archive data? The answer is: with a very slow and painful delete command. Moreover, you'd screw up your non-partitioned b-tree index structures, potentially requiring a rebuild. So for archival purposes, you should partition by some time dimension criterion such as week or month. That would, of course, also speed up your loading process's index drop and rebuild, since you'd operate on only the few, smaller partitions that were current.
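To contrast the two archival approaches, here is a sketch (table and partition names invented for illustration):

```sql
-- Non-partitioned table: a slow, redo-generating delete that also
-- maintains every b-tree index row by row as it goes.
DELETE FROM pos_fact WHERE period_id < 200402;

-- Partitioned table: a near-instant metadata operation that removes
-- the aged-out month and its local index partitions together.
ALTER TABLE pos_fact DROP PARTITION p200401;
```

The delete can run for hours against a large fact table; the drop completes in seconds regardless of the partition's size.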

Okay, so partitioned it is (covered in Chapter 8), but which one? The answer is simple. First, let's agree that index-organized tables (IOTs) are great for OLTP lookup scenarios but make a poor choice for huge fact tables. Trust me; don't go this way unless you like lots of elevator music, because you'll be calling Oracle support more often than necessary. Second, let's agree that lots of small date-range buckets could be implemented with equal ease using either range or list partitioning. That cuts our choices in half to just:

  • Range-partitioned, heap-organized

  • List-partitioned, heap-organized

  • Composite range, hash-partitioned, heap-organized

  • Composite range, list-partitioned, heap-organized

Now, remember back in Chapter 3, "Hardware Architecture," when we discussed hardware, specifically the number of CPUs and the degree of parallel operations permitted? Well, a lot of DBAs assume that they must sub-partition to get the highest possible degree of parallel queries. But there are two flaws with this premise. First, does your data lend itself to sub-partitioning? For example, I tried hashed sub-partitions with 7-Eleven's data warehouse. I partitioned by range on my period identifier, making each partition contain a week's worth of data. Then I hash sub-partitioned on my product identifier. The idea was that similar products would hash into the same sub-partitions, so queries on classes of products (such as beer) would reference only those sub-partitions. Sounds good, right? But I had overlooked the nature of my data. The product identifiers were evenly distributed across the entire product identifier domain. Thus, hashing merely spread my data evenly across all the sub-partitions and required an extra level of sub-partition operations to obtain the exact same data. As a result, my queries took twice as long. Needless to say, I went back to plain partitions without sub-partitions.
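A sketch of the kind of composite definition described above (all identifiers hypothetical): because the product identifiers hash uniformly, each weekly partition's rows spread evenly across every sub-partition, so a query on one product class still touches them all.

```sql
-- Composite range-hash: weekly range partitions on period_id, each
-- hash sub-partitioned on product_id. With uniformly distributed
-- product identifiers, this scatters every product class across all
-- sub-partitions instead of clustering similar products together.
CREATE TABLE pos_fact (
  period_id   NUMBER NOT NULL,
  product_id  NUMBER NOT NULL,
  sale_qty    NUMBER,
  sale_amt    NUMBER
)
PARTITION BY RANGE (period_id)
SUBPARTITION BY HASH (product_id) SUBPARTITIONS 8 (
  PARTITION wk200401 VALUES LESS THAN (200402),
  PARTITION wk200402 VALUES LESS THAN (200403)
);
```

Hashing only helps when queries can be pruned to a subset of hash buckets; a uniform key distribution defeats that purpose, as it did here.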

Second, does your hardware have sufficient CPU bandwidth to handle the extra parallel operations permitted by sub-partitioning? If you have 32 processors, but also have 100 people concurrently running reports, then using the parallel feature at the table level is probably already overkill. But I see this same mistake at over half the data warehouses I visit. The DBA assumes that more than one CPU mandates turning on parallel queries. But unless you have more CPUs than your concurrent report load, the parallel feature is a loaded gun waiting to go off in your face. Use it judiciously. Thus, sub-partitioning for an added level of parallel operations is not the "slam dunk" people automatically think it is.
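One hedged remedy under the assumptions above (32 CPUs, 100 concurrent report users, and a hypothetical pos_fact table) is to leave the fact table serial and invoke parallelism explicitly only during low-concurrency maintenance windows:

```sql
-- With far more concurrent report users than CPUs, declare the fact
-- table serial so every user query does not fan out into parallel slaves.
ALTER TABLE pos_fact NOPARALLEL;

-- Parallelism can still be requested explicitly for off-hours work,
-- such as rebuilding an index with a bounded degree.
ALTER INDEX pos_fact_period_idx REBUILD PARALLEL (DEGREE 4);

-- Note: the rebuild leaves the index's stored degree at 4, so reset it
-- afterward to keep everyday queries serial.
ALTER INDEX pos_fact_period_idx NOPARALLEL;
```

The design choice here is simply that the table-level PARALLEL attribute applies to every query from every user, so it should reflect your worst-case concurrency, not your idle-system benchmark.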

So, our partitioning options (again, covered in detail in Chapter 8) are really quite simple:

  • Range-partitioned, heap-organized

  • List-partitioned, heap-organized

As always, the best choice will depend on your specific requirements, skill set, comfort level, and the nature of your data.