Logical Partitioning Design

The first thing you must decide is how you'll want to partition your fact tables. In 90% of the cases I've seen, the DBA simply chooses to implement a sliding window of data over time, with individual partitions each representing a reasonable time slice of the data over that time (as shown in Figure 8-2). We'll refer to this as simple partitioning.

Figure 8-2. Simple Range Partitioning Based Upon Time

graphics/08fig02.jpg

Note that this is not a technical design decision, but really much more of a business requirement. What I mean is that your end-users will tell you to keep N time periods of data online and how often to load or update it. The only technical issue is how to best support archiving data as it logically falls off the trailing edge.

If you were not partitioned by time, then you'd have to use a very slow DML command such as DELETE FROM fact WHERE period_id < cut off value to remove old records. Not only would that take an extremely long time, but it also could unbalance b-tree indexes and fragment tablespaces. A much better solution is to simply drop off the expired partition using the ALTER TABLE fact DROP PARTITION x command.

As an example of end-user input, the business requirements may mandate keeping 60 months (i.e., 5 years) of data, which is updated nightly. So if the DBA chooses to partition by month, then at most, the database will have 61 partitions, each of which must hold approximately 30 days' worth of data. If the nightly cycle loads 20 million records, then a fact partition will hold approximately 600 million rows. If, on the other hand, the DBA chooses to partition by week, then at most, the database will have 261 partitions, each of which will hold 7 days' worth of data, or about 140 million rows.

Which is better? It all depends, but in general, the more granular the partition the better. The reason is that a few hundred million rows is a reasonable figure for DDL operations such as creating an index, and smaller, more granular partitions should not exceed this size. Furthermore, the logic to handle weekly partitions is much simpler since all weeks have exactly the same number of days. Monthly partitions have a huge drawback in that the logic to handle them must handle the various months' day counts, including the often-forgotten leap years, where February has 29 versus the normal 28 days. It's much better to keep partition management logic simple and partition size small (relatively speaking, of course).

There are two other less obvious reasons why smaller partitions work better. First, for queries and DML that must access a partition, smaller partitions can lead to a higher degree of parallel operations, especially with Oracle 8i, which generally does not support parallel operations within a partition. So, having more partitions is a manual method for forcing potentially higher degrees of parallel operations. However, this situation no longer exists with Oracle 9i. Thus, I would not recommend adopting an approach that has already been addressed, especially since 9i is in its second release already.

A second and less important reason for more, smaller partitions is that they more naturally support logical aggregation options. For example, week partitions very naturally and easily summarize into weeks, months, quarters, and years. But larger partitions such as months do not. While this is not a show-stopper in terms of whether or not you should do it, I would recommend considering this within the context of all the other issues you may need to balance. In other words, it might be useful as a tie-breaker.

As I said, 90% of DBAs just do the above. However, I've seen about 10% take it a step further. They partition first by time and then sub-partition along some other criteria, either dimensionally based or specially designed for their needs. Essentially, the idea is to subdivide time partitions into additional and even smaller sub-partitions (as shown in Figure 8-3). We'll refer to this as complex partitioning

Figure 8-3. Composite Partitioning Based Upon Time and Something Else

graphics/08fig03.jpg

One of the chief reasons DBAs do this is to achieve parallel joins between facts and dimensions, where the dimension is partitioned along the same criteria as the fact. Oracle refers to this as partition-wise joins, and they are very useful for parallel server environments. The problem is that very few of the DBAs I've seen implementing this feature have parallel servers. Again, they most often do it for theoretically improving query runtimes. And as before, while it may speed up queries somewhat, it's only the star transformation that really matters. This would just be another "icing on the cake" improvement.

Another reason I see DBAs doing sub-partitioning is the belief that they can achieve sub-partition elimination, and thus again speed up queries. The problem is that more often than not, this may actually make things worse. Let me explain. Oracle 8i offers range partitions that can be sub-partitioned by hashing, whereas Oracle 9i now offers range partitions that can be sub-partitioned by lists or hashing. Either way, you must fully understand the underlying nature of your data to make effective use of these features. The sub-partitioning scheme must conform to the nature of your data, or you may actually makes things much worse.

For example, I tried hashed sub-partitions with 7-Eleven's data warehouse. I partitioned by range on my period identifier, making each partition contain a week's worth of data. Then I hash-sub-partitioned on my product identifier. The idea was that similar products would hash into the same sub-partitions, thus queries on classes of products (such as beer) would only reference those sub-partitions. Sounds good, right? But, I had overlooked the nature of my data. The product identifiers were surrogate or meaningless keys and thus evenly distributed across the entire product identifier domain. Therefore, hashing merely spread my data equally across all the sub-partitions and required an extra level of sub-partition operations to obtain the exact same data. So, my queries took twice as long. Needless to say, I went back to just range partitioning without sub-partitions.