Cost-Based Optimizer

Star transformation requires the cost-based optimizer. That's all there is to it. Fortunately, the statistics collection method required is also quite simple. For both the dimensions and facts, simply run the following:

[View full width]
ANALYZE TABLE table_name ESTIMATE STATISTICS FOR TABLE FOR ALL INDEXES FOR ALL INDEXED 
COLUMNS SAMPLE 20,000 ROWS

We run this command to accomplish two goals. First, we must provide the optimizer enough information so that it knows that star transformation is valid. Second, we must provide the optimizer enough information so that it knows absolutely never to utilize any b-tree indexes. You could even add histograms to your analysis via the SIZE parameter, but this would only assist with the latter goal (as bitmap indexes do not use histograms). Of course, you could gather statistics using the DBMS_STATS package instead of the ANALYZE command (often preferable since DBMS_STATS offers parallel analysis capabilities); but, the end results would be the same.

I've generally only received three questions regarding statistics:

Q: Why sample only 20,000 rows and not 5% or a million rows?

I've found that 20,000 rows will yield the same results (generally) as taking much larger samples. And, a 20,000-row sample will analyze rather quickly. Forget using percentages. For example, one percent of a billion rows is 10 million rows, and a serial analyze that size will run quite a long time. Plus, it will not yield any better results. So why analyze more than you need to? You can get star transformation with just 20,000 rows.

Q: Is it better to analyze by partition?

Yes. I would merely adjust the above command and do it for each partition. The reason is simple: If you analyze at the partition level, the optimizer should do a much better job of identifying correct partition eliminations in your star transformation explain plans. That's worth the additional cost to gather the statistics.

Q: How often should statistics be gathered?

That's easy: every time you load data. If you load data weekly, then gather statistics once per week. If daily, then do it daily. By keeping the sample size small, even multi-terabyte data warehouses can re-analyze all their tables in a very brief time.