Table Statistics

You've seen all the operators that PostgreSQL can use to execute a query. Remember that the goal of the optimizer is to find the plan with the least overall expense. Each operator uses a different algorithm for estimating its cost of execution. The cost estimators need some basic statistical information to make educated estimates.

Table statistics are stored in two places in a PostgreSQL database: pg_class and pg_statistic.

The pg_class system table contains one row for each table defined in your database (it also contains information about views, indexes, and sequences). For any given table, the pg_class.relpages column contains an estimate of the number of 8KB pages required to hold the table. The pg_class.reltuples column contains an estimate of the number of tuples currently contained in each table.

Note that pg_class holds only estimates?when you create a new table, the relpages estimate is set to 10 pages and reltuples is set to 1000 tuples. As you INSERT and DELETE rows, PostgreSQL does not maintain the pg_class estimates. You can see this here:


movies=# SELECT * FROM tapes;

 tape_id  |     title     | dist_id

----------+---------------+---------

 AB-12345 | The Godfather |       1

 AB-67472 | The Godfather |       1

 MC-68873 | Casablanca    |       3

 OW-41221 | Citizen Kane  |       2

 AH-54706 | Rear Window   |       3

(5 rows)



movies=# CREATE TABLE tapes2 AS SELECT * FROM tapes;

SELECT

movies=# SELECT reltuples, relpages FROM pg_class

movies-#   WHERE relname = 'tapes2';

 reltuples | relpages

-----------+----------

      1000 |       10

Create the tapes2 table by duplicating the tapes table. You know that tapes2 really holds five tuples (and probably requires a single disk page), but PostgreSQL has not updated the initial default estimate.

There are three commands that you can use to update the pg_class estimates: VACUUM, ANALYZE, and CREATE INDEX.

The VACUUM command removes any dead tuples from a table and recomputes the pg_class statistical information:


movies=# VACUUM tapes2;

VACUUM

movies=# SELECT reltuples, relpages FROM pg_class WHERE relname = 'tapes2';

 reltuples | relpages

-----------+----------

         5 |        1

(1 row)

The pg_statistic system table holds detailed information about the data in a table. Like pg_class, pg_statistic is not automatically maintained when you INSERT and DELETE data. The pg_statistic table is not updated by the VACUUM or CREATE INDEX command, but it is updated by the ANALYZE command:


movies=# SELECT staattnum, stawidth, stanullfrace FROM pg_statistic

movies-#   WHERE starelid =

movies-#     (

movies(#       SELECT oid FROM pg_class WHERE relname = 'tapes2'

movies(#     );

 staattnum | stawidth | stanullfrac

-----------+----------+-------------

 (0 rows)



movies=# ANALYZE tapes;

ANALYZE



movies=# SELECT staattnum, stawidth, stanullfrace FROM pg_statistic

movies-#   WHERE starelid =

movies-#     (

movies(#       SELECT oid FROM pg_class WHERE relname = 'tapes2'

movies(#     );

 staattnum | stawidth | stanullfrac

-----------+----------+-------------

         1 |       12 |           0

         2 |       15 |           0

         3 |        4 |           0

(3 rows)

PostgreSQL defines a view (called pg_stats) that makes the pg_statistic table a little easier to deal with. Here is what the pg_stats view tells us about the tapes2 table:


movies=# SELECT attname, null_frac, avg_width, n_distinct FROM pg_stats

movies-#   WHERE tablename = 'tapes2';

 attname | null_frac | avg_width | n_distinct

---------+-----------+-----------+------------

 tape_id |         0 |        12 |         -1

 title   |         0 |        15 |       -0.8

 dist_id |         0 |         4 |       -0.6

(3 rows)

You can see that pg_stats (and the underlying pg_statistics table) contains one row for each column in the tapes2 table. The null_frac value tells you the percentage of rows where a given column contains NULL. In this case, there are no NULL values in the tapes2 table, so null_frac is set to 0 for each column. avg_width contains the average width (in bytes) of the values in a given column. The n_distinct value tells you how many distinct values are present for a given column. If n_distinct is positive, it indicates the actual number of distinct values. If n_distinct is negative, it indicates the percentage of rows that contain a distinct value. A value of ?1 tells you that every row in the table contains a unique value for that column.

pg_stats also contains information about the actual values in a table:


movies=# SELECT attname, most_common_vals, most_common_freqs

movies-#   FROM pg_stats

movies-#   WHERE tablename = 'tapes2';

 attname | most_common_vals  | most_common_freqs

---------+-------------------+-------------------

 tape_id |                   |

 title   | {"The Godfather"} | {0.4}

 dist_id | {1,3}             | {0.4,0.4}

(3 rows)

The most_common_vals column is an array containing the most common values in a given column. The most_common_freqs value tells you how often each of the most common values appear. By default, ANALYZE stores the 10 most common values (and the frequency of those 10 values). You can increase or decrease the number of common values using the ALTER TABLE ... SET STATISTICS command.

Another statistic exposed by pg_stat is called histogram_bounds:


movies=#  SELECT attname, histogram_bounds FROM pg_stats

movies-#   WHERE tablename = 'tapes2';

 attname |                histogram_bounds

---------+------------------------------------------------

 tape_id | {AB-12345,AB-67472,AH-54706,MC-68873,OW-41221}

 title   | {Casablanca,"Citizen Kane","Rear Window"}

 dist_id |

(3 rows)

The histogram_bounds column contains an array of values for each column in your table. These values are used to partition your data into approximately equally sized chunks.

The last statistic stored in pg_stats is an indication of whether the rows in a table are stored in column order:


movies=# SELECT attname, correlation FROM pg_stats

movies-#   WHERE tablename = 'tapes2';

 attname | correlation

---------+-------------

 tape_id |         0.7

 title   |        -0.5

 dist_id |         0.9

(3 rows)

A correlation of 1 means that the rows are sorted by the given column. In practice, you will see a correlation of 1 only for brand new tables (whose rows happened to be sorted before insertion) or tables that you have reordered using the CLUSTER command.



    Part II: Programming with PostgreSQL