Extended Statistics in Postgres

Using extended statistics helps improve performance for queries in postgres.

Page content

Something you probably already know about Postgres: The planner needs to know things about the data to come up with a good access plan. For example, it needs to decide whether to use an index or a sequential scan. It’s looking at things like row count or distinct values when it comes up with this plan.

The statistics it uses to do this are stored internally. One place is the pg_class table.

select relname, relpages, reltuples::bigint from pg_class where relname='census_blocks';
    relname    | relpages | reltuples
---------------+----------+-----------
 census_blocks |  1064938 |   7794920

Don’t get excited that this is a way to get the number of rows in a table without a long-running select count(*), because these statistics aren’t completely up-to-date. Especially for a table with a high rate of change, it can be outdated until the next vacuum or analyze table is run. But it’s a pretty good estimate for the planner to use.

In addition to the size of the table that is found by querying pg_class, statistics about the table values are found by querying pg_stats. For example, querying the “histogram bounds” field shows the bounding values for a column in each of 100 (by default) buckets.

select histogram_bounds from pg_stats where tablename='census_blocks' and attname='countyfp20';
                                                                                                                                                                    histogram_bounds
---------------------------------------------------------------------------------------------------------------------
 {006,010,020,030,078,102,150,180,185,185,186,187,187,187,191,191,193,193,193,199,199,199,203,203,207,207,209,209,213,213,215,215,215,217,219,221,225,225,229,231,233,235,241,245,245,249,251,257,265,273,277,287,291,297,303,303,307,309,313,323,329,331,339,339,347,355,355,361,363,367,375,381,387,395,401,409,419,423,427,441,445,451,457,463,467,477,479,481,485,491,491,499,505,550,595,683,710,760,770,810,840}

Something that might be new: Humans typically understand relationships within the data that the planner won’t know by looking at the simple row counts and ranges. Postgres gives us an option to tell the planner about these relationships by using the extended statistics feature.

I’ll use the HUD crosswalk data as an example. I’ve stored it in my postgres database with the following structure:

trade_area_db_2=# \d+ hud_crosswalk
                                         Table "public.hud_crosswalk"
   Column    |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
-------------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 zip         | text    |           | not null |         | extended |             |              |
 tract_geoid | text    |           | not null |         | extended |             |              |
 statefp     | text    |           | not null |         | extended |             |              |
 state       | text    |           |          |         | extended |             |              |
 city        | text    |           |          |         | extended |             |              |
 res_ratio   | numeric |           |          |         | main     |             |              |
 bus_ratio   | numeric |           |          |         | main     |             |              |
 oth_ratio   | numeric |           |          |         | main     |             |              |
 tot_ratio   | numeric |           |          |         | main     |             |              |
 year        | integer |           | not null |         | plain    |             |              |
 quarter     | integer |           | not null |         | plain    |             |              |
Indexes:
    "hud_crosswalk_pkey" PRIMARY KEY, btree (zip, tract_geoid, year, quarter)
    "hud_crosswalk_statefp_idx" btree (statefp)
    "hud_crosswalk_tract_idx" btree (tract_geoid)
    "hud_crosswalk_zip_idx" btree (zip)

Here is a sample row of data from this table:

zip         | 91605
tract_geoid | 06037122000
statefp     | 06
state       | CA
city        | NORTH HOLLYWOOD
res_ratio   | 0.08486121360049952
bus_ratio   | 0.29351883271123175
oth_ratio   | 0.06840796019900497
tot_ratio   | 0.11140176799567021
year        | 2025
quarter     | 4

Dependencies

The first option for setting extended statistics concerns dependencies: columns can be correlated (that is, not independent). If this is starting to sound like your high school stats class, you’re right! In the default case, you can think about the number of rows the table stats recorded for a given value as the probability of that value being in the table. So where zipcode='91605' is the same thing as P(zipcode=91605). You may remember that if you have two (independent) where statements, you multiply them to get the probability of both happening. If 10% of the rows meet the where zipcode='91605' and 10% of the rows meet the and state='CA' criteria, then the probably of both happening is 10% times 10%, or 1%.

But when ID1 and ID2 aren’t independent, then the real probability that where zipcode='91605' and and state='CA' are both true is greater than that 1%. If the planner doesn’t know what you know about dependence of the two columns, then it will underestimate the matching rows by an order of magnitude and potentially make a poor decision about how to access the data for the query.

For my dataset, I have some knowledge about my data that ZIP codes are correlated to states. I’ll create a statistic on these fields zip and state.

create statistics nestedloc (dependencies) on zip, state from hud_crosswalk ;

This just creates the placeholder for the statistic, and you need to run ANALYZE to populate the statistic.

analyze hud_crosswalk ;

Query the statistics data you created:

select stxname, stxkeys, stxddependencies
from pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
where stxname = 'nestedloc';

Result:

stxname          | nestedloc
stxkeys          | 1 4
stxddependencies | {"1 => 4": 1.000000}

(The stxkeys field shows the position of the column in the table definition, so for example, key 1 is zip.)

As expected the zip field is perfectly correlated with the state field. ZIP codes notoriously cross local geography boundaries, but not state boundaries.

As an example, there are 23 rows with zip='91605' in state='CA'. How well is the nestedloc statistic working for the planner? Let’s compare the number of estimated rows, both before:

explain select * from hud_crosswalk where state='CA' and zip='91605';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on hud_crosswalk  (cost=4.63..164.43 rows=4 width=82)
   Recheck Cond: (zip = '91605'::text)
   Filter: (state = 'CA'::text)
   ->  Bitmap Index Scan on hud_crosswalk_zip_idx  (cost=0.00..4.62 rows=44 width=0)
         Index Cond: (zip = '91605'::text)

and after:

explain select * from hud_crosswalk where state='CA' and zip='91605';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on hud_crosswalk  (cost=4.64..164.44 rows=44 width=82)
   Recheck Cond: (zip = '91605'::text)
   Filter: (state = 'CA'::text)
   ->  Bitmap Index Scan on hud_crosswalk_zip_idx  (cost=0.00..4.62 rows=44 width=0)
         Index Cond: (zip = '91605'::text)

Much closer, but you may have noticed that the explain shows estimated number of rows as 44 when there are only 23. One reason for this is that the analyze command that you ran after creating the statistic only samples the rows. The goal is to give the planner a more directionally true estimate, and in this case 44 is a better estimate than 4.

For more details on the math involved, get the recorded frequencies of each field value in the pg_stats view:

select most_common_vals, most_common_freqs
from pg_stats
where tablename = 'hud_crosswalk' and (attname='zip' or attname='state');

The results show that state='CA' has a frequency of 0.0837, zip='91605' has a frequency of 0.00023333334, and there are 189,361 total rows. These frequencies won’t be completely accurate, because they are based on the last time the analyze was run, and that analyze job is using a sampling process.

Without the dependencies statistic, we will multiply the frequencies of zip and state, so the estimated rows are 0.0837 x 0.00023333334 x 189,361 = 3.69 (4 estimated rows above).

With the dependencies statistic, estimated rows consider only zip frequency, or 0.00023333334 x 189,361 = 44.184 (44 estimated rows above).

Most Common Value

The second extended statistic that can be defined is the most common value or mcv. The most common value information is stored in the stats table for individual columns. Defining the mcv for two or more columns helps the planner choose the best plan based on the frequency of that combination when those columns are used together in the where clause.

This statistic makes the most sense when there is low cardinality of the columns in the definition, because the mcv list is by default limited to 100. The table used for this example is below:

                                      Table "public.products"
   Column    |         Type          | Collation | Nullable |               Default
-------------+-----------------------+-----------+----------+--------------------------------------
 id          | integer               |           | not null | nextval('products_id_seq'::regclass)
 category    | character varying(30) |           |          |
 subcategory | character varying(30) |           |          |
Indexes:
    "products_pkey" PRIMARY KEY, btree (id)

Here are the most common values and corresponding frequencies for the fields individually:

select most_common_vals, most_common_freqs from pg_stats where tablename = 'products' and attname='category';
               most_common_vals               |                  most_common_freqs
----------------------------------------------+------------------------------------------------------
 {electronics,clothing,furniture,sports,food} | {0.26746666,0.19946666,0.19273333,0.1723,0.16803333}

select most_common_vals, most_common_freqs from pg_stats where tablename = 'products' and attname='subcategory';

 {phones,sofas,fitness,shirts,laptops,snacks,pants,beverages,beds,cycling,shoes,team_sports,pantry,jackets,desks,chairs,tablets,running,dairy,dresses,headphones,tables,sportswear,produce,outdoor_sports,shelving,cameras,smartwatches,frozen,accessories,water_sports,wardrobes,golf,gaming,bakery,outdoor,underwear,winter_sports,specialty} | {0.14863333,0.058333334,0.050566666,0.049466666,0.045166668,0.044233333,0.04083333,0.039766666,0.039566666,0.034533333,0.033866666,0.0307,0.028733334,0.028033333,0.027866667,0.0242,0.023633333,0.022033334,0.021566667,0.0194,0.017633334,0.0175,0.0162,0.015533334,0.0148,0.0134333335,0.011966666,0.011266666,0.0107,0.010466667,0.0099,0.0077,0.0068,0.006666667,0.0046666665,0.0041333335,0.0037,0.0029666666,0.0028333333}

By creating an extended statistic of the mcv type, we can improve the row estimates for the planner. Before adding the statistic, for a result that has 2,747 rows:

explain select * from products where category='furniture' and subcategory='desks';
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Seq Scan on products  (cost=0.00..2132.00 rows=537 width=19)
   Filter: (((category)::text = 'furniture'::text) AND ((subcategory)::text = 'desks'::text))

Adding the statistic:

create statistics productmcv (mcv) on category, subcategory from products ;
analyze products;

After:

explain select * from products where category='furniture' and subcategory='desks';
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Seq Scan on products  (cost=0.00..2132.00 rows=2720 width=20)
   Filter: (((category)::text = 'furniture'::text) AND ((subcategory)::text = 'desks'::text))

Much closer estimate for our planner to use. And the math is calculated similarly to the first example.

NDistinct

The third type of statistic you can tell the planner about is ndistinct. This stat is most helpful when your query is grouped by more than one column. Just like with mcv, this value is stored for individual columns in the stats table. Storing the number of distinct values in combinations of columns will allow the planner to pick the better plan when grouping by those columns.

The actual count of the grouped rows is 3,011.

The estimate before is the multiplied counts of distinct state and distinct county so it is larger than the real value:

explain select statefp20, countyfp20, count(*) from census_blocks group by statefp20, countyfp20;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1124034.29..1128386.96 rows=16848 width=15)
   Group Key: statefp20, countyfp20
   ->  Gather Merge  (cost=1124034.29..1127965.76 rows=33696 width=15)
         Workers Planned: 2
         ->  Sort  (cost=1123034.26..1123076.38 rows=16848 width=15)
               Sort Key: statefp20, countyfp20
               ->  Partial HashAggregate  (cost=1121683.03..1121851.51 rows=16848 width=15)
                     Group Key: statefp20, countyfp20
                     ->  Parallel Seq Scan on census_blocks  (cost=0.00..1097363.73 rows=3242573 width=7)

Create the stat:

create statistics blocksnd (ndistinct) on statefp20, countyfp20 from census_blocks ;
analyze census_blocks ;

Afterwards, the estimated rows are much closer. Instead of multiplying sampled n_distinct values, the estimate is based on the distinct count captured during the last analyze sample. (This one is not based on frequencies.)

explain select statefp20, countyfp20, count(*) from census_blocks group by statefp20, countyfp20;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1122972.20..1123722.97 rows=2906 width=15)
   Group Key: statefp20, countyfp20
   ->  Gather Merge  (cost=1122972.20..1123650.32 rows=5812 width=15)
         Workers Planned: 2
         ->  Sort  (cost=1121972.18..1121979.45 rows=2906 width=15)
               Sort Key: statefp20, countyfp20
               ->  Partial HashAggregate  (cost=1121775.96..1121805.02 rows=2906 width=15)
                     Group Key: statefp20, countyfp20
                     ->  Parallel Seq Scan on census_blocks  (cost=0.00..1097416.83 rows=3247883 width=7)

Summary

The planner in Postgres uses internal stats tables to determine the best query paths. By setting appropriate extended statistics, you can give the planner more information to do this based on what you know about your data.