<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>69.2. Multivariate Statistics Examples</title><link rel="stylesheet" type="text/css" href="stylesheet.css" /><link rev="made" href="[email protected]" /><meta name="generator" content="DocBook XSL Stylesheets Vsnapshot" /><link rel="prev" href="row-estimation-examples.html" title="69.1. Row Estimation Examples" /><link rel="next" href="planner-stats-security.html" title="69.3. Planner Statistics and Security" /></head><body><div xmlns="http://www.w3.org/TR/xhtml1/transitional" class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="5" align="center">69.2. Multivariate Statistics Examples</th></tr><tr><td width="10%" align="left"><a accesskey="p" href="row-estimation-examples.html" title="69.1. Row Estimation Examples">Prev</a> </td><td width="10%" align="left"><a accesskey="u" href="planner-stats-details.html" title="Chapter 69. How the Planner Uses Statistics">Up</a></td><th width="60%" align="center">Chapter 69. How the Planner Uses Statistics</th><td width="10%" align="right"><a accesskey="h" href="index.html" title="PostgreSQL 10.23 Documentation">Home</a></td><td width="10%" align="right"> <a accesskey="n" href="planner-stats-security.html" title="69.3. Planner Statistics and Security">Next</a></td></tr></table><hr></hr></div><div class="sect1" id="MULTIVARIATE-STATISTICS-EXAMPLES"><div class="titlepage"><div><div><h2 class="title" style="clear: both">69.2. Multivariate Statistics Examples</h2></div></div></div><div class="toc"><dl class="toc"><dt><span class="sect2"><a href="multivariate-statistics-examples.html#id-1.10.22.5.3">69.2.1. Functional Dependencies</a></span></dt><dt><span class="sect2"><a href="multivariate-statistics-examples.html#id-1.10.22.5.4">69.2.2. 
Multivariate N-Distinct Counts</a></span></dt></dl></div><a id="id-1.10.22.5.2" class="indexterm"></a><div class="sect2" id="id-1.10.22.5.3"><div class="titlepage"><div><div><h3 class="title">69.2.1. Functional Dependencies</h3></div></div></div><p> Multivariate correlation can be demonstrated with a very simple data set
 consisting of a table with two columns, both containing the same values:
</p><pre class="programlisting">CREATE TABLE t (a INT, b INT);
INSERT INTO t SELECT i % 100, i % 100 FROM generate_series(1, 10000) s(i);
ANALYZE t;</pre><p>
As explained in <a class="xref" href="planner-stats.html" title="14.2. Statistics Used by the Planner">Section 14.2</a>, the planner can determine
 the cardinality of <code class="structname">t</code> using the number of pages and
rows obtained from <code class="structname">pg_class</code>:
</p><pre class="programlisting">SELECT relpages, reltuples FROM pg_class WHERE relname = 't';
 relpages | reltuples
----------+-----------
       45 |     10000</pre><p>
The data distribution is very simple; there are only 100 distinct values
in each column, uniformly distributed.
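</p><p> As a quick sanity check, the number of distinct values can be counted
 directly. This is only an illustrative query against the table
 <code class="structname">t</code> created above, with made-up column
 aliases; it is not something the planner itself runs, and it reads the
 whole table, which is harmless at this size:
</p><pre class="programlisting">-- Count the distinct values actually present in each column of t.
-- Given the INSERT above (i % 100), both counts should be 100.
SELECT count(DISTINCT a) AS distinct_a,
       count(DISTINCT b) AS distinct_b
FROM t;</pre><p>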
</p><p> The following example shows the result of estimating a <code class="literal">WHERE</code>
condition on the <code class="structfield">a</code> column:
</p><pre class="programlisting">EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..170.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: (a = 1)
   Rows Removed by Filter: 9900</pre><p>
The planner examines the condition and determines the selectivity
of this clause to be 1%. By comparing this estimate and the actual
number of rows, we see that the estimate is very accurate
 (in fact exact, as the table is very small). If the
 <code class="literal">WHERE</code> condition is changed to use the <code class="structfield">b</code> column instead, an
 identical plan is generated. But observe what happens if we apply the same
condition on both columns, combining them with <code class="literal">AND</code>:
</p><pre class="programlisting">EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=1 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900</pre><p>
The planner estimates the selectivity for each condition individually,
arriving at the same 1% estimates as above. Then it assumes that the
conditions are independent, and so it multiplies their selectivities,
producing a final selectivity estimate of just 0.01%.
This is a significant underestimate, as the actual number of rows
matching the conditions (100) is two orders of magnitude higher.
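</p><p> The arithmetic behind this estimate can be reproduced by hand. The
 following sketch hard-codes the 1% per-clause selectivity and the 10000-row
 table size taken from the examples above (the column aliases are
 illustrative only) and places the result next to the actual row count:
</p><pre class="programlisting">-- Independence assumption: multiply the per-clause selectivities,
-- then scale by the number of rows in the table.
SELECT 10000 * 0.01 * 0.01 AS rows_assuming_independence,
       (SELECT count(*) FROM t WHERE a = 1 AND b = 1) AS actual_rows;</pre><p>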
</p><p> This problem can be fixed by creating a statistics object that
directs <code class="command">ANALYZE</code> to calculate functional-dependency
multivariate statistics on the two columns:
</p><pre class="programlisting">CREATE STATISTICS stts (dependencies) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 1;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..195.00 rows=100 width=8) (actual rows=100 loops=1)
   Filter: ((a = 1) AND (b = 1))
   Rows Removed by Filter: 9900</pre><p>
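 The estimate now matches the actual row count, because the dependency
 statistics tell the planner that the two clauses are not independent.
 The dependencies computed by <code class="command">ANALYZE</code> can be
 inspected in the <code class="structname">pg_statistic_ext</code> catalog.
 The query below is a minimal sketch that assumes the catalog layout of the
 release documented here (PostgreSQL 10); in release 12 and later the
 computed values were moved to a separate catalog,
 <code class="structname">pg_statistic_ext_data</code>:
</p><pre class="programlisting">-- Show the columns covered by the statistics object and the
-- functional dependencies computed by the last ANALYZE.
SELECT stxname, stxkeys, stxdependencies
FROM pg_statistic_ext
WHERE stxname = 'stts';</pre><p>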
</p></div><div class="sect2" id="id-1.10.22.5.4"><div class="titlepage"><div><div><h3 class="title">69.2.2. Multivariate N-Distinct Counts</h3></div></div></div><p> A similar problem occurs with estimation of the cardinality of sets of
multiple columns, such as the number of groups that would be generated by
a <code class="command">GROUP BY</code> clause. When <code class="command">GROUP BY</code>
lists a single column, the n-distinct estimate (which is visible as the
estimated number of rows returned by the HashAggregate node) is very
accurate:
</p><pre class="programlisting">EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 HashAggregate  (cost=195.00..196.00 rows=100 width=12) (actual rows=100 loops=1)
   Group Key: a
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=4) (actual rows=10000 loops=1)</pre><p>
But without multivariate statistics, the estimate for the number of
groups in a query with two columns in <code class="command">GROUP BY</code>, as
in the following example, is off by an order of magnitude:
</p><pre class="programlisting">EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 HashAggregate  (cost=220.00..230.00 rows=1000 width=16) (actual rows=100 loops=1)
   Group Key: a, b
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)</pre><p>
 After the statistics object is redefined to include n-distinct counts for
 the two columns, the estimate is much improved:
</p><pre class="programlisting">DROP STATISTICS stts;
CREATE STATISTICS stts (dependencies, ndistinct) ON a, b FROM t;
ANALYZE t;
EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t GROUP BY a, b;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 HashAggregate  (cost=220.00..221.00 rows=100 width=16) (actual rows=100 loops=1)
   Group Key: a, b
   ->  Seq Scan on t  (cost=0.00..145.00 rows=10000 width=8) (actual rows=10000 loops=1)</pre><p>
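 The stored n-distinct counts can be examined in the same way as other
 extended statistics. As a hedged sketch, again assuming the PostgreSQL 10
 layout of <code class="structname">pg_statistic_ext</code> (later releases
 keep the computed values in
 <code class="structname">pg_statistic_ext_data</code>), the following
 query shows what <code class="command">ANALYZE</code> recorded for the
 column group:
</p><pre class="programlisting">-- Show the n-distinct counts computed for the (a, b) column group
-- by the last ANALYZE.
SELECT stxname, stxkeys, stxndistinct
FROM pg_statistic_ext
WHERE stxname = 'stts';</pre><p>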
 </p></div></div></body></html>