PostgreSQL利用递归优化求稀疏列唯一值

在数据库中经常会碰到一些表的列是稀疏列，只有很少的值，例如性别字段，一般就只有2种不同的值。
但是当我们求这些稀疏列的唯一值时，如果表的数据量很大，速度还是会很慢。

例如：
创建测试表

bill=# create table t_sex (sex char(1), otherinfo text);
CREATE TABLE
bill=# insert into t_sex select 'm', generate_series(1,10000000)||'this is test';
INSERT 0 10000000
bill=# insert into t_sex select 'w', generate_series(1,10000000)||'this is test';
INSERT 0 10000000

查询：
可以看到下面的查询速度很慢。

bill=# select count(distinct sex) from t_sex;
 count
-------
     2
(1 row)

Time: 8803.505 ms (00:08.804)
bill=# select sex from t_sex t group by sex;
 sex
-----
 m
 w
(2 rows)

Time: 1026.464 ms (00:01.026)

那么我们对该字段加上索引又是什么情况呢？

bill=# create index idx_sex_1 on t_sex(sex);
CREATE INDEX
bill=#  select count(distinct sex) from t_sex;
 count
-------
     2
(1 row)

Time: 8502.460 ms (00:08.502)
bill=# select sex from t_sex t group by sex;
 sex
-----
 m
 w
(2 rows)

Time: 572.353 ms

速度依然没有明显的变化，可以看到执行计划已经使用Index Only Scan了。

bill=# explain  select count(distinct sex) from t_sex;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Aggregate  (cost=371996.44..371996.45 rows=1 width=8)
   ->  Index Only Scan using idx_sex_1 on t_sex  (cost=0.44..321996.44 rows=20000000 width=2)
(2 rows)

同样的SQL我们看看在Oracle中性能如何？

创建测试表：

SQL> create table t_sex (sex char(1), otherinfo varchar2(100));

Table created.

SQL> insert into t_sex select 'm', rownum||'this is test' from dual connect by level <=10000000;

10000000 rows created.

SQL> commit;

Commit complete.

SQL> insert into t_sex select 'w', rownum||'this is test' from dual connect by level <=10000000;

10000000 rows created.

SQL> commit;

Commit complete.

性能测试：

SQL> set lines 1000 pages 2000
SQL> set autotrace on
SQL> set timing on

SQL> select count(distinct sex) from t_sex;

COUNT(DISTINCTSEX)
------------------
                 2

Elapsed: 00:00:01.58

Execution Plan
----------------------------------------------------------
Plan hash value: 3915432945

----------------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |     1 |     3 | 20132   (1)| 00:00:01 |
|   1 |  SORT GROUP BY     |       |     1 |     3 |            |          |
|   2 |   TABLE ACCESS FULL| T_SEX |    14M|    42M| 20132   (1)| 00:00:01 |
----------------------------------------------------------------------------

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      74074  consistent gets
          0  physical reads
          0  redo size
        552  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL> select sex from t_sex t group by sex;

SE
--
m
w

Elapsed: 00:00:01.08

Execution Plan
----------------------------------------------------------
Plan hash value: 3915432945

----------------------------------------------------------------------------
| Id  | Operation          | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |       |    14M|    42M| 20558   (3)| 00:00:01 |
|   1 |  SORT GROUP BY     |       |    14M|    42M| 20558   (3)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T_SEX |    14M|    42M| 20132   (1)| 00:00:01 |
----------------------------------------------------------------------------

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      74074  consistent gets
          0  physical reads
          0  redo size
        589  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          2  rows processed

可以看到Oracle的性能即使不加索引也明显比PostgreSQL中要好。
那么我们在PostgreSQL中是不是没办法继续优化了呢？这种情况我们利用pg中的递归语句结合索引可以大幅提升性能。

SQL改写：

bill=# with recursive tmp as (
bill(#   (
bill(#     select min(t.sex) as sex from t_sex t where t.sex is not null
bill(#   )
bill(#   union all
bill(#   (
bill(#     select (select min(t.sex) from t_sex t where t.sex > s.sex and t.sex is not null)
bill(#       from tmp s where s.sex is not null
bill(#   )
bill(# )
bill-# select count(distinct sex) from tmp;
 count
-------
     2
(1 row)

Time: 2.711 ms

查看执行计划：

bill=# explain with recursive tmp as (
bill(#   (
bill(#     select min(t.sex) as sex from t_sex t where t.sex is not null
bill(#   )
bill(#   union all
bill(#   (
bill(#     select (select min(t.sex) from t_sex t where t.sex > s.sex and t.sex is not null)
bill(#       from tmp s where s.sex is not null
bill(#   )
bill(# )
bill-# select count(distinct sex) from tmp;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=53.62..53.63 rows=1 width=8)
   CTE tmp
     ->  Recursive Union  (cost=0.46..51.35 rows=101 width=32)
           ->  Result  (cost=0.46..0.47 rows=1 width=32)
                 InitPlan 3 (returns $1)
                   ->  Limit  (cost=0.44..0.46 rows=1 width=2)
                         ->  Index Only Scan using idx_sex_1 on t_sex t  (cost=0.44..371996.44 rows=20000000 width=2)
                               Index Cond: (sex IS NOT NULL)
           ->  WorkTable Scan on tmp s  (cost=0.00..4.89 rows=10 width=32)
                 Filter: (sex IS NOT NULL)
   ->  CTE Scan on tmp  (cost=0.00..2.02 rows=101 width=32)
(11 rows)

Time: 1.371 ms

可以看到执行时间从原先的8000ms降低到了2ms，提升了几千倍！

甚至对比Oracle，性能也是提升了很多。

但是需要注意的是：这种写法仅仅是针对稀疏列，换成数据分布广泛的字段，显然性能是下降的, 所以使用递归SQL不适合数据分布广泛的字段的group by或者count(distinct)操作。

PostgreSQL利用递归优化求稀疏列唯一值

猜你喜欢