全文检索 (不包含、不等于) 索引优化 - 阿里云RDS PostgreSQL最佳实践

    xiaoxiao2022-07-08  224

    背景

    PostgreSQL内置了GIN索引,支持全文检索,支持数组检索等多值数据类型的检索。

    在全文检索中,不包含某个关键字能用到索引吗?

    实际上GIN是倒排索引,不包含某个关键字的查询,实际上是跳过主tree上面的TOKEN的扫描。

    只要被跳过的TOKEN包含了大量数据,那么就是划算的。PostgreSQL是基于CBO的执行计划优化器,所以会自动选择最优的索引。

    例子1,全文检索不包含查询

    1、创建测试表

    postgres=# create table notcontain (id int, info tsvector); CREATE TABLE

    2、创建生成随机字符串的函数

    CREATE OR REPLACE FUNCTION gen_rand_str(integer) RETURNS text LANGUAGE sql STRICT AS $function$ select string_agg(a[(random()*6)::int+1],'') from generate_series(1,$1), (select array['a','b','c','d','e','f',' ']) t(a); $function$;

    3、插入100万测试数据

    postgres=# insert into notcontain select generate_series(1,1000000), to_tsvector(gen_rand_str(256));

    4、创建全文索引(GIN索引)

    create index idx_notcontain_info on notcontain using gin (info);

    5、查询某一条记录

    postgres=# select * from notcontain limit 1; -[ RECORD 1 ]---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- id | 1 info | 'afbbeeccbf':3 'b':16 'bdcdfd':2 'bdcfbcecdeeaed':8 'bfedfecbfab':7 'cd':9 'cdcaefaccdccadeafadededddcbdecdaefbcfbdaefcec':14 'ceafecff':6 'd':17,18 'dbc':12 'dceabcdcbdca':10 'dddfdbffffeaca':13 'deafcccfbcdebdaecda':11 'dfbadcdebdedbfa':19 'eb':15 'ebe':1 'febdcbdaeaeabbdadacabdbbedfafcaeabbdcedaeca':5 'fedeecbcdfcdceabbabbfcdd':4

    6、测试不包含某个关键字

    数据库自动选择了全表扫描,没有使用GIN索引。

    为什么没有使用索引呢,我前面解释了,因为这个关键字的数据记录太少了,不包含它时使用索引过滤不划算。

    (当包含它时,使用GIN索引就非常划算。包含和不包含是相反的过程,成本也是反的)

    select * from notcontain t1 where info @@ to_tsquery ('!eb'); postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!eb'); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Seq Scan on postgres.notcontain t1 (cost=0.00..318054.51 rows=950820 width=412) (actual time=0.016..1087.463 rows=947911 loops=1) Output: id, info Filter: (t1.info @@ to_tsquery('!eb'::text)) Rows Removed by Filter: 52089 Buffers: shared hit=55549 Planning time: 0.131 ms Execution time: 1134.571 ms (7 rows)

    7、强制关闭全表扫描,让数据库选择索引。

    可以看到,使用索引确实是慢的,我们大多数时候应该相信数据库的成本规划是准确的。(只要成本因子和环境性能匹配足够的准,这些都是可以校准的,有兴趣的同学可以参考我写的因子校准方法。)

    postgres=# set enable_seqscan=off; SET postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!eb'); QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on postgres.notcontain t1 (cost=82294981.00..82600120.25 rows=950820 width=412) (actual time=1325.587..1540.145 rows=947911 loops=1) Output: id, info Recheck Cond: (t1.info @@ to_tsquery('!eb'::text)) Heap Blocks: exact=55549 Buffers: shared hit=171948 -> Bitmap Index Scan on idx_notcontain_info (cost=0.00..82294743.30 rows=950820 width=0) (actual time=1315.663..1315.663 rows=947911 loops=1) Index Cond: (t1.info @@ to_tsquery('!eb'::text)) Buffers: shared hit=116399 Planning time: 0.135 ms Execution time: 1584.670 ms (10 rows)

    例子2,全文检索不包含查询

    这个例子造一份倾斜数据,这个TOKEN包含了大量的重复记录,通过不包含过滤它。看看能否使用索引。

    1、生成测试数据

    postgres=# truncate notcontain ; TRUNCATE TABLE postgres=# insert into notcontain select generate_series(1,1000000), to_tsvector('abc'); INSERT 0 1000000

    2、测试不包含ABC的检索

    数据库自动选择了索引扫描,跳过了不需要检索的数据块。

    postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!abc'); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on postgres.notcontain t1 (cost=220432.15..220433.71 rows=1 width=21) (actual time=107.936..107.936 rows=0 loops=1) Output: id, info Recheck Cond: (t1.info @@ to_tsquery('!abc'::text)) Buffers: shared hit=268 -> Bitmap Index Scan on idx_notcontain_info (cost=0.00..220432.15 rows=1 width=0) (actual time=107.933..107.933 rows=0 loops=1) Index Cond: (t1.info @@ to_tsquery('!abc'::text)) Buffers: shared hit=268 Planning time: 0.183 ms Execution time: 107.962 ms (9 rows)

    3、强制使用全表扫描,发现性能确实不如索引扫描,也验证了我们说的PostgreSQL是基于成本的优化器,自动选择最优的执行计划。

    postgres=# set enable_bitmapscan =off; SET postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where info @@ to_tsquery ('!abc'); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Seq Scan on postgres.notcontain t1 (cost=0.00..268870.00 rows=1 width=21) (actual time=1065.436..1065.436 rows=0 loops=1) Output: id, info Filter: (t1.info @@ to_tsquery('!abc'::text)) Rows Removed by Filter: 1000000 Buffers: shared hit=6370 Planning time: 0.059 ms Execution time: 1065.449 ms (7 rows)

    例子3,普通类型BTREE索引,不等于检索

    这个例子是普通类型,使用BTREE索引,看看是否支持不等于的索引检索。

    测试方法与GIN测试类似,使用倾斜和非倾斜两种测试数据。

    1、非倾斜数据的不包含查询,使用索引过滤的记录非常少。

    目前内核层面没有实现BTREE索引的不包含检索。(虽然技术上是可以通过INDEX SKIP SCAN来实现的,跳过不需要扫描的BRANCH节点)

    postgres=# truncate notcontain ; TRUNCATE TABLE postgres=# insert into notcontain select generate_series(1,1000000); INSERT 0 1000000 postgres=# create index idx1 on notcontain (id); CREATE INDEX postgres=# set enable_bitmapscan =on; SET postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where id<>1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Seq Scan on postgres.notcontain t1 (cost=0.00..16925.00 rows=999999 width=36) (actual time=0.011..110.592 rows=999999 loops=1) Output: id, info Filter: (t1.id <> 1) Rows Removed by Filter: 1 Buffers: shared hit=4425 Planning time: 0.195 ms Execution time: 156.013 ms (7 rows) postgres=# set enable_seqscan=off; SET postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where id<>1; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on postgres.notcontain t1 (cost=10000000000.00..10000016925.00 rows=999999 width=36) (actual time=0.011..110.964 rows=999999 loops=1) Output: id, info Filter: (t1.id <> 1) Rows Removed by Filter: 1 Buffers: shared hit=4425 Planning time: 0.062 ms Execution time: 156.461 ms (7 rows)

    2、更换SQL写法,可以实现索引检索。但实际上由于不是使用的INDEX SKIP SCAN,所以需要一个JOIN过程,实际上效果并不佳。

    postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ Merge Anti Join (cost=0.85..25497.28 rows=999999 width=36) (actual time=0.023..277.639 rows=999999 loops=1) Output: t1.id, t1.info Merge Cond: (t1.id = t2.id) Buffers: shared hit=7164 -> Index Scan using idx1 on postgres.notcontain t1 (cost=0.42..22994.22 rows=1000000 width=36) (actual time=0.009..148.520 rows=1000000 loops=1) Output: t1.id, t1.info Buffers: shared hit=7160 -> Index Only Scan using idx1 on postgres.notcontain t2 (cost=0.42..3.04 rows=1 width=4) (actual time=0.007..0.008 rows=1 loops=1) Output: t2.id Index Cond: (t2.id = 1) Heap Fetches: 1 Buffers: shared hit=4 Planning time: 0.223 ms Execution time: 322.798 ms (14 rows) postgres=# set enable_mergejoin=off; SET postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Hash Anti Join (cost=3.05..27053.05 rows=999999 width=36) (actual time=0.060..251.232 rows=999999 loops=1) Output: t1.id, t1.info Hash Cond: (t1.id = t2.id) Buffers: shared hit=4432 -> Seq Scan on postgres.notcontain t1 (cost=0.00..14425.00 rows=1000000 width=36) (actual time=0.011..84.659 rows=1000000 loops=1) Output: t1.id, t1.info Buffers: shared hit=4425 -> Hash (cost=3.04..3.04 rows=1 width=4) (actual time=0.014..0.014 rows=1 loops=1) Output: t2.id Buckets: 1024 Batches: 1 Memory Usage: 9kB Buffers: shared hit=4 -> Index Only Scan using idx1 on postgres.notcontain t2 (cost=0.42..3.04 rows=1 width=4) (actual time=0.010..0.011 rows=1 loops=1) Output: t2.id Index Cond: (t2.id = 1) Heap Fetches: 1 Buffers: shared hit=4 Planning time: 0.147 ms Execution time: 297.127 ms (18 rows) postgres=# set enable_seqscan=off; SET postgres=# explain (analyze,verbose,timing,costs,buffers) select * from notcontain t1 where not exists (select 1 from notcontain t2 where t1.id=t2.id and t2.id=1); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------ Hash Anti Join (cost=3.48..35622.27 rows=999999 width=36) (actual time=0.036..324.401 rows=999999 loops=1) Output: t1.id, t1.info Hash Cond: (t1.id = t2.id) Buffers: shared hit=7164 -> Index Scan using idx1 on postgres.notcontain t1 (cost=0.42..22994.22 rows=1000000 width=36) (actual time=0.017..149.383 rows=1000000 loops=1) Output: t1.id, t1.info Buffers: shared hit=7160 -> Hash (cost=3.04..3.04 rows=1 width=4) (actual time=0.011..0.011 rows=1 loops=1) Output: t2.id Buckets: 1024 Batches: 1 Memory Usage: 9kB Buffers: shared hit=4 -> Index Only Scan using idx1 on postgres.notcontain t2 (cost=0.42..3.04 rows=1 width=4) (actual time=0.008..0.009 rows=1 loops=1) Output: t2.id Index Cond: (t2.id = 1) Heap Fetches: 1 Buffers: shared hit=4 Planning time: 0.141 ms Execution time: 369.749 ms (18 rows)

    3、PostgreSQL还支持多核并行,所以全表扫描还可以暴力提升性能。

    如果记录数非常多,使用并行扫描,性能提升非常明显。

    postgres=# create unlogged table ptbl(id int); CREATE TABLE postgres=# insert into ptbl select generate_series(1,100000000); postgres=# alter table ptbl set (parallel_workers =32); \timing 非并行查询: postgres=# set max_parallel_workers_per_gather =0; postgres=# select count(*) from ptbl where id<>1; count ---------- 99999999 (1 row) Time: 11863.151 ms (00:11.863) 并行查询: postgres=# set max_parallel_workers_per_gather =32; postgres=# select count(*) from ptbl where id<>1; count ---------- 99999999 (1 row) Time: 610.017 ms

    使用并行查询后,性能提升非常明显。

    例子4,普通类型partial BTREE索引,不等于检索

    对于固定的不等于查询,我们可以使用PostgreSQL的partial index功能。

    create table tbl (id int, info text, crt_time timestamp, c1 int); select * from tbl where c1<>1; insert into tbl select generate_series(1,10000000), 'test', now(), 1; insert into tbl values (1,'abc',now(),2); create index idx_tbl_1 on tbl(id) where c1<>1;

    cool,使用PARTIAL INDEX,0.03毫秒,在1000万数据中进行不等于检索。

    postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1<>1; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------- Index Scan using idx_tbl_1 on postgres.tbl (cost=0.12..1.44 rows=1 width=21) (actual time=0.015..0.015 rows=1 loops=1) Output: id, info, crt_time, c1 Buffers: shared hit=1 read=1 Planning time: 0.194 ms Execution time: 0.030 ms (5 rows)

    小结

    1、PostgreSQL内置了GIN索引,支持全文检索、支持数组等多值类型的搜索。

    2、PostgreSQL使用基于成本的执行计划优化器,会自动选择最优的执行计划,在进行不包含检索时,PostgreSQL会自动选择是否使用索引扫描。

    3、对于BTREE索引,理论上也能实现不等于的搜索(INDEX SKIP SCAN),目前内核层面还没有实现它,目前可以通过调整SQL的写法来使用索引扫描。

    4、PostgreSQL还支持多核并行,所以全表扫描还可以暴力提升性能。 如果记录数非常多,使用并行扫描,性能提升非常明显。

    5、PostgreSQL支持partial index,可以用于分区索引,或者部分索引。对于固定条件的不等于查询,效果非常显著。

    相关资源:新年快乐! python实现绚烂的烟花绽放效果
    最新回复(0)