PostgreSQL , 模糊查询 , 正则查询 , pg_trgm , bytea , gin , 函数索引
前模糊(有前缀的模糊),后模糊(有后缀的模糊),前后模糊(无前后缀的模糊),正则匹配都属于文本搜索领域常见的需求。
PostgreSQL拥有很强的文本搜索能力,除了支持全文检索,还支持模糊查询、正则查询。内置的pg_trgm插件是一般数据库没有的,可能很多人没有听说过。同时还内置了表达式索引、GIN索引的功能。
不同的模糊查询需求,有不同的优化方法。
对于前模糊和后模糊,PostgreSQL则与其他数据库一样,可以使用btree来加速。后模糊可以使用反转函数的函数索引来加速。
对于前后模糊和正则匹配,一种方法是使用pg_trgm插件,利用GIN索引加速模糊和正则查询(输入3个或3个以上字符的模糊查询效果很好)。另一种方法是自定义GIN表达式索引的方法,适合于定制的模糊查询。
1. 前模糊(有前缀的模糊)优化方法
使用b-tree可以支持前模糊的查询。
1.1 当使用类型默认的index ops class时,仅适合于collate="C"的查询(当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate "C")。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create table test(id int, info text); CREATE TABLE test=# insert into test select generate_series(1,1000000),md5(random()::text); INSERT 0 1000000 test=# create index idx on test(info collate "C"); CREATE INDEX test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C"; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1) Output: id, info Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text)) Filter: (test.info ~~ 'abcd%'::text COLLATE "C") Buffers: shared hit=18 read=3 Planning time: 0.424 ms Execution time: 0.124 ms (7 rows)1.2 当数据库默认的lc_collate<>C时,还有一种方法让b-tree索引支持模糊查询。使用对应类型的pattern ops,使用pattern ops将使用字符的查询方式而非binary的搜索方式。
文档中有如下解释
https://www.postgresql.org/docs/9.6/static/indexes-opclass.html
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops support B-tree indexes on the types text, varchar, and char respectively. The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.例子
test=# drop table test; DROP TABLE test=# create table test(id int, info text); CREATE TABLE test=# insert into test select generate_series(1,1000000),md5(random()::text); INSERT 0 1000000 test=# create index idx on test(info text_pattern_ops); CREATE INDEX test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "zh_CN"; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1) Output: id, info Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text)) Filter: (test.info ~~ 'abcd%'::text COLLATE "zh_CN") Buffers: shared hit=12 read=3 Planning time: 0.253 ms Execution time: 0.081 ms (7 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C"; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1) Output: id, info Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text)) Filter: (test.info ~~ 'abcd%'::text COLLATE "C") Buffers: shared hit=15 Planning time: 0.141 ms Execution time: 0.072 ms (7 rows)使用类型对应的pattern ops,索引搜索不仅支持LIKE的写法,还支持规则表达式的写法,如下:
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '^abcd'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1) Output: id, info Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text)) Filter: (test.info ~ '^abcd'::text) Buffers: shared hit=15 Planning time: 0.213 ms Execution time: 0.083 ms (7 rows)2. 后模糊(有后缀的模糊)的优化方法
使用反转函数(reverse)索引,可以支持后模糊的查询。
2.1 当使用类型默认的index ops class时,仅适合于collate="C"的查询(当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate "C")。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create index idx1 on test(reverse(info) collate "C"); CREATE INDEX test=# select * from test limit 1; id | info ----+---------------------------------- 1 | b3275976cdd437a033d4329775a52514 (1 row) test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%' collate "C"; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.061..0.097 rows=18 loops=1) Output: id, info Index Cond: ((reverse(test.info) >= '4152'::text) AND (reverse(test.info) < '4153'::text)) Filter: (reverse(test.info) ~~ '4152%'::text COLLATE "C") Buffers: shared hit=18 read=3 Planning time: 0.128 ms Execution time: 0.122 ms (7 rows) test=# select * from test where reverse(info) like '4152%' collate "C"; id | info --------+---------------------------------- 847904 | abe2ecd90393b5275df8e34a39702514 414702 | 97f66d26545329321164042657d02514 191232 | 7820972c6220c2b01d46c11ebb532514 752742 | 93232ac39c6632e2540df44627c42514 217302 | 39e518893a1a7b1e691619bd1fc42514 1 | b3275976cdd437a033d4329775a52514 615718 | 4948f94c484c13dc6c4fae8a3db52514 308815 | fc2918ceff7c7a4dafd2e04031062514 149521 | 546d963842ea5ca593e622c810262514 811093 | 4b6eca2eb6b665af67b2813e91a62514 209000 | 1dfd0d4e326715c1739f031cca992514 937616 | 8827fd81f5b673fb5afecbe3e11b2514 419553 | bd6e01ce360af16137e8b6abc8ab2514 998324 | 7dff51c19dc5e5d9979163e7d14c2514 771518 | 8a54e30003a48539fff0aedc73ac2514 691566 | f90368348e3b6bf983fcbe10db2d2514 652274 | 8bf4a97b5f122a5540a21fa85ead2514 233437 | 739ed715fc203d47e37e79b5bcbe2514 (18 rows)2.2 当数据库默认的lc_collate<>C时,还有一种方法让b-tree索引支持模糊查询。使用对应类型的pattern ops,使用pattern ops将使用字符的查询方式而非binary的搜索方式。
使用类型对应的pattern ops,索引搜索不仅支持LIKE的写法,还支持规则表达式的写法。
例子
test=# create index idx1 on test(reverse(info) text_pattern_ops); CREATE INDEX test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) like '4152%'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.026..0.049 rows=12 loops=1) Output: id, info Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text)) Filter: (reverse(test.info) ~~ '4152%'::text) Buffers: shared hit=15 Planning time: 0.102 ms Execution time: 0.072 ms (7 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test where reverse(info) ~ '^4152'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Index Scan using idx1 on public.test (cost=0.42..4009.43 rows=5000 width=37) (actual time=0.031..0.063 rows=12 loops=1) Output: id, info Index Cond: ((reverse(test.info) ~>=~ '4152'::text) AND (reverse(test.info) ~<~ '4153'::text)) Filter: (reverse(test.info) ~ '^4152'::text) Buffers: shared hit=15 Planning time: 0.148 ms Execution time: 0.087 ms (7 rows)3. 前、后模糊的合体优化方法
使用pg_trgm索引,可以支持前、后模糊的查询。
注意:
(有前缀的模糊)至少输入1个字符,(有后缀的模糊)至少输入2个字符,才有好的索引过滤效果。
如果要高效支持多字节字符(例如中文),数据库lc_ctype不能为"C",只有TOKEN分割正确效果才是OK的。(因为lc_ctype决定了多字节字符中什么是字:LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。
索引、查询条件的collate必须一致才能使用索引。
test=# \l+ test List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description ------+----------+----------+------------+------------+-------------------+--------+------------+------------- test | postgres | UTF8 | zh_CN.utf8 | zh_CN.utf8 | | 245 MB | pg_default | (1 row) test=# create extension pg_trgm; test=# create table test001(c1 text); CREATE TABLE生成随机中文字符串的函数
test=# create or replace function gen_hanzi(int) returns text as $$ declare res text; begin if $1 >=1 then select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1); return res; end if; return null; end; $$ language plpgsql strict; CREATE FUNCTION生成随机数据
test=# insert into test001 select gen_hanzi(20) from generate_series(1,100000); INSERT 0 100000 test=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops); CREATE INDEX test=# select * from test001 limit 5; c1 ------------------------------------------ 埳噪办甾讷昃碇玾陧箖燋邢賀浮媊踮菵暔谉橅 秌橑籛鴎拟倶敤麁鼋醠轇坙騉鏦纗蘛婃坹娴儅 蔎緾鎧爪鵬二悲膼朠麻鸂鋬楨窷違繇糭嘓索籓 馳泅薬鐗愅撞窍浉渗蛁灎厀攚摐瞪拡擜詜隝緼 襳铺煃匶瀌懲荼黹樆惺箧搔羾憯墆鋃硍蔓恧顤 (5 rows)模糊查询
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.030..0.034 rows=3 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '你%'::text) Heap Blocks: exact=3 Buffers: shared hit=7 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=3 loops=1) Index Cond: (test001.c1 ~~ '你%'::text) Buffers: shared hit=4 Planning time: 0.119 ms Execution time: 0.063 ms (10 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%恧顤'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.031..0.034 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%恧顤'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=2 Buffers: shared hit=6 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.020..0.020 rows=2 loops=1) Index Cond: (test001.c1 ~~ '%恧顤'::text) Buffers: shared hit=4 Planning time: 0.136 ms Execution time: 0.062 ms (11 rows)使用pg_trgm插件,支持前后模糊的查询。
注意:
如果要让pg_trgm高效支持多字节字符(例如中文),数据库lc_ctype不能为"C",只有TOKEN分割正确效果才是OK的。(因为lc_ctype决定了多字节字符中什么是字:LC_CTYPE: Character classification (What is a letter? Its upper-case equivalent?))。
建议输入3个或3个以上字符,否则效果不佳(后面会分析原因)。
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢賀%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=5.08..15.20 rows=10 width=61) (actual time=0.038..0.038 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%燋邢賀%'::text) Heap Blocks: exact=1 Buffers: shared hit=5 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..5.08 rows=10 width=0) (actual time=0.025..0.025 rows=1 loops=1) Index Cond: (test001.c1 ~~ '%燋邢賀%'::text) Buffers: shared hit=4 Planning time: 0.170 ms Execution time: 0.076 ms (10 rows) test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%燋邢%'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=7615669.08..7615679.20 rows=10 width=61) (actual time=147.524..178.232 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~~ '%燋邢%'::text) Rows Removed by Index Recheck: 99999 Heap Blocks: exact=1137 Buffers: shared hit=14429 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..7615669.08 rows=10 width=0) (actual time=147.377..147.377 rows=100000 loops=1) Index Cond: (test001.c1 ~~ '%燋邢%'::text) Buffers: shared hit=13292 Planning time: 0.133 ms Execution time: 178.265 ms (11 rows)PostgreSQL 正则匹配的语法为 字符串 ~ 'pattern' 或 字符串 ~* 'pattern'
https://www.postgresql.org/docs/9.6/static/functions-matching.html
例子
test=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '12[0-9]{3,9}'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on public.test001 (cost=65.08..75.20 rows=10 width=61) (actual time=0.196..0.196 rows=0 loops=1) Output: c1 Recheck Cond: (test001.c1 ~ '12[0-9]{3,9}'::text) Rows Removed by Index Recheck: 1 Heap Blocks: exact=1 Buffers: shared hit=50 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..65.08 rows=10 width=0) (actual time=0.183..0.183 rows=1 loops=1) Index Cond: (test001.c1 ~ '12[0-9]{3,9}'::text) Buffers: shared hit=49 Planning time: 0.452 ms Execution time: 0.221 ms (11 rows) test01=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 ~ '宸朾啣' collate "zh_CN"; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.test001 (cost=6.58..19.42 rows=10 width=61) (actual time=0.061..0.061 rows=1 loops=1) Output: c1 Recheck Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN") Heap Blocks: exact=1 Buffers: shared hit=5 -> Bitmap Index Scan on idx_test001_1 (cost=0.00..6.58 rows=10 width=0) (actual time=0.049..0.049 rows=1 loops=1) Index Cond: (test001.c1 ~ '宸朾啣'::text COLLATE "zh_CN") Buffers: shared hit=4 Planning time: 0.238 ms Execution time: 0.082 ms (10 rows)正则匹配索引原理参考contrib/pg_trgm/trgm_regexp.c
首先,pg_trgm将字符串的前端添加2个空格,末尾添加1个空格。
然后,每连续的3个字符为一个TOKEN,拆开。
最后,对TOKEN建立GIN倒排索引。
查看字符串的TOKEN,可以使用如下方法。
test=# select show_trgm('123'); show_trgm ------------------------- {" 1"," 12",123,"23 "} (1 row)使用pg_trgm时,如果要获得最好的效果,最好满足这些条件。
1. 有前缀的模糊查询,例如a%,至少需要提供1个字符。( 搜索的是token=' a' )
2. 有后缀的模糊查询,例如
