PostgreSQL 按拼音排序 - convert to GBKEUC

xiaoxiao2025-09-18 154

背景

国内的应用，在文本排序上基本都是按照拼音来进行排序的。

在不同的字符集中，汉字的编码可能不一样，比如UTF8和GBK，其中GBK是按拼音的顺序进行编码的，而UTF8则不是。

所以如果你的数据库使用了UTF8编码，对中文字段进行排序时，可能得到的并不是按拼音排序的结果。

在PostgreSQL中，中文按拼音排序的编码包括GB18030, EUC_CN, GBK, BIG5, EUC_TW 等。

为了得到拼音排序，可以使用编码转换后的值来排序，索引也可以使用编码转换的表达式索引。

编码

PostgreSQL支持的编码如下

https://www.postgresql.org/docs/9.6/static/multibyte.html

PostgreSQL Character Sets

NameDescriptionLanguageServer?Bytes/CharAliasesBIG5Big FiveTraditional ChineseNo1-2WIN950, Windows950EUC_CNExtended UNIX Code-CNSimplified ChineseYes1-3-EUC_JPExtended UNIX Code-JPJapaneseYes1-3-EUC_JIS_2004Extended UNIX Code-JP, JIS X 0213JapaneseYes1-3-EUC_KRExtended UNIX Code-KRKoreanYes1-3-EUC_TWExtended UNIX Code-TWTraditional Chinese, TaiwaneseYes1-3-GB18030National StandardChineseNo1-4-GBKExtended National StandardSimplified ChineseNo1-2WIN936, Windows936ISO_8859_5ISO 8859-5, ECMA 113Latin/CyrillicYes1-ISO_8859_6ISO 8859-6, ECMA 114Latin/ArabicYes1-ISO_8859_7ISO 8859-7, ECMA 118Latin/GreekYes1-ISO_8859_8ISO 8859-8, ECMA 121Latin/HebrewYes1-JOHABJOHABKorean (Hangul)No1-3-KOI8RKOI8-RCyrillic (Russian)Yes1KOI8KOI8UKOI8-UCyrillic (Ukrainian)Yes1-LATIN1ISO 8859-1, ECMA 94Western EuropeanYes1ISO88591LATIN2ISO 8859-2, ECMA 94Central EuropeanYes1ISO88592LATIN3ISO 8859-3, ECMA 94South EuropeanYes1ISO88593LATIN4ISO 8859-4, ECMA 94North EuropeanYes1ISO88594LATIN5ISO 8859-9, ECMA 128TurkishYes1ISO88599LATIN6ISO 8859-10, ECMA 144NordicYes1ISO885910LATIN7ISO 8859-13BalticYes1ISO885913LATIN8ISO 8859-14CelticYes1ISO885914LATIN9ISO 8859-15LATIN1 with Euro and accentsYes1ISO885915LATIN10ISO 8859-16, ASRO SR 14111RomanianYes1ISO885916MULE_INTERNALMule internal codeMultilingual EmacsYes1-4SJISShift JISJapaneseNo1-2Mskanji, ShiftJIS, WIN932, Windows932SHIFT_JIS_2004Shift JIS, JIS X 0213JapaneseNo1-2-SQL_ASCIIunspecified (see text)anyYes1-UHCUnified Hangul CodeKoreanNo1-2WIN949, Windows949UTF8Unicode, 8-bitallYes1-4UnicodeWIN866Windows CP866CyrillicYes1ALTWIN874Windows CP874ThaiYes1-WIN1250Windows CP1250Central EuropeanYes1-WIN1251Windows CP1251CyrillicYes1WINWIN1252Windows CP1252Western EuropeanYes1-WIN1253Windows CP1253GreekYes1-WIN1254Windows CP1254TurkishYes1-WIN1255Windows CP1255HebrewYes1-WIN1256Windows CP1256ArabicYes1-WIN1257Windows CP1257BalticYes1-WIN1258Windows CP1258VietnameseYes1ABC, TCVN, TCVN5712, VSCII

与中文编码排序相关的包括 GB18030, EUC_CN, GBK, BIG5, EUC_TW

简体常用的包括GBK, EUC_CN。

编码转换

在PostgreSQL中，如果要将字符从一个编码转换为另一个编码，需要告诉数据库（create conversion）怎么转换（使用什么C函数），PG内置了一些转换的C函数和转换方法。

https://www.postgresql.org/docs/9.6/static/catalog-pg-conversion.html

pg_conversion

NameTypeReferencesDescriptionoidoid-Row identifier (hidden attribute; must be explicitly selected)connamename-Conversion name (unique within a namespace)connamespaceoidpg_namespace.oidThe OID of the namespace that contains this conversionconowneroidpg_authid.oidOwner of the conversionconforencodingint4-Source encoding IDcontoencodingint4-Destination encoding IDconprocregprocpg_proc.oidConversion procedurecondefaultbool-True if this is the default conversion

查看内置的转换方法

可以看到utf8转换为中文编码的都支持了

postgres=> select * from pg_conversion where conname ~* 'gbk|gb18|euc_cn|euc_tw|big5' order by 1; conname | connamespace | conowner | conforencoding | contoencoding | conproc | condefault -----------------+--------------+----------+----------------+---------------+-----------------+------------ big5_to_euc_tw | 11 | 10 | 36 | 4 | big5_to_euc_tw | t big5_to_mic | 11 | 10 | 36 | 7 | big5_to_mic | t big5_to_utf8 | 11 | 10 | 36 | 6 | big5_to_utf8 | t euc_cn_to_mic | 11 | 10 | 2 | 7 | euc_cn_to_mic | t euc_cn_to_utf8 | 11 | 10 | 2 | 6 | euc_cn_to_utf8 | t euc_tw_to_big5 | 11 | 10 | 4 | 36 | euc_tw_to_big5 | t euc_tw_to_mic | 11 | 10 | 4 | 7 | euc_tw_to_mic | t euc_tw_to_utf8 | 11 | 10 | 4 | 6 | euc_tw_to_utf8 | t gb18030_to_utf8 | 11 | 10 | 39 | 6 | gb18030_to_utf8 | t gbk_to_utf8 | 11 | 10 | 37 | 6 | gbk_to_utf8 | t mic_to_big5 | 11 | 10 | 7 | 36 | mic_to_big5 | t mic_to_euc_cn | 11 | 10 | 7 | 2 | mic_to_euc_cn | t mic_to_euc_tw | 11 | 10 | 7 | 4 | mic_to_euc_tw | t utf8_to_big5 | 11 | 10 | 6 | 36 | utf8_to_big5 | t utf8_to_euc_cn | 11 | 10 | 6 | 2 | utf8_to_euc_cn | t utf8_to_euc_tw | 11 | 10 | 6 | 4 | utf8_to_euc_tw | t utf8_to_gb18030 | 11 | 10 | 6 | 39 | utf8_to_gb18030 | t utf8_to_gbk | 11 | 10 | 6 | 37 | utf8_to_gbk | t (18 rows)

编码转换函数

注意数据库版本

PostgreSQL 8.x(如Greenplum), 将字符串从原编码转换为指定编码的字符串返回。

可能存在显示的问题。

PostgreSQL 9.x, 将源编码字符串的字节流转换为指定编码的字符串的字节流返回。

避免了显示的问题。

如果8.x需要避免显示问题，返回字节流，可以这样使用，推荐使用。

byteain(textout(convert(字符,'源编码','目标编码')))

例子

当前数据库编码为UTF-8，中文排序未按拼音排序。

postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+----------+----------+---------+-------+----------------------- db0 | postgres | UTF8 | C | C | postgres | postgres | UTF8 | C | C | template0 | postgres | UTF8 | C | C | =c/postgres + | | | | | postgres=CTc/postgres template1 | postgres | UTF8 | C | C | =c/postgres + | | | | | postgres=CTc/postgres (4 rows) postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by id; id -------- 刘少奇刘德华 (2 rows)

按拼音排序方法(目标可以改成EUC_CN)

8.x postgres=> select * from (values ('刘德华'), ('刘少奇')) t(id) order by byteain(textout(convert(id,'UTF-8','GBK'))); id -------- 刘德华刘少奇 (2 rows) 9.x postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by convert(id::bytea,'UTF-8','GBK'); id -------- 刘德华刘少奇 (2 rows)

注意多音字

中文有一些多音字，比如重庆(chongqing), 但是编码时它可能是按zhong编码的，所以看这个例子。

postgres=> select * from (values ('中山'), ('重庆')) t(id) order by byteain(textout(convert(id,'UTF-8','GBK'))); id ------ 中山重庆 (2 rows)

索引

表达式索引即可，使用immutable function.

代码

src/backend/utils/mb/mbutils.c

/* * Convert string between two arbitrary encodings. * * BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name) */ Datum pg_convert(PG_FUNCTION_ARGS) { bytea *string = PG_GETARG_BYTEA_PP(0); char *src_encoding_name = NameStr(*PG_GETARG_NAME(1)); int src_encoding = pg_char_to_encoding(src_encoding_name); char *dest_encoding_name = NameStr(*PG_GETARG_NAME(2)); int dest_encoding = pg_char_to_encoding(dest_encoding_name); const char *src_str; char *dest_str; bytea *retval; int len; if (src_encoding < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("invalid source encoding name \"%s\"", src_encoding_name))); if (dest_encoding < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("invalid destination encoding name \"%s\"", dest_encoding_name))); /* make sure that source string is valid */ len = VARSIZE_ANY_EXHDR(string); src_str = VARDATA_ANY(string); pg_verify_mbstr_len(src_encoding, src_str, len, false); /* perform conversion */ dest_str = (char *) pg_do_encoding_conversion((unsigned char *) src_str, len, src_encoding, dest_encoding); /* update len if conversion actually happened */ if (dest_str != src_str) len = strlen(dest_str); /* * build bytea data type structure. */ retval = (bytea *) palloc(len + VARHDRSZ); SET_VARSIZE(retval, len + VARHDRSZ); memcpy(VARDATA(retval), dest_str, len); if (dest_str != src_str) pfree(dest_str); /* free memory if allocated by the toaster */ PG_FREE_IF_COPY(string, 0); PG_RETURN_BYTEA_P(retval); }

参考

1. https://www.postgresql.org/docs/9.6/static/multibyte.html#AEN39011

2. https://www.postgresql.org/docs/9.6/static/sql-createconversion.html

3. https://www.postgresql.org/docs/9.6/static/catalog-pg-conversion.html

最新回复(0)

PostgreSQL 按拼音排序 - convert to GBKEUC

标签

背景

编码

编码转换

编码转换函数

例子

注意多音字

索引

代码

参考