PostgreSQL , conversion , pg_conversion , 拼音 , 编码转换 , convert , convert_to , convert_from
国内的应用,在文本排序上基本都是按照拼音来进行排序的。
在不同的字符集中,汉字的编码可能不一样,比如UTF8和GBK,其中GBK是按拼音的顺序进行编码的,而UTF8则不是。
所以如果你的数据库使用了UTF8编码,对中文字段进行排序时,可能得到的并不是按拼音排序的结果。
在PostgreSQL中,中文按拼音排序的编码包括GB18030, EUC_CN, GBK, BIG5, EUC_TW 等。
为了得到拼音排序,可以使用编码转换后的值来排序,索引也可以使用编码转换的表达式索引。
PostgreSQL支持的编码如下
https://www.postgresql.org/docs/9.6/static/multibyte.html
PostgreSQL Character Sets
NameDescriptionLanguageServer?Bytes/CharAliasesBIG5Big FiveTraditional ChineseNo1-2WIN950, Windows950EUC_CNExtended UNIX Code-CNSimplified ChineseYes1-3-EUC_JPExtended UNIX Code-JPJapaneseYes1-3-EUC_JIS_2004Extended UNIX Code-JP, JIS X 0213JapaneseYes1-3-EUC_KRExtended UNIX Code-KRKoreanYes1-3-EUC_TWExtended UNIX Code-TWTraditional Chinese, TaiwaneseYes1-3-GB18030National StandardChineseNo1-4-GBKExtended National StandardSimplified ChineseNo1-2WIN936, Windows936ISO_8859_5ISO 8859-5, ECMA 113Latin/CyrillicYes1-ISO_8859_6ISO 8859-6, ECMA 114Latin/ArabicYes1-ISO_8859_7ISO 8859-7, ECMA 118Latin/GreekYes1-ISO_8859_8ISO 8859-8, ECMA 121Latin/HebrewYes1-JOHABJOHABKorean (Hangul)No1-3-KOI8RKOI8-RCyrillic (Russian)Yes1KOI8KOI8UKOI8-UCyrillic (Ukrainian)Yes1-LATIN1ISO 8859-1, ECMA 94Western EuropeanYes1ISO88591LATIN2ISO 8859-2, ECMA 94Central EuropeanYes1ISO88592LATIN3ISO 8859-3, ECMA 94South EuropeanYes1ISO88593LATIN4ISO 8859-4, ECMA 94North EuropeanYes1ISO88594LATIN5ISO 8859-9, ECMA 128TurkishYes1ISO88599LATIN6ISO 8859-10, ECMA 144NordicYes1ISO885910LATIN7ISO 8859-13BalticYes1ISO885913LATIN8ISO 8859-14CelticYes1ISO885914LATIN9ISO 8859-15LATIN1 with Euro and accentsYes1ISO885915LATIN10ISO 8859-16, ASRO SR 14111RomanianYes1ISO885916MULE_INTERNALMule internal codeMultilingual EmacsYes1-4SJISShift JISJapaneseNo1-2Mskanji, ShiftJIS, WIN932, Windows932SHIFT_JIS_2004Shift JIS, JIS X 0213JapaneseNo1-2-SQL_ASCIIunspecified (see text)anyYes1-UHCUnified Hangul CodeKoreanNo1-2WIN949, Windows949UTF8Unicode, 8-bitallYes1-4UnicodeWIN866Windows CP866CyrillicYes1ALTWIN874Windows CP874ThaiYes1-WIN1250Windows CP1250Central EuropeanYes1-WIN1251Windows CP1251CyrillicYes1WINWIN1252Windows CP1252Western EuropeanYes1-WIN1253Windows CP1253GreekYes1-WIN1254Windows CP1254TurkishYes1-WIN1255Windows CP1255HebrewYes1-WIN1256Windows CP1256ArabicYes1-WIN1257Windows CP1257BalticYes1-WIN1258Windows CP1258VietnameseYes1ABC, TCVN, TCVN5712, VSCII与中文编码排序相关的包括 GB18030, EUC_CN, GBK, BIG5, EUC_TW
简体常用的包括GBK, EUC_CN。
在PostgreSQL中,如果要将字符从一个编码转换为另一个编码,需要告诉数据库(create conversion)怎么转换(使用什么C函数),PG内置了一些转换的C函数和转换方法。
https://www.postgresql.org/docs/9.6/static/catalog-pg-conversion.html
pg_conversion
NameTypeReferencesDescriptionoidoid-Row identifier (hidden attribute; must be explicitly selected)connamename-Conversion name (unique within a namespace)connamespaceoidpg_namespace.oidThe OID of the namespace that contains this conversionconowneroidpg_authid.oidOwner of the conversionconforencodingint4-Source encoding IDcontoencodingint4-Destination encoding IDconprocregprocpg_proc.oidConversion procedurecondefaultbool-True if this is the default conversion查看内置的转换方法
可以看到utf8转换为中文编码的都支持了
postgres=> select * from pg_conversion where conname ~* 'gbk|gb18|euc_cn|euc_tw|big5' order by 1; conname | connamespace | conowner | conforencoding | contoencoding | conproc | condefault -----------------+--------------+----------+----------------+---------------+-----------------+------------ big5_to_euc_tw | 11 | 10 | 36 | 4 | big5_to_euc_tw | t big5_to_mic | 11 | 10 | 36 | 7 | big5_to_mic | t big5_to_utf8 | 11 | 10 | 36 | 6 | big5_to_utf8 | t euc_cn_to_mic | 11 | 10 | 2 | 7 | euc_cn_to_mic | t euc_cn_to_utf8 | 11 | 10 | 2 | 6 | euc_cn_to_utf8 | t euc_tw_to_big5 | 11 | 10 | 4 | 36 | euc_tw_to_big5 | t euc_tw_to_mic | 11 | 10 | 4 | 7 | euc_tw_to_mic | t euc_tw_to_utf8 | 11 | 10 | 4 | 6 | euc_tw_to_utf8 | t gb18030_to_utf8 | 11 | 10 | 39 | 6 | gb18030_to_utf8 | t gbk_to_utf8 | 11 | 10 | 37 | 6 | gbk_to_utf8 | t mic_to_big5 | 11 | 10 | 7 | 36 | mic_to_big5 | t mic_to_euc_cn | 11 | 10 | 7 | 2 | mic_to_euc_cn | t mic_to_euc_tw | 11 | 10 | 7 | 4 | mic_to_euc_tw | t utf8_to_big5 | 11 | 10 | 6 | 36 | utf8_to_big5 | t utf8_to_euc_cn | 11 | 10 | 6 | 2 | utf8_to_euc_cn | t utf8_to_euc_tw | 11 | 10 | 6 | 4 | utf8_to_euc_tw | t utf8_to_gb18030 | 11 | 10 | 6 | 39 | utf8_to_gb18030 | t utf8_to_gbk | 11 | 10 | 6 | 37 | utf8_to_gbk | t (18 rows)注意数据库版本
PostgreSQL 8.x(如Greenplum), 将字符串从原编码转换为指定编码的字符串返回。
可能存在显示的问题。
List of functions Schema | Name | Result data type | Argument data types | Type ------------+---------------+------------------+---------------------+-------- pg_catalog | convert | text | text, name | normal pg_catalog | convert | text | text, name, name | normalPostgreSQL 9.x, 将源编码字符串的字节流转换为指定编码的字符串的字节流返回。
避免了显示的问题。
List of functions Schema | Name | Result data type | Argument data types | Type ------------+--------------+------------------+---------------------+-------- pg_catalog | convert | bytea | bytea, name, name | normal pg_catalog | convert_from | text | bytea, name | normal pg_catalog | convert_to | bytea | text, name | normal如果8.x需要避免显示问题,返回字节流,可以这样使用,推荐使用。
byteain(textout(convert(字符,'源编码','目标编码')))当前数据库编码为UTF-8,中文排序未按拼音排序。
postgres=# \l List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges -----------+----------+----------+---------+-------+----------------------- db0 | postgres | UTF8 | C | C | postgres | postgres | UTF8 | C | C | template0 | postgres | UTF8 | C | C | =c/postgres + | | | | | postgres=CTc/postgres template1 | postgres | UTF8 | C | C | =c/postgres + | | | | | postgres=CTc/postgres (4 rows) postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by id; id -------- 刘少奇 刘德华 (2 rows)按拼音排序方法(目标可以改成EUC_CN)
8.x postgres=> select * from (values ('刘德华'), ('刘少奇')) t(id) order by byteain(textout(convert(id,'UTF-8','GBK'))); id -------- 刘德华 刘少奇 (2 rows) 9.x postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by convert(id::bytea,'UTF-8','GBK'); id -------- 刘德华 刘少奇 (2 rows)中文有一些多音字,比如重庆(chongqing), 但是编码时它可能是按zhong编码的,所以看这个例子。
postgres=> select * from (values ('中山'), ('重庆')) t(id) order by byteain(textout(convert(id,'UTF-8','GBK'))); id ------ 中山 重庆 (2 rows)表达式索引即可,使用immutable function.
src/backend/utils/mb/mbutils.c
/* * Convert string between two arbitrary encodings. * * BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name) */ Datum pg_convert(PG_FUNCTION_ARGS) { bytea *string = PG_GETARG_BYTEA_PP(0); char *src_encoding_name = NameStr(*PG_GETARG_NAME(1)); int src_encoding = pg_char_to_encoding(src_encoding_name); char *dest_encoding_name = NameStr(*PG_GETARG_NAME(2)); int dest_encoding = pg_char_to_encoding(dest_encoding_name); const char *src_str; char *dest_str; bytea *retval; int len; if (src_encoding < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("invalid source encoding name \"%s\"", src_encoding_name))); if (dest_encoding < 0) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("invalid destination encoding name \"%s\"", dest_encoding_name))); /* make sure that source string is valid */ len = VARSIZE_ANY_EXHDR(string); src_str = VARDATA_ANY(string); pg_verify_mbstr_len(src_encoding, src_str, len, false); /* perform conversion */ dest_str = (char *) pg_do_encoding_conversion((unsigned char *) src_str, len, src_encoding, dest_encoding); /* update len if conversion actually happened */ if (dest_str != src_str) len = strlen(dest_str); /* * build bytea data type structure. */ retval = (bytea *) palloc(len + VARHDRSZ); SET_VARSIZE(retval, len + VARHDRSZ); memcpy(VARDATA(retval), dest_str, len); if (dest_str != src_str) pfree(dest_str); /* free memory if allocated by the toaster */ PG_FREE_IF_COPY(string, 0); PG_RETURN_BYTEA_P(retval); }1. https://www.postgresql.org/docs/9.6/static/multibyte.html#AEN39011
2. https://www.postgresql.org/docs/9.6/static/sql-createconversion.html
3. https://www.postgresql.org/docs/9.6/static/catalog-pg-conversion.html