PostgreSQL , 按拼音排序 , collate
数据库为了支持国际化,通常会涉及到collate, ctype的概念。
初始化数据库集群时,可以设置如下参数,用于设置数据库的字符串排序、字符归类方法、数值\日期\时间\货币的格式等。
LC_COLLATEString sort orderLC_CTYPECharacter classification (What is a letter? Its upper-case equivalent?)LC_MESSAGESLanguage of messagesLC_MONETARYFormatting of currency amountsLC_NUMERICFormatting of numbersLC_TIMEFormatting of dates and times用户可以利用这些特性,按本土化需求,输出对应的顺序或者格式。
按中文的拼音为顺序排序就是一个常见的需求。
用户可以参考PostgreSQL的官方文档,有对应的字符集支持列表
https://www.postgresql.org/docs/9.6/static/multibyte.html
Server=Yes表示该字符集支持用于create database。否则只支持作为客户端字符集。
NameDescriptionLanguageServer?Bytes/CharAliasesBIG5Big FiveTraditional ChineseNo1-2WIN950, Windows950EUC_CNExtended UNIX Code-CNSimplified ChineseYes1-3-EUC_JPExtended UNIX Code-JPJapaneseYes1-3-EUC_JIS_2004Extended UNIX Code-JP, JIS X 0213JapaneseYes1-3-EUC_KRExtended UNIX Code-KRKoreanYes1-3-EUC_TWExtended UNIX Code-TWTraditional Chinese, TaiwaneseYes1-3-GB18030National StandardChineseNo1-4-GBKExtended National StandardSimplified ChineseNo1-2WIN936, Windows936ISO_8859_5ISO 8859-5, ECMA 113Latin/CyrillicYes1-ISO_8859_6ISO 8859-6, ECMA 114Latin/ArabicYes1-ISO_8859_7ISO 8859-7, ECMA 118Latin/GreekYes1-ISO_8859_8ISO 8859-8, ECMA 121Latin/HebrewYes1-JOHABJOHABKorean (Hangul)No1-3-KOI8RKOI8-RCyrillic (Russian)Yes1KOI8KOI8UKOI8-UCyrillic (Ukrainian)Yes1-LATIN1ISO 8859-1, ECMA 94Western EuropeanYes1ISO88591LATIN2ISO 8859-2, ECMA 94Central EuropeanYes1ISO88592LATIN3ISO 8859-3, ECMA 94South EuropeanYes1ISO88593LATIN4ISO 8859-4, ECMA 94North EuropeanYes1ISO88594LATIN5ISO 8859-9, ECMA 128TurkishYes1ISO88599LATIN6ISO 8859-10, ECMA 144NordicYes1ISO885910LATIN7ISO 8859-13BalticYes1ISO885913LATIN8ISO 8859-14CelticYes1ISO885914LATIN9ISO 8859-15LATIN1 with Euro and accentsYes1ISO885915LATIN10ISO 8859-16, ASRO SR 14111RomanianYes1ISO885916MULE_INTERNALMule internal codeMultilingual EmacsYes1-4-SJISShift JISJapaneseNo1-2Mskanji, ShiftJIS, WIN932, Windows932SHIFT_JIS_2004Shift JIS, JIS X 0213JapaneseNo1-2-SQL_ASCIIunspecified (see text)anyYes1-UHCUnified Hangul CodeKoreanNo1-2WIN949, Windows949UTF8Unicode, 8-bitallYes1-4UnicodeWIN866Windows CP866CyrillicYes1ALTWIN874Windows CP874ThaiYes1-WIN1250Windows CP1250Central EuropeanYes1-WIN1251Windows CP1251CyrillicYes1WINWIN1252Windows CP1252Western EuropeanYes1-WIN1253Windows CP1253GreekYes1-WIN1254Windows CP1254TurkishYes1-WIN1255Windows CP1255HebrewYes1-WIN1256Windows CP1256ArabicYes1-WIN1257Windows CP1257BalticYes1-WIN1258Windows CP1258VietnameseYes1ABC, TCVN, TCVN5712, VSCII使用如下SQL可以查询系统表pg_collation得到字符集支持的lc_collate和lc_ctype。
其中encoding为空时,表示这个collation支持所有的字符集。
test=> select pg_encoding_to_char(collencoding) as encoding,collname,collcollate,collctype from pg_collation ; encoding | collname | collcollate | collctype ------------+-----------------------+-----------------------+----------------------- | default | | | C | C | C | POSIX | POSIX | POSIX UTF8 | aa_DJ | aa_DJ.utf8 | aa_DJ.utf8 LATIN1 | aa_DJ | aa_DJ | aa_DJ LATIN1 | aa_DJ.iso88591 | aa_DJ.iso88591 | aa_DJ.iso88591 UTF8 | aa_DJ.utf8 | aa_DJ.utf8 | aa_DJ.utf8 UTF8 | aa_ER | aa_ER | aa_ER UTF8 | aa_ER.utf8 | aa_ER.utf8 | aa_ER.utf8 ....... EUC_CN | zh_CN | zh_CN | zh_CN UTF8 | zh_CN | zh_CN.utf8 | zh_CN.utf8 EUC_CN | zh_CN.gb2312 | zh_CN.gb2312 | zh_CN.gb2312 UTF8 | zh_CN.utf8 | zh_CN.utf8 | zh_CN.utf8 UTF8 | zh_HK | zh_HK.utf8 | zh_HK.utf8 UTF8 | zh_HK.utf8 | zh_HK.utf8 | zh_HK.utf8 EUC_CN | zh_SG | zh_SG | zh_SG UTF8 | zh_SG | zh_SG.utf8 | zh_SG.utf8 EUC_CN | zh_SG.gb2312 | zh_SG.gb2312 | zh_SG.gb2312 UTF8 | zh_SG.utf8 | zh_SG.utf8 | zh_SG.utf8 EUC_TW | zh_TW | zh_TW.euctw | zh_TW.euctw UTF8 | zh_TW | zh_TW.utf8 | zh_TW.utf8 EUC_TW | zh_TW.euctw | zh_TW.euctw | zh_TW.euctw UTF8 | zh_TW.utf8 | zh_TW.utf8 | zh_TW.utf8 UTF8 | zu_ZA | zu_ZA.utf8 | zu_ZA.utf8 LATIN1 | zu_ZA | zu_ZA | zu_ZA LATIN1 | zu_ZA.iso88591 | zu_ZA.iso88591 | zu_ZA.iso88591 UTF8 | zu_ZA.utf8 | zu_ZA.utf8 | zu_ZA.utf8 (869 rows)《如何设置数据库的LC_COLLATE, LC_CTYPE, ENCODING, TEMPLATE》
在操作前,请了解清楚与您当前数据库字符集(encoding)兼容的collate,使用如下SQL可以得到当前数据库的encoding
postgres=# select datname,pg_encoding_to_char(encoding) as encoding from pg_database; datname | encoding --------------------+----------- template1 | UTF8 template0 | UTF8 db | SQL_ASCII db1 | EUC_CN contrib_regression | UTF8 test01 | UTF8 test02 | UTF8 postgres | UTF8 (8 rows)1. 在创建表时,指定兼容当前字符集的collate
CREATE TABLE test1 ( a text COLLATE "de_DE", b text COLLATE "es_ES", ... );2. 修改列collate(会导致rewrite table),大表请谨慎操作
alter table a alter c1 type text COLLATE "zh_CN";1. 使用本土化, 改变order by输出排序
test=# select * from a order by c1 collate "C"; c1 -------- 刘少奇 刘德华 (2 rows) test=# select * from a order by c1 collate "zh_CN"; c1 -------- 刘德华 刘少奇 (2 rows)2. 使用本土化, 改变操作符的结果
test=# select * from a where c1 > '刘少奇' collate "C"; c1 -------- 刘德华 (1 row) test=# select * from a where c1 > '刘少奇' collate "zh_CN"; c1 ---- (0 rows)注意排序语句中的collate与索引的collate保持一致,才能使用这个索引进行排序。
postgres=# create index idxa on a(c1 collate "zh_CN"); CREATE INDEX postgres=# explain select * from a order by c1 collate "zh_CN"; QUERY PLAN ------------------------------------------------------------------------ Index Only Scan using idxa on a (cost=0.15..31.55 rows=1360 width=64) (1 row)1. 方法1,使用本土化SQL(不修改原有数据)
test=# select * from a order by c1 collate "zh_CN"; c1 -------- 刘德华 刘少奇 (2 rows)2. 方法2,使用本土化字段(如果已有数据,则需要调整原有数据)
alter table a alter c1 type text COLLATE "zh_CN";3. 方法3,使用本土化索引以及本土化SQL(不修改原有数据)
postgres=# create index idxa on a(c1 collate "zh_CN"); CREATE INDEX postgres=# explain select * from a order by c1 collate "zh_CN"; QUERY PLAN ------------------------------------------------------------------------ Index Only Scan using idxa on a (cost=0.15..31.55 rows=1360 width=64) (1 row)4. 设置数据库的collate为zh_CN,将默认使用这个collate,按拼音排序
test02=# create database test03 encoding 'UTF8' lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8' template template0; CREATE DATABASE test02=# \c test03 You are now connected to database "test03" as user "postgres". test03=# select * from (values ('刘德华'),('刘少奇')) as a(c1) order by c1 ; c1 -------- 刘德华 刘少奇 (2 rows)有些多音字,例如重庆(chongqing), 编码时"重"可能是按zhong编码,影响输出。
test03=# select * from (values ('中山'),('重庆')) as a(c1) order by c1 collate "zh_CN"; c1 ------ 中山 重庆 (2 rows)greenplum不支持单列设置collate,按拼音排序有些许不同。
在greenplum中,可以使用字符集转换,按对应二进制排序,得到拼音排序的效果。
postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by byteain(textout(convert(id,'UTF8','EUC_CN'))); id -------- 刘德华 刘少奇 (2 rows)《PostgreSQL 按拼音排序 - convert to GBK/EUC_CN coding》
《如何设置数据库的LC_COLLATE, LC_CTYPE, ENCODING, TEMPLATE》
https://www.postgresql.org/docs/9.6/static/charset.html
相关资源:七夕情人节表白HTML源码(两款)