阿里云HybridDB for PG实践 - 列存储加字段和默认值

    xiaoxiao2022-06-19  186

    标签

    PostgreSQL , Greenplum , 相对偏移 , 列存储 , appendonly , AO表


    背景

    Greenplum的Append only table支持更新、删除。通过什么支持呢?bitmap文件,标记被删除的行。

    因此在更新,删除后,数据可能膨胀。

    另一方面,列存储每列一个文件,同一行通过偏移对应起来。例如INT8的两个字段,通过偏移很快能找到某一行的A列对应的B列。

    接下来谈谈加字段,在加字段时,AO表示不会REWRITE TABLE的。如果AO表以及有一些垃圾(被删除的数据)记录后,加的字段会使用相对偏移吗(换句话说,需不需要填充已删除的记录),答案是要填充。

    例子

    1、创建3张AO列存表。

    postgres=# create table tbl1 (id int, info text) with (appendonly=true, blocksize=8192, compresstype=none, orientation=column); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE postgres=# create table tbl2 (id int, info text) with (appendonly=true, blocksize=8192, compresstype=none, orientation=column); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE postgres=# create table tbl3 (id int, info text) with (appendonly=true, blocksize=8192, compresstype=none, orientation=column); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE

    2、前两张分别插入1000万记录,最后一张插入2000万记录。

    postgres=# insertinto tbl1 select generate_series(1,10000000),'test'; INSERT 0 10000000 postgres=# insert into tbl2 select generate_series(1,10000000),'test'; INSERT 0 10000000 postgres=# insert into tbl3 select generate_series(1,20000000),'test'; INSERT 0 20000000

    3、分析表,并记录它们的大小

    postgres=# analyze tbl1; ANALYZE postgres=# analyze tbl2; ANALYZE postgres=# analyze tbl3; ANALYZE postgres=# select pg_size_pretty(pg_relation_size('tbl1')); pg_size_pretty ---------------- 88 MB (1 row) postgres=# select pg_size_pretty(pg_relation_size('tbl2')); pg_size_pretty ---------------- 88 MB (1 row) postgres=# select pg_size_pretty(pg_relation_size('tbl3')); pg_size_pretty ---------------- 173 MB (1 row)

    4、更新第一张表,全表更新。并记录更新后的大小,翻了一倍。

    postgres=# update tbl1 set info='test'; UPDATE 10000000 postgres=# analyze tbl1; ANALYZE postgres=# select pg_size_pretty(pg_relation_size('tbl1')); pg_size_pretty ---------------- 173 MB (1 row)

    5、对三个表添加字段,设置默认值。

    postgres=# alter table tbl1 add column c1 int8 default 1; ALTER TABLE postgres=# alter table tbl2 add column c1 int8 default 1; ALTER TABLE postgres=# alter table tbl3 add column c1 int8 default 1; ALTER TABLE

    6、分析表,查看表的大小。

    postgres=# analyze tbl1; ANALYZE postgres=# analyze tbl2; ANALYZE postgres=# analyze tbl3; ANALYZE postgres=# select pg_size_pretty(pg_relation_size('tbl1')); pg_size_pretty ---------------- 325 MB (1 row) postgres=# select pg_size_pretty(pg_relation_size('tbl2')); pg_size_pretty ---------------- 163 MB (1 row) postgres=# select pg_size_pretty(pg_relation_size('tbl3')); pg_size_pretty ---------------- 325 MB (1 row)

    很显然,AO表在添加字段时,以已有文件的记录数为准(即使全部都删除了,也需要在新增字段上初始化这个值),。

    如果有相对偏移量的概念,至少tbl2可以忽略前1000万行的偏移。

    堆表不存在这个问题,因为堆表加字段会REWRITE全表。

    表膨胀的处理

    《如何检测、清理Greenplum垃圾 - 阿里云HybridDB for PG最佳实践》


    最新回复(0)