《Greenplum企业应用实战》 Section 3.3: Data Distribution

xiaoxiao, 2023-06-08

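This excerpt is Section 3.3 of Chapter 3 of 《Greenplum企业应用实战》, published by 华章出版社 (Huazhang Press) and written by He Yong and Chen Xiaofeng. More chapters are available through the "华章计算机" official account on 云栖社区 (Yunqi Community).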

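3.3 Data Distribution

Greenplum has a distributed architecture. To take full advantage of it, we need to understand how data is spread across the data nodes (segments) and how data skew affects data loading, data analysis, and data export.

3.3.1 Checking How Data Is Distributed

Let's run a simple test. First, generate some test data with the generate_series and repeat functions: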

create table test_distribute_1 as
select a as id, round(random()) as flag, repeat('a',1024) as value
from generate_series(1,5000000) a;

The 5 million rows are spread across 6 segments. The following SQL shows how the rows are distributed (the query is explained later, in Section 5.8.4):

testDB=# select gp_segment_id,count(*)
testDB-# from test_distribute_1
testDB-# group by 1;
 gp_segment_id | count
---------------+--------
             1 | 833396
             4 | 833297
             5 | 833294
             3 | 833309
             2 | 833359
             0 | 833345
(6 rows)

In this query, the 1 in group by 1 refers to the first column in the select list, that is, gp_segment_id.
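As a rough check of how even this distribution is, the per-segment counts from the output above can be summarized in a few lines of Python. This is only a minimal sketch: the function name `summarize` is hypothetical and is not part of Greenplum.

```python
# Per-segment row counts copied from the query output above.
counts = {0: 833345, 1: 833396, 2: 833359, 3: 833309, 4: 833297, 5: 833294}

def summarize(counts):
    """Return (total rows, mean rows per segment, largest relative deviation from the mean)."""
    total = sum(counts.values())
    mean = total / len(counts)
    max_dev = max(abs(c - mean) for c in counts.values()) / mean
    return total, mean, max_dev

total, mean, max_dev = summarize(counts)
print(total)             # 5000000
print(max_dev < 1e-4)    # True: no segment is more than 0.01% away from the mean
```

The counts add up to the 5 million generated rows, and no segment deviates from the mean by more than about 0.01%.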

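3.3.2 Effect on Data Loading Speed

Next, we run an experiment to measure how fast data loads with different distribution keys.

(1) Loading data when the distribution is skewed

1) Prepare the test data by exporting it: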

testDB=# copy test_distribute_1 to '/home/gpadmin/data/test_distribute.dat' with delimiter '|';
COPY 5000000

2) Create the test table with the flag column as the distribution key:

testDB=# create table test_distribute_2 as select * from test_distribute_1 limit 0 distributed by(flag);
SELECT 0

3) Load the data:

$ time psql -h localhost -d testDB -c "copy test_distribute_2 from stdin with delimiter '|'" < /home/gpadmin/data/test_distribute.dat
real    14m1.381s
user    0m24.080s
sys     0m26.387s

4) Because the distribution key flag takes only the values 0 and 1, the rows can land on only two segments:

testDB=# select gp_segment_id,count(*) from test_distribute_2 group by 1;
 gp_segment_id |  count
---------------+---------
             3 | 2498434
             2 | 2501566
(2 rows)
Time: 50751.740 ms
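To illustrate why a two-valued key cannot use more than two segments, the following Python sketch mimics hash distribution: each row goes to the segment chosen by hashing its distribution key. It is a simplified model under stated assumptions, not Greenplum's implementation. `zlib.crc32` stands in for Greenplum's internal hash function, the row count is reduced to 60,000 to keep the run short, and `segment_for` and `distribution` are hypothetical helper names.

```python
import zlib
from collections import Counter

NUM_SEGMENTS = 6  # the test cluster in this section has 6 primary segments

def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode()) % num_segments

def distribution(keys):
    """Count how many rows each segment would receive for the given key values."""
    return Counter(segment_for(k) for k in keys)

rows = range(1, 60001)
by_flag = distribution(i % 2 for i in rows)  # only two distinct key values
by_id = distribution(rows)                   # every key value is distinct

print(len(by_flag))  # number of segments that received rows: never more than 2
print(len(by_id))    # number of segments that received rows with distinct keys
```

Whatever hash is used, two distinct key values can map to at most two segments, which matches the result for flag. Distinct id values give the hash enough variety to reach every segment.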

5) The data sits on segments 2 and 3, whose primary segments are on dell3 and whose mirrors are on dell4. This can be confirmed by querying gp_segment_configuration:

testDB=# select dbid,content,role,port,hostname from gp_segment_configuration
testDB-# where content in(2,3) order by role;
 dbid | content | role | port  | hostname
------+---------+------+-------+----------
   11 |       3 | m    | 60001 | dell4
   10 |       2 | m    | 60000 | dell4
    5 |       3 | p    | 50001 | dell3
    4 |       2 | p    | 50000 | dell3
(4 rows)

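While the load was running, the Greenplum Performance Monitor page showed disk and CPU activity on only two servers, dell3 and dell4, as shown in Figure 3-7.

Installing and deploying Greenplum Performance Monitor is covered in Section 10.2.1, which introduces GPmonitor.

(2) Loading data when the distribution is even

1) Create the test table with the id column as the distribution key: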

testDB=# create table test_distribute_3 as select * from test_distribute_1 limit 0 distributed by(id);
SELECT 0

2) Load the data:

$ time psql -h localhost -d testDB -c "copy test_distribute_3 from stdin with delimiter '|'" < /home/gpadmin/data/test_distribute.dat
real    9m40.607s
user    0m24.364s
sys     0m24.491s

3) Because the distribution key id takes sequential, distinct values, the rows are spread evenly across all segments:

testDB=# select gp_segment_id,count(*) from test_distribute_3 group by 1;
 gp_segment_id | count
---------------+--------
             1 | 833396
             4 | 833297
             5 | 833294
             3 | 833309
             2 | 833359
             0 | 833345
(6 rows)
Time: 17999.875 ms

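While the load was running, the Greenplum Performance Monitor page showed disk and CPU activity on every segment of all 3 servers, as shown in Figure 3-8. With evenly distributed data, more machines share the work and performance is better: this load took 9m40s, against 14m1s for the skewed table.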
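The two load times can be compared directly from the `real` values reported by `time`. The sketch below is only an illustration; `parse_real` is a hypothetical helper that handles the `NmS.SSSs` format shown in the output above.

```python
import re

def parse_real(text):
    """Convert a `time` value such as '14m1.381s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if m is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

skewed = parse_real("14m1.381s")  # load into test_distribute_2 (key: flag)
even = parse_real("9m40.607s")    # load into test_distribute_3 (key: id)

print(round(skewed, 3), round(even, 3))  # 841.381 580.607
print(round(skewed / even, 2))           # 1.45
```

In this single run, the skewed load took about 1.45 times as long as the evenly distributed one.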

3.3.3 Effect on Query Speed

(1) Querying data when the distribution is skewed

testDB=# select gp_segment_id,count(*),max(length(value)) from test_distribute_2 group by 1;
 gp_segment_id |  count  | max
---------------+---------+------
             3 | 2498434 | 1024
             2 | 2501566 | 1024
(2 rows)
Time: 79840.885 ms

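The data sits on segments 2 and 3, that is, on dell3 with the corresponding mirrors on dell4. Because a query reads only the primary segments, only dell3 shows disk activity, as shown in Figure 3-9.

(2) Querying data when the distribution is even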

testDB=# select gp_segment_id,count(*),max(length(value)) from test_distribute_3 group by 1;
 gp_segment_id | count  | max
---------------+--------+------
             1 | 833396 | 1024
             3 | 833309 | 1024
             5 | 833294 | 1024
             4 | 833297 | 1024
             2 | 833359 | 1024
             0 | 833345 | 1024
(6 rows)
Time: 6976.840 ms

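Because the data is spread across all segments, every server shows disk activity, and the query runs far faster: about 7 seconds here, against about 80 seconds on the skewed table.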
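The size of the gap can be worked out from the two query outputs above. The following sketch simply divides the reported timings and the row counts of the busiest segment in each table; the variable names are illustrative only.

```python
# Timings (ms) and per-segment row counts copied from the two query outputs above.
skewed_ms, even_ms = 79840.885, 6976.840
skewed_max_rows = max(2498434, 2501566)
even_max_rows = max(833396, 833309, 833294, 833297, 833359, 833345)

speedup = skewed_ms / even_ms                 # how much faster the even table was
row_ratio = skewed_max_rows / even_max_rows   # rows on the busiest segment, skewed vs even

print(round(speedup, 1))    # 11.4
print(round(row_ratio, 1))  # 3.0
```

In this run the evenly distributed table answered about 11 times faster, while its busiest segment held about one third as many rows. The measurements are consistent with the explanation in the text that all servers share the disk work instead of dell3 alone.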
