PostgreSQL on ECS多云盘的部署、快照备份和恢复

    xiaoxiao2022-06-22  193

    标签

    PostgreSQL , ECS , 云盘 , 快照 , 一致性备份 , 时间点恢复 , zfs , lvm , raid , 并行计算


    背景

    随着阿里云云盘的发展,云盘的性能已经越来越好了。IOPS可以做到十万以上,读写吞吐也超过1GB/s了。相信随着网络的发展,SSD云盘IOPS突破40万,读写吞吐突破4GB/s也不远了。

    不过这里的IOPS和吞吐是指并发的IO能力,单次IO的延迟与本地还是不能比(将来RDMA网络也许能解决这个问题)。

    PostgreSQL 如何解决SSD云盘IO延迟高缺陷

    某些业务对数据库的(要求响应时间很快的写小事务)对单次IO延迟比较敏感,不过PostgreSQL有方法可以解决这个小缺陷。

    1、后台IO,(write syscall)

    PostgreSQL 的大多数IO为后台IO(bgwriter, backend writer),所以刷shared buffer对IO延迟不敏感。

    2、异步提交

    事务提交时,写WAL异步提交。不会造成数据不一致,但是当数据库CRASH,可能丢失在wal buffer中未提交的事务(最多10毫秒)。

    这种方法是最有效的。

    3、组提交

    组提交,解决WAL写瓶颈,将多个同时提交的事务的WAL fsync动作合并为单次,从而减少FSYNC次数,提高高并发时的写小事务的TPS。

    PostgreSQL 如何充分利用并发IO的能力

    由于我们看到的云盘IOPS和读写吞吐指标是并发指标,数据库如何利用好这么好的指标呢?对高并发小事务不是问题,肯定是能将它用起来的,但是对于低并发,长事务(分析型业务),如何利用云盘的IOPS能力和读写吞吐能力呢?

    PostgreSQL的并行计算特性可以充分利用云盘的并发IOPS和读写带宽。

    多云盘卷组

    单块云盘的IOPS能力有上限,容量有上限,读写带宽也有上限。好在ECS支持多块云盘,目前已支持一台ECS挂载16块云盘。

    通过多块云盘,组卷条带后,提高读写带宽。以Linux RHEL/CentOS 7.x为例,组卷和条带方法:

    1、逻辑卷

    yum install lvm2

    2、软RAID

    yum install -y mdadm

    3、ZFS

    http://zfsonlinux.org/

    以centos 7.3为例 wget http://download.zfsonlinux.org/epel/zfs-release.el7_3.noarch.rpm rpm -ivh zfs-release.el7_3.noarch.rpm yum install -y zfs

    逻辑卷例子

    假设环境中有16块SSD云盘

    1、创建PV

    pvcreate /dev/vd[b-q]

    2、创建VG

    vgcreate -A y -s 128M vgdata01 /dev/vd[b-q] -s, --physicalextentsize PhysicalExtentSize[bBsSkKmMgGtTpPeE] Sets the physical extent size on physical volumes of this volume group. A size suffix (k for kilobytes up to t for terabytes) is optional, megabytes is the default if no suffix is present. For LVM2 format, the value must be a power of 2 of at least 1 sector (where the sector size is the largest sector size of the PVs currently used in the VG) or, if not a power of 2, at least 128KiB. For the older LVM1 format, it must be a power of 2 of at least 8KiB. The default is 4 MiB. Once this value has been set, it is difficult to change it without recreating the volume group which would involve backing up and restoring data on any logical volumes. However, if no extents need moving for the new value to apply, it can be altered using vgchange -s. If the volume group metadata uses lvm1 format, extents can vary in size from 8KiB to 16GiB and there is a limit of 65534 extents in each logical volume. The default of 4 MiB leads to a maximum logical volume size of around 256GiB. If the volume group metadata uses lvm2 format those restrictions do not apply, but having a large number of extents will slow down the tools but have no impact on I/O performance to the logical volume. The smallest PE is 1KiB The 2.4 kernel has a limitation of 2TiB per block device.

    3、创建LV,设置条带

    16块盘,每个条带单位为8KB。

    lvcreate -A y -i 16 -I 8 -l 100%VG -n lv01 vgdata01 -i|--stripes Stripes Gives the number of stripes. This is equal to the number of physical volumes to scatter the logical volume data. When creating a RAID 4/5/6 logical volume, the extra devices which are necessary for parity are internally accounted for. Specifying -i 3 would cause 3 devices for striped and RAID 0 logical volumes, 4 devices for RAID 4/5, 5 devices for RAID 6 and 6 devices for RAID 10. Alternatively, RAID 0 will stripe across 2 devices, RAID 4/5 across 3 PVs, RAID 6 across 5 PVs and RAID 10 across 4 PVs in the volume group if the -i argument is omitted. In order to stripe across all PVs of the VG if the -i argument is omitted, set raid_stripe_all_devices=1 in the allocation section of lvm.conf (5) or add --config allocation/raid_stripe_all_devices=1 to the command. Note the current maxima for stripes depend on the created RAID type. For raid10, the maximum of stripes is 32, for raid0, it is 64, for raid4/5, it is 63 and for raid6 it is 62. See the --nosync option to optionally avoid initial syncrhonization of RaidLVs. Two implementations of basic striping are available in the kernel. The original device-mapper implementation is the default and should normally be used. The alternative implementation using MD, available since ver‐ sion 1.7 of the RAID device-mapper kernel target (kernel version 4.2) is provided to facilitate the development of new RAID features. It may be accessed with --type raid0[_meta], but is best avoided at present because of assorted restrictions on resizing and converting such devices. -I|--stripesize StripeSize Gives the number of kilobytes for the granularity of the stripes. StripeSize must be 2^n (n = 2 to 9) for metadata in LVM1 format. For metadata in LVM2 format, the stripe size may be a larger power of 2 but must not exceed the physical extent size.

    4、创建文件系统,设置条带

    当数据库数据块=32K时,chunk大小(单块盘读写32KB,再写下一块)。 条带大小512KB(32KB*16)。 mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=8,stripe_width=128 -b 4096 -T largefile -L lv01 或 当数据库数据块=8K时,chunk大小(单块盘读写8KB,再写下一块)。 条带大小128KB(8KB*16)。 mkfs.ext4 /dev/mapper/vgdata01-lv01 -m 0 -O extent,uninit_bg -E lazy_itable_init=1,stride=2,stripe_width=32 -b 4096 -T largefile -L lv01 -b block-size Specify the size of blocks in bytes. Valid block-size values are 1024, 2048 and 4096 bytes per block. If omitted, block-size is heuristically determined by the filesystem size and the expected usage of the filesys‐ tem (see the -T option). If block-size is preceded by a negative sign ('-'), then mke2fs will use heuristics to determine the appropriate block size, with the constraint that the block size will be at least block- size bytes. This is useful for certain hardware devices which require that the blocksize be a multiple of 2k. stride=stride-size 单位为blocks Configure the filesystem for a RAID array with stride-size filesystem blocks. This is the number of blocks read or written to disk before moving to the next disk, which is sometimes referred to as the chunk size. This mostly affects placement of filesystem metadata like bitmaps at mke2fs time to avoid placing them on a single disk, which can hurt performance. It may also be used by the block allocator. stripe_width=stripe-width 单位为blocks Configure the filesystem for a RAID array with stripe-width filesystem blocks per stripe. This is typically stride-size * N, where N is the number of data-bearing disks in the RAID (e.g. for RAID 5 there is one parity disk, so N will be the number of disks in the array minus 1). This allows the block allocator to prevent read-modify-write of the parity in a RAID stripe if possible when the data is written. lazy_itable_init[= <0 to disable, 1 to enable>] If enabled and the uninit_bg feature is enabled, the inode table will not be fully initialized by mke2fs. This speeds up filesystem initialization noticeably, but it requires the kernel to finish initial‐ izing the filesystem in the background when the filesystem is first mounted. If the option value is omitted, it defaults to 1 to enable lazy inode table zeroing.

    5、MOUNT文件系统

    如果不使用云盘镜像备份,使用这种挂载模式。 mount -o defaults,noatime,nodiratime,nodelalloc,barrier=0,data=writeback LABEL=lv01 /data01 或 如果使用云盘镜像备份,使用这种挂载模式。 mount -o defaults,noatime,nodiratime,nodelalloc,barrier=1,data=ordered LABEL=lv01 /data01

    6、FIO测试

    yum install -y fio fio -filename=/data01/testdir -direct=1 -thread -rw=write -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_write.log 2>&1 fio -filename=/data01/testdir -direct=1 -thread -rw=read -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_read.log 2>&1 fio -filename=/data01/testdir -direct=1 -thread -rw=randwrite -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_randwrite.log 2>&1 fio -filename=/data01/testdir -direct=1 -thread -rw=randread -ioengine=libaio -bs=8K -size=16G -numjobs=128 -runtime=60 -group_reporting -name=mytest >/tmp/fio_randread.log 2>&1 顺序 READ: io=72621MB, aggrb=1210.3MB/s, minb=1210.3MB/s, maxb=1210.3MB/s, mint=60003msec, maxt=60003msec WRITE: io=36845MB, aggrb=628743KB/s, minb=628743KB/s, maxb=628743KB/s, mint=60007msec, maxt=60007msec 随机 READ: io=53390MB, aggrb=911160KB/s, minb=911160KB/s, maxb=911160KB/s, mint=60002msec, maxt=60002msec WRITE: io=26078MB, aggrb=445032KB/s, minb=445032KB/s, maxb=445032KB/s, mint=60004msec, maxt=60004msec

    数据库部署

    1、安装软件

    《PostgreSQL on Linux 最佳部署手册》

    2、初始化数据库

    initdb -E SQL_ASCII -U postgres --locale=C -D $PGDATA

    3、配置postgresql.conf

    port = 1921 max_connections = 1000 superuser_reserved_connections = 13 unix_socket_directories = '.' shared_buffers = 64GB maintenance_work_mem = 1GB dynamic_shared_memory_type = posix vacuum_cost_delay = 0 bgwriter_delay = 10ms bgwriter_lru_maxpages = 500 bgwriter_lru_multiplier = 5.0 bgwriter_flush_after = 0 effective_io_concurrency = 16 max_worker_processes = 128 max_parallel_workers_per_gather = 64 max_parallel_workers = 128 backend_flush_after = 0 wal_level = minimal synchronous_commit = off wal_sync_method = fsync full_page_writes = on wal_buffers = 1GB wal_writer_delay = 10ms wal_writer_flush_after = 0 checkpoint_timeout = 35min max_wal_size = 128GB min_wal_size = 8GB checkpoint_completion_target = 0.5 checkpoint_flush_after = 0 checkpoint_warning = 30s max_wal_senders = 0 random_page_cost = 1.2 parallel_tuple_cost = 0 parallel_setup_cost = 0 min_parallel_table_scan_size = 0 min_parallel_index_scan_size = 0 effective_cache_size = 200GB log_destination = 'csvlog' logging_collector = on log_directory = 'log' log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' log_file_mode = 0600 log_truncate_on_rotation = on log_checkpoints = on log_connections = on log_disconnections = on log_error_verbosity = verbose log_timezone = 'PRC' log_autovacuum_min_duration = 0 autovacuum_max_workers = 8 autovacuum_vacuum_cost_delay = 0 autovacuum_vacuum_cost_limit = 0 datestyle = 'iso, mdy' timezone = 'PRC' lc_messages = 'C' lc_monetary = 'C' lc_numeric = 'C' lc_time = 'C' default_text_search_config = 'pg_catalog.english' max_locks_per_transaction = 6400 pg_ctl start

    4、配置归档,归档目录不能放在数据盘的云盘里面,你可以放到其他云盘,或者放到OSS。

    这里为了演示方便,放在本地云盘。(保证与数据云盘不是同一云盘)

    mkdir /archivedir chown digoal:digoal /archivedir vi postgresql.conf wal_level = replica archive_mode = on archive_command = 'test ! -f /archivedir/%f && cp %p /archivedir/%f' pg_ctl restart -m fast

    5、压测

    5.1、写入大吞吐压测

    初始化

    -- 建表 do language plpgsql $$ declare begin for i in 0..65535 loop execute 'create table IF NOT EXISTS test_'||i||'(id int, info text, crt_time timestamp) with (autovacuum_enabled=off, toast.autovacuum_enabled=off)'; end loop; end; $$; -- 删表 do language plpgsql $$ declare begin for i in 0..65535 loop execute 'drop table IF EXISTS test_'||i; end loop; end; $$; -- 写表 create or replace function batch_ins(int) returns void as $$ declare begin execute 'insert into test_'||$1||' select generate_series(1,10000), md5(random()::text), now()'; end; $$ language plpgsql strict;

    大吞吐写入测试,写入速度直逼200万行/s

    vi test.sql \set id random(0,65535) select batch_ins(:id); pgbench -M prepared -n -r -P 1 -f ./test.sql -c 32 -j 32 -T 120 transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 32 number of threads: 32 duration: 120 s number of transactions actually processed: 20801 latency average = 184.678 ms latency stddev = 57.522 ms tps = 173.218345 (including connections establishing) tps = 173.233151 (excluding connections establishing) script statistics: - statement latencies in milliseconds: 0.003 \set id random(0,65535) 184.684 select batch_ins(:id);

    实际为写WAL的瓶颈,采用UNLOGGED TABLE,写入速度直逼1000万行/s。

    transaction type: ./test.sql scaling factor: 1 query mode: prepared number of clients: 32 number of threads: 32 duration: 120 s number of transactions actually processed: 117367 latency average = 32.731 ms latency stddev = 23.047 ms tps = 977.228590 (including connections establishing) tps = 977.331826 (excluding connections establishing) script statistics: - statement latencies in milliseconds: 0.002 \set id random(0,65535) 32.727 select batch_ins(:id);

    5.2、10亿数据量,OLTP压测

    pgbench -i -s 10000 pgbench -M prepared -n -r -P 1 -c 64 -j 64 -T 120 transaction type: <builtin: TPC-B (sort of)> scaling factor: 10000 query mode: prepared number of clients: 64 number of threads: 64 duration: 120 s number of transactions actually processed: 580535 latency average = 13.229 ms latency stddev = 30.073 ms tps = 4829.300799 (including connections establishing) tps = 4829.989470 (excluding connections establishing) script statistics: - statement latencies in milliseconds: 0.003 \set aid random(1, 100000 * :scale) 0.001 \set bid random(1, 1 * :scale) 0.001 \set tid random(1, 10 * :scale) 0.001 \set delta random(-5000, 5000) 0.097 BEGIN; 5.650 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; 0.153 SELECT abalance FROM pgbench_accounts WHERE aid = :aid; 3.122 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; 1.631 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; 1.418 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); 1.148 END;

    5.3、并行读压测

    postgres=# \dt+ e List of relations Schema | Name | Type | Owner | Size | Description --------+------+-------+----------+-------+------------- public | e | table | postgres | 24 GB | (1 row) -- 32个并行读,24G数据处理耗时0.3秒。 postgres=# alter table e set (parallel_workers =32); postgres=# set max_parallel_workers_per_gather =32; postgres=# set min_parallel_table_scan_size =0; postgres=# set min_parallel_index_scan_size =0; postgres=# set parallel_setup_cost =0; postgres=# set parallel_tuple_cost =0; postgres=# set force_parallel_mode =on; postgres=# select count(*) from e; count ---------- 28766181 (1 row) Time: 993.876 ms postgres=# explain (analyze,verbose,timing,costs,buffers) select count(*) from e; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------- Finalize Aggregate (cost=178479.59..178479.60 rows=1 width=8) (actual time=305.796..305.796 rows=1 loops=1) Output: count(*) Buffers: shared hit=27422 -> Gather (cost=178479.50..178479.51 rows=32 width=8) (actual time=304.868..305.784 rows=33 loops=1) Output: (PARTIAL count(*)) Workers Planned: 32 Workers Launched: 32 Buffers: shared hit=27422 -> Partial Aggregate (cost=178479.50..178479.51 rows=1 width=8) (actual time=291.483..291.483 rows=1 loops=33) Output: PARTIAL count(*) Buffers: shared hit=23934 Worker 0: actual time=281.287..281.287 rows=1 loops=1 Buffers: shared hit=718 .......... Worker 30: actual time=299.304..299.304 rows=1 loops=1 Buffers: shared hit=438 Worker 31: actual time=300.227..300.227 rows=1 loops=1 Buffers: shared hit=460 -> Parallel Index Only Scan using uk_e on public.e (cost=0.44..176232.78 rows=898689 width=0) (actual time=0.085..178.852 rows=871702 loops=33) Heap Fetches: 0 Buffers: shared hit=23934 Worker 0: actual time=0.078..173.121 rows=1014806 loops=1 Buffers: shared hit=718 .......... Worker 30: actual time=0.089..184.251 rows=635007 loops=1 Buffers: shared hit=438 Worker 31: actual time=0.070..184.000 rows=649729 loops=1 Buffers: shared hit=460 Planning time: 0.092 ms Execution time: 314.527 ms (144 rows)

    基于云盘镜像的备份

    备份过程

    1、开始备份

    select pg_start_backup('xx');

    2、打云盘快照

    调用API或在阿里云ECS控制台执行。

    3、压测

    打快照的过程中,产生一些数据库变更。

    使用前面的例子,大吞吐写入测试(使用logged table)、OLTP测试、夹杂建表、删表。

    4、云盘快照结束

    5、停止压测

    6、结束备份

    select pg_stop_backup();

    7、切换日志

    checkpoint; select pg_switch_wal(); checkpoint; select pg_switch_wal();

    8、检查归档已正常

    确保打快照过程中产生的WAL全部归档成功。

    9、生成每个测试表的CHECKSUM

    do language plpgsql $$ declare n name; res int8; begin for n in select tablename from pg_tables where schemaname='public' loop execute 'select sum(hashtext(t.*::text)) from '||n||' t' into res; raise notice 'checksum %: %', n, res; end loop; end; $$; NOTICE: checksum pgbench_history: -422046586146 NOTICE: checksum pgbench_tellers: 215006661683 NOTICE: checksum pgbench_branches: 158568443210 ......

    基于云盘镜像的恢复

    实际上PostgreSQL本身已支持增量备份、块级增量备份的功能,但是既然有云盘快照,可以只备归档和云盘镜像,数据少走一次网络开销。

    基于PG本身的备份也简单介绍一下,见本文参考部分。

    恢复过程

    1、创建新ECS(可以根据镜像创建)

    2、如果根据数据库ECS镜像创建,则不需要这一步。

    部署postgresql软件,注意需要与主库软件、插件等一致。

    3、根据云盘快照创建云盘

    4、复原逻辑卷

    pvscan vgscan lvscan

    5、检查文件系统

    fsck.ext4 -y /dev/mapper/vgdata01-lv01

    6、加载文件系统

    mount -o defaults,noatime,nodiratime,nodelalloc,barrier=1,data=ordered LABEL=lv01 /data01

    7、清理WAL文件,(因为是多盘快照,快照不可能在同一时间点,因此WAL文件可能出现partial write。wal文件是恢复数据文件的关键,因此必须完整。所以我们从归档来获取WAL。)。

    rm -f $PGDATA/pg_wal/*

    注意,如果是单盘,可以只删除最后一个WAL。

    8、配置恢复

    cp $PGHOME/share/recovery.conf.sample $PGDATA/recovery.conf vi $PGDATA/recovery.conf # (auto copy archive to $PGDATA/pg_wal) restore_command = 'cp /archivedir/%f %p' # e.g. 'cp /mnt/server/archivedir/%f %p' standby_mode = on

    9、启动数据库

    pg_ctl start

    10、等待恢复完成,恢复完成即恢复到最后一个已归档的WAL文件。

    11、激活数据库

    pg_ctl promote

    12、检查CHECKSUM

    do language plpgsql $$ declare n name; res int8; begin for n in select tablename from pg_tables where schemaname='public' loop execute 'select sum(hashtext(t.*::text)) from '||n||' t' into res; raise notice 'checksum %: %', n, res; end loop; end; $$;

    13、检查所有数据块是否正常。

    set vacuum_freeze_min_age =0; vacuum freeze;

    小结

    1、PostgreSQL 具备WAL和FULL PAGE WRITE,可以实现数据文件的不一致恢复,也就是说备份时的数据文件partial write可以被检查点后的wal full page write PAGE修复。最终实现一致性。

    2、多块云盘并行备份,提高了数据备份的速度。

    3、多块云盘,提升了整体的读写IOPS和读写带宽。

    4、PostgreSQL 通过组提交、异步提交、数据文件异步写,解决了云盘单次IO延迟较本地延迟更高一点的性能问题。也就是说对IO延迟不敏感。

    5、PostgreSQL 利用多核并行,可以在单条SQL中,充分利用多块云盘带来的高IOPS和高读写带宽的能力。

    6、通过逻辑卷、软RADI、ZFS等手段,可以充分利用多云盘的能力。

    参考

    1、《PostgreSQL on Linux 最佳部署手册》

    2、zfs

    《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)验证 - recovery test script for zfs snapshot clone + postgresql stream replication + archive》

    《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)双机HA与块级备份部署》

    《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)单个数据库采用多个zfs卷(如表空间)时如何一致性备份》

    《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)备份集自动校验》

    《PostgreSQL 最佳实践 - 块级增量备份(ZFS篇)方案与实战》

    3、man lvm

    4、man mdadm

    5、备份原理

    《PostgreSQL 最佳实践 - 块级别增量备份(pg_rman baseon LSN)源码浅析与使用》

    6、时间点恢复原理与最佳实践

    《PostgreSQL 最佳实践 - 任意时间点恢复源码分析》

    《PostgreSQL 最佳实践 - 在线增量备份与任意时间点恢复》

    7、https://github.com/pgbackrest/pgbackrest

    相关资源:七夕情人节表白HTML源码(两款)

    最新回复(0)