Several Oracle RAC nodes suddenly went down; cause: localhost kernel: end

    xiaoxiao  2022-07-12

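    Background:

    On May 22, 2019, shortly after 12:00, another vendor's calls to our application interface suddenly started failing. In addition, the menu functions of our own business system also returned errors when opened.

    The front end displayed an error message.

    Analysis of the application logs and of the data source in the middleware console showed that the node being connected to had gone down.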

    Step 1: run srvctl status database -d orcl to check the database status;

    Step 2: run srvctl start instance -d orcl -i orcl1 to try to start the instance on this node; it reported a problem with the cluster software;

    Step 3: /u01/11.2.0/grid/bin/crsctl check cluster -all showed that the clusterware was down;

    Step 4: as root, run /u01/11.2.0/grid/bin/crsctl stop crs -f to force-stop the CRS resources;

    Step 5: as root, run /u01/11.2.0/grid/bin/crsctl start crs to start the CRS resources on this node;

    Step 6: then, as the grid user, start the node 1 instance with srvctl start instance -d orcl -i orcl1. The node 1 instance came up and the business returned to normal.
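
    The six steps above can be written down as an ordered plan. The following is only a sketch: it builds and prints the commands and does not run anything. It assumes the instance is started with srvctl start instance (the usual 11.2 syntax), and the user for steps 1 to 3 is an assumption, since the text names a user only for steps 4 to 6. The names recovery_plan and render are hypothetical.

```python
from typing import Dict, List

GRID_BIN = "/u01/11.2.0/grid/bin"  # Grid home path taken from the steps above


def recovery_plan(db: str, instance: str, grid_bin: str = GRID_BIN) -> List[Dict]:
    """Return the commands of steps 1-6 in order, each with the OS user that runs it."""
    crsctl = f"{grid_bin}/crsctl"
    start_instance = ["srvctl", "start", "instance", "-d", db, "-i", instance]
    return [
        # Steps 1-3: the user is not stated in the text; "grid" is an assumption.
        {"user": "grid", "cmd": ["srvctl", "status", "database", "-d", db]},
        {"user": "grid", "cmd": list(start_instance)},
        {"user": "grid", "cmd": [crsctl, "check", "cluster", "-all"]},
        # Steps 4-5: the text says these are run as root.
        {"user": "root", "cmd": [crsctl, "stop", "crs", "-f"]},
        {"user": "root", "cmd": [crsctl, "start", "crs"]},
        # Step 6: the text says this is run as the grid user.
        {"user": "grid", "cmd": list(start_instance)},
    ]


def render(plan: List[Dict]) -> List[str]:
    """Format each step as '[user] command' for review before anything is run by hand."""
    return [f"[{step['user']}] {' '.join(step['cmd'])}" for step in plan]


if __name__ == "__main__":
    for line in render(recovery_plan("orcl", "orcl1")):
        print(line)
```

    Keeping the plan as data makes it easy to review the order and the required user before typing anything on a production node.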

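    The routine restart above brought database node 1 back to normal, but that was simply good luck. The actual cause still had to be tracked down.

    If a routine restart does not bring the node up, the logs have to be analyzed further.

    Root cause investigation:

    Step 1, check the alert log: alert_orcl1.log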

    Tue May 21 12:33:44 2019
    WARNING: Write Failed. group:1 disk:7 AU:4389 offset:901120 size:131072
    Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_dbw4_88627.trc:
    ORA-15080: synchronous I/O operation to a disk failed
    ORA-27061: waiting for async I/Os failed
    Linux-x86_64 Error: 5: Input/output error
    Additional information: -1
    Additional information: 131072
    WARNING: failed to write mirror side 1 of virtual extent 2183 logical extent 0 of file 264 in group 1 on disk 7 allocation unit 4389
    KCF: read, write or open error, block=0x443ee online=1
            file=3 '+DATA/orcl/datafile/undotbs1.264.997101273'
            error=15081 txt: ''
    Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_dbw4_88627.trc:

    The alert log points to this trace file for details.

    This only shows that a disk in the DATA disk group appears to have a problem; the specifics are not visible here.

    Step 2, check orcl1_dbw4_88627.trc

    WARNING: Write Failed. group:1 disk:5 AU:4389 offset:393216 size:131072
    path:/dev/asm-diskk  -- the corresponding disk really does have a problem: reads and writes to it fail with I/O errors.
         incarnation:0xd622c620 asynchronous result:'I/O error'
         subsys:System iop:0x7ffff4440148 bufp:0x1ddbd7f000 osderr:0x0 osderr1:0x0
    ORA-15080: synchronous I/O operation to a disk failed
    ORA-27061: waiting for async I/Os failed
    Linux-x86_64 Error: 5: Input/output error
    Additional information: -1
    Additional information: 131072
    WARNING: failed to write mirror side 1 of virtual extent 2184 logical extent 0 of file 264 in group 1 on disk 5 allocation unit 4389
    KCF: read, write or open error, block=0x44430 online=1
            file=3 '+DATA/orcl/datafile/undotbs1.264.997101273'
            error=15081 txt: ''
    Encountered write error

    The trace confirms that a disk has a problem, but it does not show what exactly is wrong with the disk; for that, the operating system log is needed.

    Step 3, check the operating system log: /var/log/messages
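
    When the alert log and trace files contain many of these warnings, it can help to extract them mechanically and see which ASM disks are affected. The following is a minimal sketch, assuming the "WARNING: Write Failed." line has the layout shown in the excerpts above (the path field appears in the trace but not in the alert log). The function name parse_write_failures is hypothetical.

```python
import re
from collections import Counter

# Layout assumed from the excerpts above; "path:" is optional because the
# alert log line does not carry it while the trace file line does.
WRITE_FAILED = re.compile(
    r"WARNING: Write Failed\. group:(?P<group>\d+) disk:(?P<disk>\d+) "
    r"AU:(?P<au>\d+) offset:(?P<offset>\d+) size:(?P<size>\d+)"
    r"(?:\s+path:(?P<path>\S+))?"
)


def parse_write_failures(text):
    """Return one dict per 'Write Failed' warning found in the text."""
    failures = []
    for match in WRITE_FAILED.finditer(text):
        record = {
            key: int(match.group(key))
            for key in ("group", "disk", "au", "offset", "size")
        }
        record["path"] = match.group("path")  # None when the line has no path
        failures.append(record)
    return failures


# The two warning lines quoted in this article.
sample = (
    "WARNING: Write Failed. group:1 disk:7 AU:4389 offset:901120 size:131072\n"
    "WARNING: Write Failed. group:1 disk:5 AU:4389 offset:393216 size:131072 "
    "path:/dev/asm-diskk\n"
)

if __name__ == "__main__":
    failures = parse_write_failures(sample)
    per_disk = Counter((f["group"], f["disk"]) for f in failures)
    for (group, disk), count in sorted(per_disk.items()):
        print(f"group {group} disk {disk}: {count} failed write(s)")
```

    Grouping by (group, disk) shows at a glance whether one disk or several are failing, which is what decides the next step at the operating system level.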

    May 21 03:28:45 localhost auditd[1926]: Audit daemon rotating log files
    May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Write(10): 2a 00 34 14 5d 00 00 00 10 00
    May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 873749760
    May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Write(10): 2a 00 34 14 5c 00 00 00 10 00
    May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 873749504  -- the proverbial bad sectors. Note that hostbyte=DID_BAD_TARGET by itself only says the SCSI target could not be reached, so a path or storage-side fault should also be ruled out.
    May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Read(10): 28 00 2d 3c f8 40 00 04 00 00
    May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 758970432
    May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Read(10): 28 00 2d 3c fc 40 00 03 c0 00
    May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 758971456
    May 21 12:33:42 localhost kernel: sd 33:0:6:0: [sdi]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    May 21 12:33:42 localhost kernel: sd 33:0:6:0: [sdi] CDB: Read(10): 28 00 2d cb 17 c0 00 00 10 00
    May 21 12:33:42 localhost kernel: end_request: I/O error, dev sdi, sector 768284608
    May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj] CDB: Read(10): 28 00 11 9d a0 60 00 00 20 00
    May 21 12:33:43 localhost kernel: end_request: I/O error, dev sdj, sector 295542880
    May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
    May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj] CDB: Write(10): 2a 00 0f 94 c0 14 00 00 02 00
    May 21 12:33:43 localhost kernel: end_request: I/O error, dev sdj, sector 261406740
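
    The kernel lines above can be cross-checked against each other. In a 10-byte Read/Write CDB, bytes 2 to 5 hold the starting logical block address (big-endian) and bytes 7 to 8 the transfer length in blocks, so the address in each CDB line should equal the sector in the end_request line that follows it. The sketch below decodes two of the CDBs from this log and counts I/O errors per device. It assumes the message layout shown above; the function names are hypothetical.

```python
import re
from collections import Counter

# Message layout assumed from the /var/log/messages excerpt above.
IO_ERROR = re.compile(r"end_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")


def decode_cdb10(hex_bytes):
    """Decode a 10-byte SCSI Read(10)/Write(10) CDB into (start LBA, block count)."""
    cdb = bytes.fromhex(hex_bytes)  # fromhex ignores the spaces between bytes
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


def errors_per_device(text):
    """Count 'end_request: I/O error' lines per block device."""
    return Counter(m.group("dev") for m in IO_ERROR.finditer(text))


# Four lines copied from the log above.
sample = (
    "May 21 12:33:38 localhost kernel: sd 33:0:6:0: [sdi] CDB: Write(10): 2a 00 34 14 5d 00 00 00 10 00\n"
    "May 21 12:33:38 localhost kernel: end_request: I/O error, dev sdi, sector 873749760\n"
    "May 21 12:33:43 localhost kernel: sd 33:0:8:0: [sdj] CDB: Read(10): 28 00 11 9d a0 60 00 00 20 00\n"
    "May 21 12:33:43 localhost kernel: end_request: I/O error, dev sdj, sector 295542880\n"
)

if __name__ == "__main__":
    print(decode_cdb10("2a 00 34 14 5d 00 00 00 10 00"))  # (873749760, 16)
    print(decode_cdb10("28 00 11 9d a0 60 00 00 20 00"))  # (295542880, 32)
    print(dict(errors_per_device(sample)))                # {'sdi': 1, 'sdj': 1}
```

    Here the decoded addresses match the reported sectors, which is what one would expect with 512-byte logical blocks. The per-device count also makes it visible that two devices, sdi and sdj, failed within a few seconds of each other.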

    Finally, I consulted the related material.

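    Having traced the problem this far, some advice: if you are responsible for hardware operations, keep monitoring in place day to day; if you are responsible for application operations, report the incident to the customer and have the customer coordinate with the hardware vendor to deal with it.

    Conclusion: after a period of monitoring, this kind of disk fault has not recurred so far; monitoring will continue.

    Shared for everyone's study and reference.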
