Recovering the Residual Data of Corrupted HDFS Files


    For certain historical reasons, a batch of corrupted files appeared on the cluster. Reading these files throws a BlockMissingException, for example:

    16/11/08 19:04:20 WARN hdfs.DFSClient: DFS Read
    org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-221196964-172.23.64.95-1477965231106:blk_1073748785_7961 file=/hadoop-2.7.3.tar.gz
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:983)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:642)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:91)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:59)
        at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:119)
        at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:466)
        at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:391)
        at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:328)
        at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:263)
        at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:248)
        at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:317)
        at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:289)
        at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:243)
        at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:271)
        at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:255)
        at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:220)
        at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:201)
        at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
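
    For reference, the failure is easy to reproduce from plain client code as well (a minimal sketch; the path and default Configuration are assumptions, any corrupted file will do):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadCorrupted {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            try (FSDataInputStream in = fs.open(new Path("/hadoop-2.7.3.tar.gz"))) {
                // Throws BlockMissingException once the read reaches the
                // missing block and DFSClient has exhausted its retries.
                IOUtils.copyBytes(in, System.out, conf, false);
            }
        }
    }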

    Yet the fsck command shows that a corrupted file has lost only one or a few blocks, while the rest of its blocks are still there. For example:

    bin/hdfs fsck /hadoop-2.7.3.tar.gz -files -blocks
    Connecting to namenode via http://localhost:50070/fsck?ugi=user&files=1&blocks=1&path=/hadoop-2.7.3.tar.gz
    FSCK started by user (auth:SIMPLE) from /127.0.0.1 for path /hadoop-2.7.3.tar.gz at Tue Nov 08 19:03:38 CST 2016
    /hadoop-2.7.3.tar.gz 214092195 bytes, 2 block(s):  Under replicated BP-221196964-172.23.64.95-1477965231106:blk_1073748784_7960. Target Replicas is 3 but found 1 replica(s).
    /hadoop-2.7.3.tar.gz: CORRUPT blockpool BP-221196964-172.23.64.95-1477965231106 block blk_1073748785
     MISSING 1 blocks of total size 79874467 B
    0. BP-221196964-172.23.64.95-1477965231106:blk_1073748784_7960 len=134217728 repl=1
    1. BP-221196964-172.23.64.95-1477965231106:blk_1073748785_7961 len=79874467 MISSING!
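
    Incidentally, corrupted files can also be enumerated cluster-wide rather than checked path by path, either with hdfs fsck / -list-corruptfileblocks or programmatically. A minimal sketch of the latter, assuming fs.defaultFS points at the cluster (the cast and the starting path "/" are my choices, not from the original setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ListCorrupt {
        public static void main(String[] args) throws Exception {
            DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
            // Iterates the paths of files that have corrupt blocks.
            RemoteIterator<Path> it = dfs.listCorruptFileBlocks(new Path("/"));
            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
    }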

    When there is no hope of getting the missing blocks back, the way to minimize the loss is to salvage the residual data. But any read of a corrupted file fails outright, so how can that be done? The answer is to hack the code. The class to patch is DFSInputStream, and the idea is simple: whenever a missing block is encountered, skip it and read as much as can still be read. Here is the diff of the hack:

    --- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
    +++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
    @@ -882,7 +882,20 @@ private synchronized int readWithStrategy(ReaderStrategy strategy, int off, int
           // currentNode can be left as null if previous read had a checksum
           // error on the same block. See HDFS-3067
           if (pos > blockEnd || currentNode == null) {
    -        currentNode = blockSeekTo(pos);
    +        while (true) {
    +          try {
    +            currentNode = blockSeekTo(pos);
    +          } catch (BlockMissingException e) {
    +            LocatedBlock targetBlock = getBlockAt(pos);
    +            DFSClient.LOG.warn("Ignore BlockMissingException, try next block " + targetBlock.getBlock());
    +            pos += targetBlock.getBlockSize();
    +            if (pos >= getFileLength()) {
    +              return -1;
    +            }
    +            continue;
    +          }
    +          break;
    +        }
           }
           int realLen = (int) Math.min(len, (blockEnd - pos + 1L));
           synchronized(infoLock) {

    After the hack, just replace the client-side hadoop-hdfs-2.7.2.jar with the rebuilt one. The file can then be downloaded, or copied to another HDFS path. Since DFSClient retries each failed block read three times, sleeping in between, the copy can be slow; to speed it up, shorten the sleep, e.g. by setting dfs.client.retry.window.base=1. The above is one way to recover the residual data of corrupted files. Of course, in HDFS data management the best policy is to keep 3 replicas; the second-best is, as soon as a corrupted file is discovered, to try to recover the block data directly from the disks; the method described in this post is only a last resort.
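
    If rebuilding the jar is not an option, a rougher variant of the same skip-the-hole idea can live entirely in application code: catch the BlockMissingException, seek past the offending block, and keep copying. The sketch below is an untested illustration, not the patch above; it assumes block boundaries fall at multiples of the file's block size (true for all but the last block), and the output file name is made up. It also applies the dfs.client.retry.window.base=1 trick to shorten the retry sleeps.

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.BlockMissingException;

    public class SalvageFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.client.retry.window.base", 1); // shorten retry sleeps
            FileSystem fs = FileSystem.get(conf);
            Path src = new Path("/hadoop-2.7.3.tar.gz");
            FileStatus st = fs.getFileStatus(src);
            long blockSize = st.getBlockSize();
            byte[] buf = new byte[64 * 1024];
            try (FSDataInputStream in = fs.open(src);
                 OutputStream out = new FileOutputStream("salvaged.bin")) {
                long pos = 0;
                while (pos < st.getLen()) {
                    try {
                        in.seek(pos);
                        int n = in.read(buf);
                        if (n < 0) {
                            break; // normal end of file
                        }
                        out.write(buf, 0, n);
                        pos += n;
                    } catch (BlockMissingException e) {
                        // Jump to the start of the next block; the bytes of
                        // the missing block are lost.
                        System.err.println("Skipping missing block at offset " + pos);
                        pos = (pos / blockSize + 1) * blockSize;
                    }
                }
            }
        }
    }

    Either way, note that the salvaged copy is shorter than the original: the bytes of each missing block are simply absent and everything after a hole shifts forward, so the result is mainly useful for formats that can resynchronize past missing ranges.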
