Simulating a corrupted block: how to locate and repair it.
1. Create a file and upload it to HDFS
[root@ruozedata001 ~]# hdfs dfs -mkdir /blockrecover
[root@ruozedata001 ~]# echo "xiaolinzi" > blocktest.md
[root@ruozedata001 ~]# hdfs dfs -put blocktest.md /blockrecover
[root@ruozedata001 ~]# hdfs dfs -ls /blockrecover
Found 2 items
-rw-r--r-- 3 root hadoop 10 2019-08-22 15:00 /blockrecover/blocktest.md
-rw-r--r-- 3 root hadoop 18 2019-08-21 10:52 /blockrecover/ruozedata.md
# Check the health of HDFS
[root@ruozedata001 subdir0]# hdfs fsck /
Connecting to namenode via http://ruozedata002:50070/fsck?ugi=root&path=%2F
FSCK started by root (auth:SIMPLE) from /172.16.128.58 for path / at Thu Aug 22 15:12:55 CST 2019
...Status: HEALTHY
Total size: 11033 B
Total dirs: 2
Total files: 3
Total symlinks: 0
Total blocks (validated): 3 (avg. block size 3677 B)
Minimally replicated blocks: 3 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Thu Aug 22 15:12:55 CST 2019 in 2 milliseconds
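When the check is repeated often, reading the whole report by eye gets tedious. The following is a minimal sketch that condenses fsck output into one line. The function name `fsck_summary` is hypothetical, and the sketch assumes the report layout shown above (a `Status:` line that may be glued to the progress dots, plus the `Corrupt blocks:` and `Under-replicated blocks:` lines).

```shell
# Sketch only: fsck_summary is an illustrative helper, not an HDFS command.
# It reads `hdfs fsck` output on stdin and prints three summary fields.
fsck_summary() {
  awk '
    # The status can follow the progress dots, e.g. "...Status: HEALTHY".
    /Status:/                  { sub(/.*Status:[ \t]*/, ""); status = $0 }
    # "Corrupt blocks: 0" -> the count is the last field.
    /Corrupt blocks:/          { corrupt = $NF }
    # "Under-replicated blocks: 1 (0.1 %)" -> the count is the third field.
    /Under-replicated blocks:/ { under = $3 }
    END { printf "status=%s corrupt=%s under_replicated=%s\n", status, corrupt, under }
  '
}
```

Usage would look like `hdfs fsck / | fsck_summary`. A different Hadoop version may format the report differently, so the patterns should be checked against the actual output first.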
2. Delete one replica of one of the file's blocks directly on a DN node
# Get the block name
[root@ruozedata001 ~]# hdfs fsck /blockrecover/blocktest.md -files -blocks
Connecting to namenode via http://ruozedata002:50070/fsck?ugi=root&files=1&blocks=1&path=%2Fblockrecover%2Fblocktest.md
FSCK started by root (auth:SIMPLE) from /172.16.128.58 for path /blockrecover/blocktest.md at Thu Aug 22 15:00:59 CST 2019
/blockrecover/blocktest.md 10 bytes, 1 block(s): OK
0. BP-1856248125-172.16.128.58-1566189843078:blk_1073741827_1003 len=10 Live_repl=3
Status: HEALTHY
Total size: 10 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 10 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Thu Aug 22 15:00:59 CST 2019 in 6 milliseconds
The filesystem under path '/blockrecover/blocktest.md' is HEALTHY
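The block ID in the report (`blk_1073741827_1003`) maps to two files on the DataNode disk: the block file, named without the trailing generation stamp, and the `.meta` file, named with it. A small sketch of that mapping follows; `fsck_block_ids` and `block_file_names` are hypothetical helper names, and the input is assumed to be `hdfs fsck <path> -files -blocks` output as shown above.

```shell
# Sketch only: both functions are illustrative helpers.

# Extract every block ID of the form blk_<id>_<generation stamp> from stdin.
fsck_block_ids() {
  grep -oE 'blk_[0-9]+_[0-9]+' | sort -u
}

# For each block ID on stdin, print the block file name and the meta file name.
# The block file drops the generation stamp; the meta file keeps it.
block_file_names() {
  while read -r id; do
    printf '%s %s.meta\n' "${id%_*}" "$id"
  done
}
```

For example, `hdfs fsck /blockrecover/blocktest.md -files -blocks | fsck_block_ids | block_file_names` would list the two file names to look for under the DataNode data directory.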
# Find where the block is stored
[root@ruozedata001 ~]# find ./ -name "*blk_1073741827_1003*"
./data/dfs/data/current/BP-1856248125-172.16.128.58-1566189843078/current/finalized/subdir0/subdir0/blk_1073741827_1003.meta
# Enter the directory
[root@ruozedata001 ~]# cd ./data/dfs/data/current/BP-1856248125-172.16.128.58-1566189843078/current/finalized/subdir0/subdir0
# List the files
[root@ruozedata001 subdir0]# ll
total 32
-rw-r--r-- 1 root root 11005 Aug 21 10:04 blk_1073741825
-rw-r--r-- 1 root root 95 Aug 21 10:04 blk_1073741825_1001.meta
-rw-r--r-- 1 root root 18 Aug 21 10:52 blk_1073741826
-rw-r--r-- 1 root root 11 Aug 21 10:52 blk_1073741826_1002.meta
-rw-r--r-- 1 root root 10 Aug 22 14:59 blk_1073741827
-rw-r--r-- 1 root root 11 Aug 22 14:59 blk_1073741827_1003.meta
# Delete the block file and the meta file of blk_1073741827_1003 (this removes one replica)
[root@ruozedata001 subdir0]# rm -rf blk_1073741827 blk_1073741827_1003.meta
[root@ruozedata001 subdir0]# ll
total 24
-rw-r--r-- 1 root root 11005 Aug 21 10:04 blk_1073741825
-rw-r--r-- 1 root root 95 Aug 21 10:04 blk_1073741825_1001.meta
-rw-r--r-- 1 root root 18 Aug 21 10:52 blk_1073741826
-rw-r--r-- 1 root root 11 Aug 21 10:52 blk_1073741826_1002.meta
# Restart HDFS so that the damage takes effect immediately, then check with fsck:
-bash-4.2$ hdfs fsck /
Connecting to namenode via http://ruozedata002:50070/fsck?ugi=root&path=%2F
FSCK started by hdfs (auth:SIMPLE) from /172.16.128.58 for path / at Thu Aug 22 15:15:55 CST 2019
.
/blockrecover/blocktest.md: Under replicated BP-1856248125-172.16.128.58-1566189843078:blk_1073741827_1003. Target Replicas is 3 but found 2 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
...............................................................................Status: HEALTHY
Total size: 50194618424 B
Total dirs: 354
Total files: 1079
Total symlinks: 0
Total blocks (validated): 992 (avg. block size 50599413 B)
Minimally replicated blocks: 992 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 1 (0.10080645 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.998992
Corrupt blocks: 0
Missing replicas: 1 (0.033602152 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Sun Mar 03 16:02:04 CST 2019 in 148 milliseconds
The filesystem under path '/' is HEALTHY
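On a cluster with many files, the affected paths are easy to miss among the progress dots. As a rough sketch, the paths flagged `Under replicated` can be pulled out of the report like this. The function name `under_replicated_paths` is hypothetical, and the sketch assumes each warning sits on a single, unwrapped line that starts with the HDFS path.

```shell
# Sketch only: under_replicated_paths is an illustrative helper.
# It reads `hdfs fsck` output on stdin and prints each path that has
# at least one under-replicated block, once per path.
under_replicated_paths() {
  # Keep the text before ": Under replicated" and drop everything else.
  sed -n 's/^\(.*\): *Under replicated .*$/\1/p' | sort -u
}
```

Usage would look like `hdfs fsck / | under_replicated_paths`.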
3. Manual repair
# Before the repair
[root@ruozedata001 subdir0]# ll
total 24
-rw-r--r-- 1 root root 11005 Aug 22 15:17 blk_1073741825
-rw-r--r-- 1 root root 6 Aug 22 15:23 blk_1073741825_1001.meta
-rw-r--r-- 1 root root 18 Aug 22 15:17 blk_1073741826
-rw-r--r-- 1 root root 11 Aug 22 15:17 blk_1073741826_1002.meta
# Repair the data manually with the hdfs debug command
[root@ruozedata001 subdir0]# hdfs debug recoverLease -path /blockrecover/blocktest.md -retries 10
recoverLease SUCCEEDED on /blockrecover/blocktest.md
# After the repair
[root@ruozedata001 subdir0]# ll
total 32
-rw-r--r-- 1 root root 11005 Aug 22 15:17 blk_1073741825
-rw-r--r-- 1 root root 95 Aug 22 15:23 blk_1073741825_1001.meta
-rw-r--r-- 1 root root 18 Aug 22 15:17 blk_1073741826
-rw-r--r-- 1 root root 11 Aug 22 15:17 blk_1073741826_1002.meta
-rw-r--r-- 1 root root 10 Aug 22 15:25 blk_1073741827
-rw-r--r-- 1 root root 11 Aug 22 15:25 blk_1073741827_1003.meta
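4. Automatic repair by HDFS
After a block is damaged, the DN does not notice the damage until it runs its next directoryscan;
the directoryscan runs every 6 hours by default (the value is in seconds):
dfs.datanode.directoryscan.interval : 21600
The block is not restored until the DN sends its next blockreport to the NN;
the blockreport is also sent every 6 hours by default (the value is in milliseconds):
dfs.blockreport.intervalMsec : 21600000
The NN starts the recovery only after it has received the blockreport.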
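The two settings use different units, which is easy to misread. The arithmetic can be checked with a short sketch; the function names `seconds_to_hours` and `millis_to_hours` are hypothetical.

```shell
# Sketch only: illustrative helpers for converting the two interval settings.

# dfs.datanode.directoryscan.interval is given in seconds.
seconds_to_hours() {
  echo $(( $1 / 3600 ))
}

# dfs.blockreport.intervalMsec is given in milliseconds.
millis_to_hours() {
  echo $(( $1 / 1000 / 3600 ))
}

# seconds_to_hours 21600      prints 6
# millis_to_hours 21600000    prints 6
```

The values in effect on a given cluster can be read with `hdfs getconf -confKey dfs.blockreport.intervalMsec` (and likewise for the other key) and passed to these helpers.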
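Summary
In production I generally prefer the manual repair, but it requires deleting the damaged block by hand first.
Remember: delete the damaged block file and its meta file on the DN, not the HDFS file.
Alternatively, download the file with get, delete it from HDFS, and upload it again.
Never clean up with hdfs fsck / -delete: it deletes the corrupted files themselves, so the data is lost. Use it only if losing the data does not matter, or if you are confident that the data can be reloaded into HDFS from another source.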
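The download, delete, and re-upload alternative can be sketched as a dry run that only prints the commands, so they can be reviewed before anything is removed. The function name `reupload_plan` and the local path in the usage example are hypothetical. The download step can only succeed while at least one healthy replica of every block still exists.

```shell
# Sketch only: reupload_plan is an illustrative helper that prints commands
# and executes nothing.
#   $1 = HDFS path of the affected file
#   $2 = local path for the temporary copy
reupload_plan() {
  # 1. Download the file while a healthy replica is still readable.
  printf 'hdfs dfs -get %s %s\n' "$1" "$2"
  # 2. Delete the HDFS file (only after verifying the local copy).
  printf 'hdfs dfs -rm %s\n' "$1"
  # 3. Upload the local copy to the original HDFS path.
  printf 'hdfs dfs -put %s %s\n' "$2" "$1"
}
```

For example, `reupload_plan /blockrecover/blocktest.md /tmp/blocktest.md` prints the three commands in order. Running them one at a time, and checking the local copy after the first step, keeps the risk of data loss low.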