Inter-cluster data copying
- Use the distcp command to recursively copy data between two Hadoop clusters.
- hadoop distcp hdfs://cmaster0:8020/user/hadoop/hello.txt hdfs://hadoop102:9000/user/hadoop/hello.txt
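Beyond a plain copy, DistCp supports incremental and attribute-preserving copies. A hedged sketch using standard DistCp flags; the host names and the `data` directory below are placeholders following the example above, not paths from this cluster:

```shell
# Recursively copy a directory between clusters.
# -update skips files that already exist at the destination with the
#         same size and checksum (incremental copy)
# -p      preserves replication, block size, and permissions
hadoop distcp -update -p \
  hdfs://cmaster0:8020/user/hadoop/data \
  hdfs://hadoop102:9000/user/hadoop/data
```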
Hadoop archives
- Every file is stored in blocks, and each block's metadata is kept in the namenode's memory, so storing many small files in Hadoop is very inefficient: a large number of small files will consume most of the namenode's memory. Note, however, that the disk space needed to store small files is no more than the space needed for their raw contents. For example, a 1MB file stored with a 128MB block size uses 1MB of disk space, not 128MB.
- A Hadoop archive (HAR file) is a more efficient file-archiving tool: it packs files into HDFS blocks, reducing namenode memory usage while still allowing transparent access to the files. In particular, a Hadoop archive can be used directly as MapReduce input.
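Because a HAR is exposed through the har:// filesystem, its contents can be passed to a job as an ordinary input path. A sketch assuming the stock Hadoop examples jar (the jar path and output directory are illustrative and vary by installation):

```shell
# Run the bundled wordcount example over the archived files; the har://
# scheme makes the archive contents look like regular input files.
# Adjust HADOOP_HOME and the examples jar version for your installation.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount har:///user/hadoop/out/test.har /user/hadoop/wc-out
```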
Creating an archive
- The files are archived into a directory named test.har that contains the archive's data files. Treat the test.har directory as a single unit: the directory as a whole is the archive.
[hadoop@cmaster0 hadoop]$ hadoop fs -ls -R /user/hadoop/test
-rw-r--r-- 3 hadoop supergroup 101 2018-12-27 01:17 /user/hadoop/test/NOTICE.txt
-rw-r--r-- 3 hadoop supergroup 49 2018-12-27 01:03 /user/hadoop/test/aa.input
-rw-r--r-- 3 hadoop supergroup 23 2018-12-27 04:59 /user/hadoop/test/bb.txt
-rw-r--r-- 3 hadoop supergroup 33 2018-12-22 05:03 /user/hadoop/test/tt.input
-rw-r--r-- 3 hadoop supergroup 46 2018-12-22 05:03 /user/hadoop/test/wc.input
[hadoop@cslave0 subdir0]$ hadoop fs -ls -R /user/hadoop/out
[hadoop@cmaster0 hadoop]$ hadoop archive -archiveName test.har -p /user/hadoop/test /user/hadoop/out
Inspecting the archive
[hadoop@cmaster0 hadoop]$ hadoop fs -ls -R /user/hadoop/out/test.har
-rw-r--r-- 3 hadoop supergroup 0 2018-12-27 13:34 /user/hadoop/out/test.har/_SUCCESS
-rw-r--r-- 5 hadoop supergroup 433 2018-12-27 13:34 /user/hadoop/out/test.har/_index
-rw-r--r-- 5 hadoop supergroup 23 2018-12-27 13:34 /user/hadoop/out/test.har/_masterindex
-rw-r--r-- 3 hadoop supergroup 252 2018-12-27 13:34 /user/hadoop/out/test.har/part-0
[hadoop@cmaster0 hadoop]$ hadoop fs -ls -R har:///user/hadoop/out/test.har
-rw-r--r-- 3 hadoop supergroup 101 2018-12-27 01:17 har:///user/hadoop/out/test.har/NOTICE.txt
-rw-r--r-- 3 hadoop supergroup 49 2018-12-27 01:03 har:///user/hadoop/out/test.har/aa.input
-rw-r--r-- 3 hadoop supergroup 23 2018-12-27 04:59 har:///user/hadoop/out/test.har/bb.txt
-rw-r--r-- 3 hadoop supergroup 33 2018-12-22 05:03 har:///user/hadoop/out/test.har/tt.input
-rw-r--r-- 3 hadoop supergroup 46 2018-12-22 05:03 har:///user/hadoop/out/test.har/wc.input
Un-archiving files
[hadoop@cmaster0 hadoop]$ hadoop fs -ls -R /user/hadoop/output
[hadoop@cmaster0 hadoop]$ hadoop fs -cp har:///user/hadoop/out/test.har/* /user/hadoop/output
[hadoop@cmaster0 hadoop]$ hadoop fs -ls -R /user/hadoop/output
-rw-r--r-- 3 hadoop supergroup 101 2018-12-27 13:44 /user/hadoop/output/NOTICE.txt
-rw-r--r-- 3 hadoop supergroup 49 2018-12-27 13:44 /user/hadoop/output/aa.input
-rw-r--r-- 3 hadoop supergroup 23 2018-12-27 13:44 /user/hadoop/output/bb.txt
-rw-r--r-- 3 hadoop supergroup 33 2018-12-27 13:44 /user/hadoop/output/tt.input
-rw-r--r-- 3 hadoop supergroup 46 2018-12-27 13:44 /user/hadoop/output/wc.input
Snapshot management
- A snapshot is essentially a backup of a directory. It does not copy all the files immediately; it only references the existing files. New data is written only when later writes occur (copy-on-write).
- (1) hdfs dfsadmin -allowSnapshot <path> (enable snapshots on the given directory)
- (2) hdfs dfsadmin -disallowSnapshot <path> (disable snapshots on the given directory; disabled by default)
- (3) hdfs dfs -createSnapshot <path> (create a snapshot of the directory)
- (4) hdfs dfs -createSnapshot <path> <name> (create a snapshot with the given name)
- (5) hdfs dfs -renameSnapshot <path> <oldName> <newName> (rename a snapshot)
- (6) hdfs lsSnapshottableDir (list all snapshottable directories for the current user)
- (7) hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot> (report the differences between two snapshots of the directory)
- (8) hdfs dfs -deleteSnapshot <path> <name> (delete a snapshot)
(1) Enable/disable snapshots on a directory
[hadoop@cmaster0 hadoop]$ hdfs dfsadmin -allowSnapshot /user/hadoop/output
Allowing snaphot on /user/hadoop/output succeeded
(2) Create a snapshot of a directory
[hadoop@cmaster0 hadoop]$ hdfs dfs -createSnapshot /user/hadoop/output
Created snapshot /user/hadoop/output/.snapshot/s20181227-135036.020
[hadoop@cmaster0 hadoop]$ hadoop fs -ls -R /user/hadoop/output
-rw-r--r-- 3 hadoop supergroup 101 2018-12-27 13:44 /user/hadoop/output/NOTICE.txt
-rw-r--r-- 3 hadoop supergroup 49 2018-12-27 13:44 /user/hadoop/output/aa.input
-rw-r--r-- 3 hadoop supergroup 23 2018-12-27 13:44 /user/hadoop/output/bb.txt
-rw-r--r-- 3 hadoop supergroup 33 2018-12-27 13:44 /user/hadoop/output/tt.input
-rw-r--r-- 3 hadoop supergroup 46 2018-12-27 13:44 /user/hadoop/output/wc.input
[hadoop@cmaster0 hadoop]$ hadoop fs -ls -R /user/hadoop/output/.snapshot/
drwxr-xr-x - hadoop supergroup 0 2018-12-27 13:50 /user/hadoop/output/.snapshot/s20181227-135036.020
-rw-r--r-- 3 hadoop supergroup 101 2018-12-27 13:44 /user/hadoop/output/.snapshot/s20181227-135036.020/NOTICE.txt
-rw-r--r-- 3 hadoop supergroup 49 2018-12-27 13:44 /user/hadoop/output/.snapshot/s20181227-135036.020/aa.input
-rw-r--r-- 3 hadoop supergroup 23 2018-12-27 13:44 /user/hadoop/output/.snapshot/s20181227-135036.020/bb.txt
-rw-r--r-- 3 hadoop supergroup 33 2018-12-27 13:44 /user/hadoop/output/.snapshot/s20181227-135036.020/tt.input
-rw-r--r-- 3 hadoop supergroup 46 2018-12-27 13:44 /user/hadoop/output/.snapshot/s20181227-135036.020/wc.input
(3) List all snapshottable directories for the current user
[hadoop@cmaster0 hadoop]$ hdfs lsSnapshottableDir
drwxr-xr-x 0 hadoop supergroup 0 2018-12-27 13:50 1 65536 /user/hadoop/output
(4) Compare two snapshots of a directory
[hadoop@cmaster0 wcinput]$ hadoop fs -ls -R /user/hadoop/output
-rw-r--r-- 3 hadoop supergroup 101 2018-12-27 13:44 /user/hadoop/output/NOTICE.txt
-rw-r--r-- 3 hadoop supergroup 49 2018-12-27 13:44 /user/hadoop/output/aa.input
-rw-r--r-- 3 hadoop supergroup 23 2018-12-27 13:44 /user/hadoop/output/bb.txt
-rw-r--r-- 3 hadoop supergroup 33 2018-12-27 13:44 /user/hadoop/output/tt.input
-rw-r--r-- 3 hadoop supergroup 46 2018-12-27 13:44 /user/hadoop/output/wc.input
[hadoop@cmaster0 wcinput]$ hadoop fs -put bb.input /user/hadoop/output
[hadoop@cmaster0 wcinput]$ hadoop fs -rm /user/hadoop/output/NOTICE.txt
18/12/27 13:57:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop/output/NOTICE.txt
[hadoop@cmaster0 wcinput]$ hadoop fs -ls -R /user/hadoop/output
-rw-r--r-- 3 hadoop supergroup 49 2018-12-27 13:44 /user/hadoop/output/aa.input
-rw-r--r-- 3 hadoop supergroup 66 2018-12-27 13:56 /user/hadoop/output/bb.input
-rw-r--r-- 3 hadoop supergroup 23 2018-12-27 13:44 /user/hadoop/output/bb.txt
-rw-r--r-- 3 hadoop supergroup 33 2018-12-27 13:44 /user/hadoop/output/tt.input
-rw-r--r-- 3 hadoop supergroup 46 2018-12-27 13:44 /user/hadoop/output/wc.input
[hadoop@cmaster0 wcinput]$ hdfs snapshotDiff /user/hadoop/output/ . .snapshot/s20181227-135036.020
Difference between current directory and snapshot s20181227-135036.020 under directory /user/hadoop/output:
M .
- ./bb.input
+ ./NOTICE.txt
(5) Restore from a snapshot
[hadoop@cmaster0 wcinput]$ hadoop fs -ls -R /user/hadoop/bak
[hadoop@cmaster0 wcinput]$ hdfs dfs -cp /user/hadoop/output/.snapshot/s20181227-135036.020/*.* /user/hadoop/bak
[hadoop@cmaster0 wcinput]$ hadoop fs -ls -R /user/hadoop/bak
-rw-r--r-- 3 hadoop supergroup 101 2018-12-27 14:07 /user/hadoop/bak/NOTICE.txt
-rw-r--r-- 3 hadoop supergroup 49 2018-12-27 14:07 /user/hadoop/bak/aa.input
-rw-r--r-- 3 hadoop supergroup 23 2018-12-27 14:07 /user/hadoop/bak/bb.txt
-rw-r--r-- 3 hadoop supergroup 33 2018-12-27 14:07 /user/hadoop/bak/tt.input
-rw-r--r-- 3 hadoop supergroup 46 2018-12-27 14:07 /user/hadoop/bak/wc.input
Trash
1) Default trash settings
- By default fs.trash.interval=0; 0 disables the trash. A positive value is the number of minutes a deleted file is kept before being permanently removed.
- By default fs.trash.checkpoint.interval=0, the interval in minutes between trash checkpoints; if 0, it is set to the value of fs.trash.interval.
- fs.trash.checkpoint.interval must be <= fs.trash.interval.
2) Enable the trash by editing core-site.xml
<property>
<!-- Keep deleted files in the trash for 1 minute -->
<name>fs.trash.interval</name>
<value>1</value>
</property>
<property>
<!-- Static user for the HTTP web UIs; the default is dr.who. Set it to the hadoop user so the trash directory can be browsed from the web UI -->
<name>hadoop.http.staticuser.user</name>
<value>hadoop</value>
</property>
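After editing core-site.xml and restarting the affected daemons, the effective values can be checked with the standard getconf tool:

```shell
# Print the effective trash settings as the cluster sees them
hdfs getconf -confKey fs.trash.interval
hdfs getconf -confKey fs.trash.checkpoint.interval
```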
3) Files deleted programmatically do not go through the trash; call moveToTrash() to move them there explicitly:
Trash trash = new Trash(conf);
trash.moveToTrash(path);
4) Restore data from the trash
[hadoop@cmaster0 hadoop]$ hadoop fs -rm /user/hadoop/test/aa.input
18/12/27 14:24:39 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://cmaster0:8020/user/hadoop/test/aa.input' to trash at: hdfs://cmaster0:8020/user/hadoop/.Trash/Current
[hadoop@cmaster0 hadoop]$ hadoop fs -mv /user/hadoop/.Trash/181227142500/user/hadoop/test/aa.input /user/hadoop/test/
5) Empty the trash
[hadoop@cmaster0 hadoop]$ hdfs dfs -expunge
18/12/27 14:29:40 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1 minutes, Emptier interval = 0 minutes.
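Conversely, when a file should bypass the trash even while trash is enabled, -rm accepts a flag for that (the path below is illustrative):

```shell
# Permanently delete a file, skipping the trash entirely
hadoop fs -rm -skipTrash /user/hadoop/test/aa.input
```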