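For background, see the earlier post: "Elasticsearch backup and restore 3: writing HDFS data into Elasticsearch with ES-Hadoop"
The project is modeled on the tweets2HdfsMapper project from the book "Elasticsearch集成Hadoop最佳实践".
Project source: https://gitee.com/constfafa/ESToHDFS.git
Development process:
1. First, check the index data in Kibana: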
"hits": [
  {
    "_index": "xxx-words",
    "_type": "history",
    "_id": "zankHWUBk5wX4tbY-gpZ",
    "_score": 1,
    "_source": {
      "word": "abc",
      "createTime": "2018-08-09 16:56:00",
      "userId": "263",
      "datetime": "2018-08-09T16:56:00Z"
    }
  },
  {
    "_index": "xxx-words",
    "_type": "history",
    "_id": "zqntHWUBk5wX4tbYFAqy",
    "_score": 1,
    "_source": {
      "word": "bcd",
      "createTime": "2018-08-09 16:59:00",
      "userId": "263",
      "datetime": "2018-08-09T16:59:00Z"
    }
  }
]
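Each hit's _source above is what the job turns into one record of output in HDFS. As a minimal sketch of that per-document step, the plain-Java class below flattens one _source map into a quoted, comma-separated line. It is not the project's actual mapper: the class name HistoryLineSketch, the field order, and the CSV layout are assumptions made for illustration, and in the real job this logic would sit inside a Hadoop Mapper fed by EsInputFormat rather than in a main method.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Dependency-free sketch of flattening one Elasticsearch document into a text line. */
class HistoryLineSketch {

    // Field names taken from the _source shown above; the order is an assumption.
    static final String[] FIELDS = {"word", "createTime", "userId", "datetime"};

    /** Wraps a value in double quotes, doubling any embedded quotes; null becomes an empty quoted string. */
    static String quote(Object value) {
        if (value == null) {
            return "\"\"";
        }
        return "\"" + value.toString().replace("\"", "\"\"") + "\"";
    }

    /** Joins the selected fields of one document into a single comma-separated line. */
    static String toCsvLine(Map<String, ?> source) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < FIELDS.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(quote(source.get(FIELDS[i])));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // The first sample document from the Kibana output.
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("word", "abc");
        source.put("createTime", "2018-08-09 16:56:00");
        source.put("userId", "263");
        source.put("datetime", "2018-08-09T16:56:00Z");

        System.out.println(toCsvLine(source));
    }
}
```

Running the class prints one line for the first sample document. Inside a mapper, the same method could be called once per incoming record, with the resulting string written out as the record's text value.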
Then run hadoop jar history2hdfs-job.jar directly.
The run proceeds as follows:
[root@docker02 jar]# hadoop jar history2hdfs-job.jar
18/06/07 04:04:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/06/07 04:04:42 INFO client.RMProxy: Connecting to ResourceManager at /192.168.211.104:8032
18/06/07 04:04:48 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/06/07 04:04:55 INFO util.Version: Elasticsearch Hadoop v6.2.3 [039a45c5a1]
18/06/07 04:04:58 INFO mr.EsInputFormat: Reading from [hzeg-history-words/history]
18/06/07 04:04:58 INFO mr.EsInputFormat: Created [2] splits
18/06/07 04:05:00 INFO mapreduce.JobSubmitter: number of splits:2
18/06/07 04:05:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528305729734_0007
18/06/07 04:05:05 INFO impl.YarnClientImpl: Submitted application application_1528305729734_0007
18/06/07 04:05:06 INFO mapreduce.Job: The url to track the job: http://docker02:8088/proxy/application_1528305729734_0007/
18/06/07 04:05:06 INFO mapreduce.Job: Running job: job_1528305729734_0007
18/06/07 04:09:31 INFO mapreduce.Job: Job job_1528305729734_0007 running in uber mode : false
18/06/07 04:09:42 INFO mapreduce.Job: map 0% reduce 0%
18/06/07 04:15:36 INFO mapreduce.Job: map 100% reduce 0%
18/06/07 04:17:26 INFO mapreduce.Job: Job job_1528305729734_0007 completed successfully
18/06/07 04:17:56 INFO mapreduce.Job: Counters: 47
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=230906
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=74694
        HDFS: Number of bytes written=222
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=2
        Rack-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=791952
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=791952
        Total vcore-seconds taken by all map tasks=791952
        Total megabyte-seconds taken by all map tasks=810958848
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=74694
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=8106
        CPU time spent (ms)=20240
        Physical memory (bytes) snapshot=198356992
        Virtual memory (bytes) snapshot=4128448512
        Total committed heap usage (bytes)=32157696
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=222
    Elasticsearch Hadoop Counters
        Bulk Retries=0
        Bulk Retries Total Time(ms)=0
        Bulk Total=0
        Bulk Total Time(ms)=0
        Bytes Accepted=0
        Bytes Received=1102
        Bytes Retried=0
        Bytes Sent=296
        Documents Accepted=0
        Documents Received=0
        Documents Retried=0
        Documents Sent=0
        Network Retries=0
        Network Total Time(ms)=5973
        Node Retries=0
        Scroll Total=1
        Scroll Total Time(ms)=666
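Afterwards the job still reports that the job history server cannot be found; this does not affect the result.
Run the following command to check that the file and its data are correct
As you can see, the job ends up storing the index data in HDFS.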