Hi everyone:
While running a HiveQL query today, I ran into a "file does not exist" error. Below are the test steps that reproduce the problem and the fix; I hope you find them useful.
-- Create a test table
create table employ_test(
employ_id BIGINT comment '员工编码',
salary DECIMAL(20,2) COMMENT '员工薪水'
)
comment '员工信息测试表,测试删除分区文件'
PARTITIONED BY (dept_no STRING comment '部门号');
-- Insert test data
insert into employ_test PARTITION (dept_no = '11') values('11101',88);
insert into employ_test PARTITION (dept_no = '11') values('11102',89);
insert into employ_test PARTITION (dept_no = '11') values('11101',90);
insert into employ_test PARTITION (dept_no = '22') values('22201',88);
-- Verify the data
hive> select * from employ_test;
OK
11101 88 11
11102 89 11
11101 90 11
22201 88 22
-- Check the table's partitions
hive> show partitions employ_test;
OK
dept_no=11
dept_no=22
-- Run a test aggregation query
hive> select dept_no,sum(salary) as salary from employ_test group by dept_no;
(intermediate MapReduce/Tez job output omitted)
11 267
22 88
Time taken: 6.054 seconds, Fetched: 2 row(s)
-- Inspect the data files on HDFS
[bxapp@bzcrkmfx0ap1001 ~]$ hadoop fs -ls /user/testBT/dbc/employ_test
Found 2 items
drwx------ - testBT hdfs 0 2017-11-27 09:53 /user/testBT/dbc/employ_test/dept_no=11
drwx------ - testBT hdfs 0 2017-11-27 10:13 /user/testBT/dbc/employ_test/dept_no=22
-- Manually delete a partition directory on HDFS
[bxapp@bzcrkmfx0ap1001 ~]$ hadoop fs -rm -r /user/testBT/dbc/employ_test/dept_no=22
17/11/27 10:14:13 INFO fs.TrashPolicyDefault: Moved: 'hdfs://cluster/user/testBT/dbc/employ_test/dept_no=22' to trash at: hdfs://cluster/user/testBT/.Trash/Current/user/testBT/dbc/employ_test/dept_no=221511748853212
[bxapp@bzcrkmfx0ap1001 ~]$ hadoop fs -ls /user/testBT/dbc/employ_test
Found 1 items
drwx------ - testBT hdfs 0 2017-11-27 09:53 /user/testBT/dbc/employ_test/dept_no=11
-- Check the table's partition metadata again
hive> show partitions employ_test;
OK
dept_no=11
dept_no=22
Time taken: 0.56 seconds, Fetched: 2 row(s)
The result clearly shows one thing: after manually deleting the data files on HDFS, the table's partition metadata in the metastore is not updated.
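To confirm this from the Hive side, you can ask the metastore directly for the partition's metadata; it should still return the details, including the now-dangling HDFS location:

```sql
-- The metastore still returns metadata (including the HDFS location)
-- for dept_no=22 even though the directory has been deleted.
DESCRIBE FORMATTED employ_test PARTITION (dept_no='22');
```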
-- Run the HQL query again
hive> select dept_no,sum(salary) as salary from employ_test group by dept_no;
Query ID = bxapp_20171127101701_bfc89d7a-0ec0-41bd-a215-c0bf58c9a7a1
Total jobs = 1
Launching Job 1 out of 1
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 FAILED -1 0 0 -1 0 0
Reducer 2 KILLED 2 0 0 2 0 0
--------------------------------------------------------------------------------
VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 1511749120.00 s
--------------------------------------------------------------------------------
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1509332682299_65302_2_00, diagnostics=[Vertex vertex_1509332682299_65302_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: employ_test initializer failed, vertex=vertex_1509332682299_65302_2_00 [Map 1], org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://cluster/user/testBT/dbc/employ_test/dept_no=22
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
As the stack trace shows, the query fails because the input path (the deleted partition directory) no longer exists.
-- Querying only the partition whose data files still exist succeeds
hive> select dept_no,sum(salary) as salary from employ_test where dept_no='11' group by dept_no;
Query ID = bxapp_20171127101918_f5e6f62c-5c09-47c1-b2e8-85b02307874c
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1509332682299_65302)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 2 2 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 4.28 s
--------------------------------------------------------------------------------
OK
11 267
Time taken: 5.805 seconds, Fetched: 1 row(s)
-- Manually drop the stale partition from the metastore
hive> alter table employ_test drop partition (dept_no = '22');
OK
Time taken: 0.457 seconds
hive> show partitions employ_test;
OK
dept_no=11
Time taken: 0.456 seconds, Fetched: 1 row(s)
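As an aside, for a managed (non-external) table like this one, dropping the partition through Hive in the first place would have removed both the metastore entry and the HDFS directory in one step, avoiding the inconsistency altogether:

```sql
-- For a managed table, DROP PARTITION removes the metastore entry AND
-- the underlying HDFS directory (moved to trash); IF EXISTS avoids an
-- error when the partition has already been dropped.
ALTER TABLE employ_test DROP IF EXISTS PARTITION (dept_no = '22');
```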
-- Run the HQL query once more
hive> select dept_no,sum(salary) as salary from employ_test group by dept_no;
Query ID = bxapp_20171127102150_12f76029-edf4-437f-9db7-84ea2bf6f2d8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1509332682299_65302)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 2 2 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.85 s
--------------------------------------------------------------------------------
OK
11 267
Time taken: 6.719 seconds, Fetched: 1 row(s)
Conclusion: deleting Hive data files directly on HDFS does not update the metastore. The corresponding partition metadata must also be dropped by hand; otherwise any query that touches the now-missing partition will fail.
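When many partitions are out of sync, dropping them one by one gets tedious. If you are on Hive 3.0 or later, `MSCK REPAIR TABLE ... SYNC PARTITIONS` can reconcile the metastore with HDFS in both directions (note this is version-dependent: on older releases, MSCK REPAIR only adds missing partitions, so stale entries still need an explicit DROP PARTITION as shown above):

```sql
-- Hive 3.0+: drop metastore partitions whose HDFS directories are gone,
-- and add partitions for directories that lack metastore entries.
MSCK REPAIR TABLE employ_test SYNC PARTITIONS;
```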