Hi everyone:
While running a HiveQL query today, I ran into a "file does not exist" error. Below are the test steps that reproduce the problem and the fix; I hope you find them useful.
-- Create a test table
create table employ_test(
employ_id BIGINT comment '员工编码',
salary DECIMAL(20,2) COMMENT '员工薪水'
)
comment '员工信息测试表,测试删除分区文件'
PARTITIONED BY (dept_no STRING comment '部门号');
-- Insert test data
insert into employ_test PARTITION (dept_no = '11') values('11101',88);
insert into employ_test PARTITION (dept_no = '11') values('11102',89);
insert into employ_test PARTITION (dept_no = '11') values('11101',90);
insert into employ_test PARTITION (dept_no = '22') values('22201',88);
-- Verify the data
hive> select * from employ_test;
OK
11101 88 11
11102 89 11
11101 90 11
22201 88 22
-- Check the table's partitions
hive> show partitions employ_test;
OK
dept_no=11
dept_no=22
-- Run a test aggregation query
hive> select dept_no,sum(salary) as salary from employ_test group by dept_no;
(intermediate MapReduce/Tez job output omitted)
11 267
22 88
Time taken: 6.054 seconds, Fetched: 2 row(s)
-- Inspect the data files on HDFS
[bxapp@bzcrkmfx0ap1001 ~]$ hadoop fs -ls /user/testBT/dbc/employ_test
Found 2 items
drwx------ - testBT hdfs 0 2017-11-27 09:53 /user/testBT/dbc/employ_test/dept_no=11
drwx------ - testBT hdfs 0 2017-11-27 10:13 /user/testBT/dbc/employ_test/dept_no=22
-- Manually delete a partition directory on HDFS
[bxapp@bzcrkmfx0ap1001 ~]$ hadoop fs -rm -r /user/testBT/dbc/employ_test/dept_no=22
17/11/27 10:14:13 INFO fs.TrashPolicyDefault: Moved: 'hdfs://cluster/user/testBT/dbc/employ_test/dept_no=22' to trash at: hdfs://cluster/user/testBT/.Trash/Current/user/testBT/dbc/employ_test/dept_no=221511748853212
[bxapp@bzcrkmfx0ap1001 ~]$ hadoop fs -ls /user/testBT/dbc/employ_test
Found 1 items
drwx------ - testBT hdfs 0 2017-11-27 09:53 /user/testBT/dbc/employ_test/dept_no=11
-- Check the table's partition metadata again
hive> show partitions employ_test;
OK
dept_no=11
dept_no=22
Time taken: 0.56 seconds, Fetched: 2 row(s)
The result clearly shows one thing: after manually deleting the data files on HDFS, the table's partition metadata in the metastore is not updated.
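To confirm this from the Hive side, you can ask the metastore directly for the partition's metadata; it should still return the details, including the now-dangling HDFS location:

```sql
-- The metastore still returns metadata (including the HDFS location)
-- for dept_no=22 even though the directory has been deleted.
DESCRIBE FORMATTED employ_test PARTITION (dept_no='22');
```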
-- Run the HQL query again
hive> select dept_no,sum(salary) as salary from employ_test group by dept_no;
Query ID = bxapp_20171127101701_bfc89d7a-0ec0-41bd-a215-c0bf58c9a7a1
Total jobs = 1
Launching Job 1 out of 1
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 FAILED -1 0 0 -1 0 0
Reducer 2 KILLED 2 0 0 2 0 0
--------------------------------------------------------------------------------
VERTICES: 00/02 [>>--------------------------] 0% ELAPSED TIME: 1511749120.00 s
--------------------------------------------------------------------------------
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1509332682299_65302_2_00, diagnostics=[Vertex vertex_1509332682299_65302_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: employ_test initializer failed, vertex=vertex_1509332682299_65302_2_00 [Map 1], org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://cluster/user/testBT/dbc/employ_test/dept_no=22
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:306)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:408)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:155)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
As the stack trace shows, the query fails because the input path (the deleted partition directory) no longer exists.
-- Querying only the partition whose data files still exist succeeds
hive> select dept_no,sum(salary) as salary from employ_test where dept_no='11' group by dept_no;
Query ID = bxapp_20171127101918_f5e6f62c-5c09-47c1-b2e8-85b02307874c
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1509332682299_65302)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 2 2 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 4.28 s
--------------------------------------------------------------------------------
OK
11 267
Time taken: 5.805 seconds, Fetched: 1 row(s)
-- Manually drop the stale partition from the metastore
hive> alter table employ_test drop partition (dept_no = '22');
OK
Time taken: 0.457 seconds
hive> show partitions employ_test;
OK
dept_no=11
Time taken: 0.456 seconds, Fetched: 1 row(s)
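As an aside, for a managed (non-external) table like this one, dropping the partition through Hive in the first place would have removed both the metastore entry and the HDFS directory in one step, avoiding the inconsistency altogether:

```sql
-- For a managed table, DROP PARTITION removes the metastore entry AND
-- the underlying HDFS directory (moved to trash); IF EXISTS avoids an
-- error when the partition has already been dropped.
ALTER TABLE employ_test DROP IF EXISTS PARTITION (dept_no = '22');
```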
-- Run the HQL query once more
hive> select dept_no,sum(salary) as salary from employ_test group by dept_no;
Query ID = bxapp_20171127102150_12f76029-edf4-437f-9db7-84ea2bf6f2d8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1509332682299_65302)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 2 2 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.85 s
--------------------------------------------------------------------------------
OK
11 267
Time taken: 6.719 seconds, Fetched: 1 row(s)
Conclusion: deleting Hive data files directly on HDFS does not update the metastore. The corresponding partition metadata must also be dropped by hand; otherwise any query that touches the now-missing partition will fail.
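When many partitions are out of sync, dropping them one by one gets tedious. If you are on Hive 3.0 or later, `MSCK REPAIR TABLE ... SYNC PARTITIONS` can reconcile the metastore with HDFS in both directions (note this is version-dependent: on older releases, MSCK REPAIR only adds missing partitions, so stale entries still need an explicit DROP PARTITION as shown above):

```sql
-- Hive 3.0+: drop metastore partitions whose HDFS directories are gone,
-- and add partitions for directories that lack metastore entries.
MSCK REPAIR TABLE employ_test SYNC PARTITIONS;
```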