Basic Hive database and table operations
Create a database
hive> create database if not exists db1;
hive> create schema if not exists db2;
Drop a database
hive> drop database db2;
hive> drop schema db1;
Create a table
CREATE TABLE IF NOT EXISTS employee (
  eid int,
  name String,
  salary String,
  job string,
  year int)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Load data into the table
Prepare the data file sample.txt (fields are tab-separated, matching the table's field delimiter)
[root@g12-1 ~]# cat /tmp/sample.txt
1201    Gopal        45000    TechnicalManager    2013
1202    Manisha      45000    ProofReader         2013
1203    Masthanvali  40000    TechnicalWriter     2014
1204    Kiran        40000    HrAdmin             2014
[root@g12-1 ~]#
Load the file into the table
hive> LOAD DATA LOCAL INPATH '/tmp/sample.txt' OVERWRITE INTO TABLE employee;
Loading data to table db1.employee
Table db1.employee stats: [numFiles=1, numRows=0, totalSize=150, rawDataSize=0]
OK
Time taken: 0.354 seconds
hive> select * from employee;
OK
1201    Gopal        45000    TechnicalManager    2013
1202    Manisha      45000    ProofReader         2013
1203    Masthanvali  40000    TechnicalWriter     2014
1204    Kiran        40000    HrAdmin             2014
Time taken: 0.094 seconds, Fetched: 4 row(s)
hive>
HiveQL
SELECT...WHERE
hive> select * from employee where salary > 40000;
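Note that salary was declared as String in the table definition, so the comparison above relies on Hive converting the string to a number implicitly. A minimal sketch of the same filter with the conversion made explicit (the CAST is an addition for illustration, not part of the original query):

```sql
-- salary is stored as a string; cast it so the comparison is clearly numeric
SELECT *
FROM employee
WHERE CAST(salary AS int) > 40000;
```

With the sample data loaded earlier, this should return the rows for employees 1201 and 1202.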
ORDER BY
hive> select * from employee order by eid;
GROUP BY
hive> select salary,count(salary) from employee group by salary;
SELECT...JOIN
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);
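The CUSTOMERS and ORDERS tables are not defined anywhere in these notes. To try the join, one possible pair of definitions is sketched below; only the column names come from the query, while the column types and the delimiter are assumptions.

```sql
-- Hypothetical schemas: column names are taken from the join query,
-- column types and the tab delimiter are assumed for illustration.
CREATE TABLE IF NOT EXISTS CUSTOMERS (
  ID   int,
  NAME string,
  AGE  int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS ORDERS (
  CUSTOMER_ID int,
  AMOUNT      double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
```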
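Partitioned tables
In Hive a database is a directory and a table is also a directory; each partition of a partitioned table is a subdirectory of the table directory.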
create table xxx(...) partitioned by (...);
alter table xxx add partition (...) ...
load data local inpath ... into table xxx partition (...);
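As a concrete sketch of the three statements above, the example below partitions the employee data by year. The table name employee_part and the file /tmp/sample_2013.txt are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table: same columns as employee, except that year
-- becomes the partition column instead of a regular column.
CREATE TABLE IF NOT EXISTS employee_part (
  eid    int,
  name   string,
  salary string,
  job    string
)
PARTITIONED BY (year int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Register a partition; Hive creates the subdirectory year=2013
-- under the table directory.
ALTER TABLE employee_part ADD IF NOT EXISTS PARTITION (year = 2013);

-- Load a local file into that partition. The path is an assumption:
-- a tab-separated file containing only the 2013 rows.
LOAD DATA LOCAL INPATH '/tmp/sample_2013.txt'
INTO TABLE employee_part PARTITION (year = 2013);

-- Alternatively, fill a partition from the unpartitioned employee table.
INSERT OVERWRITE TABLE employee_part PARTITION (year = 2014)
SELECT eid, name, salary, job FROM employee WHERE year = 2014;

-- List the partitions that now exist.
SHOW PARTITIONS employee_part;
```

A query that filters on the partition column, such as where year = 2013, only needs to read the matching subdirectory.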
Bucketed tables
create table xxx(...) ... clustered by (fieldName) into n buckets;
A bucketed table splits its data into n files according to the hash of the clustering column.
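A minimal sketch of a bucketed table built from the employee data follows. The table name employee_bucketed, the choice of eid as the clustering column, and the bucket count of 4 are all assumptions made for illustration.

```sql
-- On Hive versions before 2.0, run this first so that inserts honor the
-- bucket definition (later versions always enforce bucketing):
--   SET hive.enforce.bucketing = true;

-- Hypothetical bucketed copy of employee, hashed on eid into 4 files.
CREATE TABLE IF NOT EXISTS employee_bucketed (
  eid    int,
  name   string,
  salary string,
  job    string,
  year   int
)
CLUSTERED BY (eid) INTO 4 BUCKETS
STORED AS TEXTFILE;

-- Populate through INSERT ... SELECT so Hive hashes each row into its bucket;
-- LOAD DATA would only copy the file without bucketing it.
INSERT OVERWRITE TABLE employee_bucketed
SELECT eid, name, salary, job, year FROM employee;

-- Read a single bucket, for example to sample the data.
SELECT * FROM employee_bucketed TABLESAMPLE (BUCKET 1 OUT OF 4 ON eid);
```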
HiveQL tuning
1) EXPLAIN: show the execution plan of a query
explain extended select count(*) from employee;
explain formatted select count(*) from employee;
2) Enable LIMIT optimization, which samples the data instead of scanning the full table
select * from employee limit 1,2;
Set hive.limit.optimize.enable=true
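The setting can also be applied per session. The sketch below shows it together with two related properties; the values are illustrative and should be checked against the documentation of the Hive version in use.

```sql
-- Let simple LIMIT queries sample the source data instead of scanning all of it.
SET hive.limit.optimize.enable = true;

-- Row size (in bytes) assumed when estimating how much data to sample.
SET hive.limit.row.max.size = 100000;

-- Maximum number of files to sample.
SET hive.limit.optimize.limit.file = 10;

SELECT * FROM employee LIMIT 2;
```

Because the query works on a sample of the input, this trades completeness for speed, so it fits exploratory queries better than queries whose results must be exact.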
3)JOIN
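Use a map-side join (/*+ mapjoin(table) */) when one table is small enough to fit in memory, or name the table to be streamed with /*+ streamtable(table) */.
Order the joined tables so that their sizes increase from left to right: Hive buffers the earlier tables and streams the last one, so the largest table should come last.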
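The two join hints can be sketched on the CUSTOMERS/ORDERS query used earlier. This assumes CUSTOMERS is the smaller table and ORDERS the larger one, which the notes do not state.

```sql
-- Map-side join: the small table (alias c) is loaded into memory on each
-- mapper, so no reduce phase is needed for the join.
-- Newer Hive versions may ignore this hint unless
-- hive.ignore.mapjoin.hint is set to false.
SELECT /*+ MAPJOIN(c) */ c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

-- Reduce-side join: explicitly stream the large table (alias o)
-- instead of relying on its position in the FROM clause.
SELECT /*+ STREAMTABLE(o) */ c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

-- Alternatively, let Hive decide when a join can run map-side.
SET hive.auto.convert.join = true;
```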
4) Enable local mode, which runs all tasks of a job on a single machine
Suitable when the data set is small.
hive.exec.mode.local.auto=true // default: false
hive> set hive.exec.mode.local.auto=true;
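With automatic local mode on, Hive still decides per query whether the input is small enough to run locally. A sketch of the related thresholds follows; the values are illustrative and may differ from the defaults of a given Hive version.

```sql
SET hive.exec.mode.local.auto = true;

-- Run locally only if the total input size is below this many bytes (128 MB here).
SET hive.exec.mode.local.auto.inputbytes.max = 134217728;

-- Run locally only if the query reads at most this many input files.
SET hive.exec.mode.local.auto.input.files.max = 4;

-- A small query like this one then becomes a candidate for local execution.
SELECT count(*) FROM employee;
```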
...