Hive Tutorial (Part 2)

I. Hive File Formats

A table stored as text files:

create table t_access_text(ip string,url string,access_time string)
row format delimited fields terminated by ','
stored as textfile;

A table stored as sequence files:

create table t_access_seq(ip string,url string,access_time string)
stored as sequencefile;

A table stored as Parquet files:

create table t_access_parq(ip string,url string,access_time string)
stored as parquetfile;

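II. Data Types

1. Integer types

Integer data normally uses the INT type. When the values exceed the range of INT, use BIGINT; when the range is smaller than that of INT, SMALLINT is sufficient. TINYINT is smaller still than SMALLINT.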

Type	Suffix	Example
TINYINT	Y	10Y
SMALLINT	S	10S
INT	-	10
BIGINT	L	10L

create table t_test(a string ,b int,c bigint,d float,e double,f tinyint,g smallint);
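The size relationship between the four integer types can be made concrete with a short Python sketch (Python is used here only because it runs without a cluster). The helper name smallest_integer_type is hypothetical; the ranges assume the usual 1, 2, 4, and 8 byte signed integers that TINYINT, SMALLINT, INT, and BIGINT use.

```python
# Signed value ranges of Hive's integer types, narrowest type first.
INTEGER_RANGES = [
    ("TINYINT", -2**7, 2**7 - 1),     # 1 byte
    ("SMALLINT", -2**15, 2**15 - 1),  # 2 bytes
    ("INT", -2**31, 2**31 - 1),       # 4 bytes
    ("BIGINT", -2**63, 2**63 - 1),    # 8 bytes
]


def smallest_integer_type(value: int) -> str:
    """Return the narrowest Hive integer type that can hold value (hypothetical helper)."""
    for name, low, high in INTEGER_RANGES:
        if low <= value <= high:
            return name
    raise ValueError(f"{value} does not fit in BIGINT")


print(smallest_integer_type(10))
print(smallest_integer_type(40000))
print(smallest_integer_type(3_000_000_000))
```

In practice INT is the usual default, and the narrower types are mainly worth choosing when storage size matters.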

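2. String types

String literals can be written with single quotes ('') or double quotes (""). Besides STRING, Hive provides two length-bounded string types: VARCHAR and CHAR.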

3. Date

A DATE value describes a year/month/day in the form {{YYYY-MM-DD}}.

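4. Timestamp

TIMESTAMP supports the traditional UNIX timestamp with optional nanosecond precision. It accepts the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff", where the fractional part holds up to nine digits.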

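5. Decimal

Hive's DECIMAL type has the same format as Java's BigDecimal and represents immutable, arbitrary-precision decimal numbers. Syntax and example: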

DECIMAL(precision, scale)
decimal(10,0)

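6. Union type

A union is a collection of heterogeneous data types; each value holds exactly one of the declared member types. An instance can be created with create_union. Syntax and examples:

{index: value} The leading number is the zero-based index of the member type in the uniontype definition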

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1} 
{1:2.0} 
{2:["three","four"]} 
{3:{"a":5,"b":"five"}} 
{2:["six","seven"]} 
{3:{"a":8,"b":"eight"}} 
{0:9} 
{1:10.0}
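The tag-plus-value reading of the rows above can be sketched in Python. This is only a model of how the index selects a member type, not Hive's internal representation; MEMBER_TYPES and describe are hypothetical names.

```python
from typing import Any, Tuple

# Member types of UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>,
# in declaration order, so the list index equals the union tag.
MEMBER_TYPES = ["int", "double", "array<string>", "struct<a:int,b:string>"]


def describe(tagged: Tuple[int, Any]) -> str:
    """Render a (tag, value) pair together with the member type the tag selects."""
    tag, value = tagged
    return f"{MEMBER_TYPES[tag]}: {value!r}"


rows = [(0, 1), (1, 2.0), (2, ["three", "four"]), (3, {"a": 5, "b": "five"})]
for row in rows:
    print(describe(row))
```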

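7. Floating-point types: numbers with a fractional part. Such data is usually stored as DOUBLE (FLOAT is the single-precision alternative).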

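8. Complex types

(1) Arrays

Given the following data, the lead-actors attribute is conveniently modeled as an array (all elements share one data type, written inside <>).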

战狼2,吴京:吴刚:龙母,2017-08-16
三生三世十里桃花,刘亦菲:痒痒,2017-08-20

 ARRAY<data_type>

-- Create the table
create table t_movie(moive_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';

-- Queries
select moive_name,actors[0] from t_movie;
select moive_name,actors from t_movie where array_contains(actors,'吴刚');       
select moive_name,size(actors) from t_movie;

(2) Maps

Given the following data, the family members are best modeled with the map type.

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

MAP<primitive_type, data_type>

-- Create the table:
create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

-- Queries
-- Get the value for a given key of the map column
select id,name,family_members['father'] as father from t_person;

-- Get all keys of the map column
select id,name,map_keys(family_members) as relation from t_person;

-- Get all values of the map column
select id,name,map_values(family_members) from t_person;
select id,name,map_values(family_members)[0] from t_person;
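How the three delimiters in the t_person definition carve up one line of text can be illustrated with a small Python sketch. It only mimics the splitting order (fields, then collection items, then map keys) and is not how Hive's SerDe is implemented; parse_person is a hypothetical helper.

```python
def parse_person(line: str) -> dict:
    """Split one text row using the delimiters declared for t_person."""
    # fields terminated by ','
    id_, name, members, age = line.split(",")
    family = {}
    # collection items terminated by '#'
    for item in members.split("#"):
        # map keys terminated by ':'
        key, value = item.split(":", 1)
        family[key] = value
    return {"id": int(id_), "name": name, "family_members": family, "age": int(age)}


row = parse_person("1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28")
print(row["family_members"]["father"])           # like family_members['father']
print(list(row["family_members"].keys()))        # like map_keys(family_members)
print(list(row["family_members"].values())[0])   # like map_values(family_members)[0]
```

A key that is absent from a row, such as brother in the fourth sample line, yields NULL in Hive; in this model that corresponds to the key simply not being in the dictionary.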

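(3) Structs

Given the following data, where the personal information consists of an integer, a string, and a string, the three values can be held in a single column of type struct.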

1,zhangsan,18:male:beijing
2,lisi,28:female:shanghai

STRUCT<col_name : data_type [COMMENT col_comment], ...>

-- Create the table:
create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';

-- Queries
select * from t_person_struct;
select id,name,info.age from t_person_struct;

III. Altering Table Definitions

Rename a table:

ALTER TABLE table_name RENAME TO new_table_name

Rename a partition:

alter table t_partition partition(department='xiangsheng',sex='male',howold=20) rename to partition(department='1',sex='1',howold=20);

Add a partition:

alter table t_partition add partition (department='2',sex='0',howold=40);

Drop a partition:

alter table t_partition drop partition (department='2',sex='2',howold=24);

Change the file format of a table:

ALTER TABLE table_name [PARTITION partitionSpec] SET FILEFORMAT file_format

Change a column definition:

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

alter table t_user change price jiage float first;

Add or replace columns:

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type[COMMENT col_comment], ...)  

alter table t_user add columns (sex string,addr string);
alter table t_user replace columns (id string,age int,price float);

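IV. Hive Query Syntax

1. Local mode

Operations on small amounts of data can run locally, which is much faster than submitting a job to the cluster. Enable Hive's automatic local mode with:

hive> set hive.exec.mode.local.auto=true;   (default: false)

A job actually runs in local mode only when it meets all of the following conditions:
1. The job's input size must be smaller than hive.exec.mode.local.auto.inputbytes.max (default 128MB)
2. The job's number of map tasks must be smaller than hive.exec.mode.local.auto.tasks.max (default 4)
3. The job's number of reduce tasks must be 0 or 1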
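The three conditions combine as a simple conjunction, which the following Python sketch spells out. It models only the rule as stated above, with the defaults quoted in the list; can_run_locally and its parameter names are hypothetical and are not part of Hive.

```python
def can_run_locally(input_bytes: int, map_tasks: int, reduce_tasks: int,
                    max_input_bytes: int = 128 * 1024 * 1024,
                    max_tasks: int = 4) -> bool:
    """Return True when a job meets all three local-mode conditions."""
    return (input_bytes < max_input_bytes   # condition 1: input size
            and map_tasks < max_tasks       # condition 2: number of map tasks
            and reduce_tasks in (0, 1))     # condition 3: 0 or 1 reduce task


MB = 1024 * 1024
print(can_run_locally(10 * MB, map_tasks=2, reduce_tasks=1))
print(can_run_locally(200 * MB, map_tasks=2, reduce_tasks=1))
```

Failing any one condition is enough to send the job to the cluster, which is why a small query with two reducers still does not run locally.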

2. Conditional queries

select * from t_access where access_time<'2017-08-06 15:30:20';
select * from t_access where access_time<'2017-08-06 16:30:20' and ip>'192.168.33.3';

3. Aggregate functions

1) Total number of rows (count)

hive (default)> select count(*) cnt from emp;

2) Maximum salary (max)

hive (default)> select max(sal) max_sal from emp;

3) Minimum salary (min)

hive (default)> select min(sal) min_sal from emp;

4) Sum of salaries (sum)

hive (default)> select sum(sal) sum_sal from emp;

5) Average salary (avg)

hive (default)> select avg(sal) avg_sal from emp;

4. The limit clause

hive (default)> select * from emp limit 5;

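5. Grouped queries (group by)

Once a query has a group by clause, the select clause may contain only the grouping columns and aggregate functions.

-- Average salary of each department in the emp table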

hive (default)> select t.deptno, avg(t.sal) avg_sal from emp t group by t.deptno;

-- Highest salary for each job within each department of emp

hive (default)> select t.deptno, t.job, max(t.sal) max_sal from emp t group by t.deptno,t.job;

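select dt,count(*) as cnt,max(ip) as max_ip from t_access group by dt having dt>'20170804';

## Why must where come before group by, and why can conditions after group by only use having?

Because where filters the rows before the actual query logic (grouping and aggregation) runs,

whereas having filters the results again after the group by aggregation.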
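The difference in timing between the two filters can be shown with a small Python model. The sample rows and the helper avg_sal_by_dept are made up for illustration; the point is only the order of the steps: where acts on single rows before grouping, having acts on the aggregated groups.

```python
from collections import defaultdict

# Hypothetical sample data standing in for an emp table.
rows = [
    {"deptno": 10, "sal": 1000},
    {"deptno": 10, "sal": 3000},
    {"deptno": 20, "sal": 500},
    {"deptno": 20, "sal": 700},
]


def avg_sal_by_dept(rows, where, having):
    """Model of: select deptno, avg(sal) ... where ... group by deptno having ..."""
    groups = defaultdict(list)
    for r in rows:
        if where(r):                               # WHERE: filter individual rows
            groups[r["deptno"]].append(r["sal"])   # GROUP BY deptno
    averages = {d: sum(s) / len(s) for d, s in groups.items()}      # avg(sal)
    return {d: a for d, a in averages.items() if having(d, a)}      # HAVING: filter groups


print(avg_sal_by_dept(rows, where=lambda r: r["sal"] > 600,
                      having=lambda dept, avg: avg > 1000))
```

A condition on an aggregate such as avg(sal) cannot be evaluated in where, because at that stage the groups do not exist yet.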

6. Join queries

(1) inner join (join)

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name;

(2) left outer join (left join)

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name;

(3) right outer join (right join)

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name;

(4) full outer join (full join)

select
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
full join t_b b
on a.name=b.name;

(5) left semi join

Hive versions before 0.13 do not support IN/EXISTS subqueries; left semi join achieves the same effect:

select 
a.name as aname,
a.numb as anumb
from t_a a
left semi join t_b b
on a.name=b.name;

In a left semi join, the select clause must not reference columns of the right table.
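The way a left semi join differs from an inner join is easiest to see on a tiny example. The Python sketch below uses made-up rows for t_a and t_b; it models the semantics only (a left row is kept once if any right row matches) and says nothing about how Hive executes the join.

```python
# Hypothetical contents: t_a is (name, numb), t_b is (name, nick).
t_a = [("a", 1), ("b", 2), ("c", 3)]
t_b = [("a", "xx"), ("a", "yy"), ("c", "zz")]


def inner_join(left, right):
    """Every matching pair produces a row, so duplicates on the right multiply rows."""
    return [(lname, numb) for lname, numb in left for rname, _ in right if lname == rname]


def left_semi_join(left, right):
    """A left row is emitted at most once, if at least one right row matches."""
    right_keys = {rname for rname, _ in right}
    return [(lname, numb) for lname, numb in left if lname in right_keys]


print(inner_join(t_a, t_b))
print(left_semi_join(t_a, t_b))
```

Because the right table only decides whether a left row survives, none of its columns are available in the output, which matches the restriction on the select clause.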

7. Subqueries

select id,name,brother 
from 
(select id,name,family_members['brother'] as brother from t_person) tmp
where brother is not null;

V. Hive Functions

1. Type conversion function cast

select cast("5" as int);
select cast("2017-08-03" as date) ;
select cast(current_timestamp as date);

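2. Mathematical functions

select round(5.4);   -- 5, rounds to the nearest integer
select round(5.1345,3);  -- 5.135, rounds to three decimal places
select ceil(5.4);   -- 6, the smallest integer not less than 5.4; ceiling(5.4) is equivalent
select floor(5.4);  -- 5, the largest integer not greater than 5.4
select abs(-5.4);  -- 5.4, absolute value
select greatest(3,5,6);  -- 6, the largest argument
select least(3,5,6);    -- 3, the smallest argument

select max(age) from t_person;    -- aggregate function
select min(age) from t_person;    -- aggregate function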

3. String functions

(1) Substring

substr(string, int start)        -- extract a substring
substring(string, int start)     -- from position start to the end; positions are 1-based, and 0 is treated the same as 1
Example: select substr("abcdefg",2);   --> bcdefg


substr(string, int start, int len) 
substring(string, int start, int len)
Example: select substr("abcdefg",2,3);     --> bcd
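The 1-based indexing is a common source of off-by-one mistakes, so here is a small Python model of the two call forms. It covers only non-negative start positions, treats 0 like 1 as described above, and is a sketch rather than Hive's implementation; hive_substr is a hypothetical name.

```python
from typing import Optional


def hive_substr(s: str, start: int, length: Optional[int] = None) -> str:
    """Model of substr(string, start[, len]) for start >= 0 (1-based positions)."""
    if start < 0:
        raise ValueError("negative start positions are outside the scope of this sketch")
    begin = max(start - 1, 0)          # position 1 (or 0) maps to Python index 0
    if length is None:
        return s[begin:]
    return s[begin:begin + max(length, 0)]


print(hive_substr("abcdefg", 2))       # same call as substr("abcdefg",2)
print(hive_substr("abcdefg", 2, 3))    # same call as substr("abcdefg",2,3)
```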

(2) Concatenation

concat(string A, string B...)  -- concatenate strings
concat_ws(string SEP, string A, string B...)  -- concatenate strings with the separator SEP

-- Examples
select concat("ab","xy");
select concat_ws(".","192","168","33","44");

(3) Length

length(string A)
Example: select length("192.168.33.44");

(4) Split

split(string str, string pat)

select split("192.168.33.44","\\.");   -- \\. escapes the dot, because pat is a regular expression

(5) Upper and lower case

upper(string str)   -- convert to upper case
lower(string str)   -- convert to lower case

4. Time functions

Get the current time

select current_timestamp;          -->2019-10-21 12:26:43.872
select current_date;               -->2019-10-21

Get the current UNIX timestamp (in seconds, not milliseconds)

select unix_timestamp();

Convert a UNIX timestamp to a string

from_unixtime(bigint unixtime[, string format])

Example: select from_unixtime(unix_timestamp());
select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");   -- with an explicit format

Convert a string to a UNIX timestamp

unix_timestamp(string date, string pattern)

Example: select unix_timestamp("2017-08-10 17:50:30");

select unix_timestamp("2017/08/10 17:50:30","yyyy/MM/dd HH:mm:ss");
-- Convert a string to a date
select to_date("2017-09-17 16:58:32");

5. Table-generating functions

(1) Use explode() to "explode" an array column into one row per element

select explode(subject) from student;

Explode and deduplicate

select distinct tmp.sub
from 
(select explode(subjects) as sub from t_stu_subject) tmp;

(2) Table-generating functions with lateral view

select id,name,tmp.sub 
from score lateral view explode(subject) tmp as sub;

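How to read this: lateral view behaves like a join between two tables.

Left table: the original table.

Right table: the table produced by explode(some collection column).

In addition, the join only pairs values that come from the same source row.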
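The "join within the same row" reading can be modeled in a few lines of Python. The sample rows for the score table are invented, and lateral_view_explode is a hypothetical helper; it mirrors the query with subject exploded under the alias sub.

```python
# Hypothetical rows of the score table; subject is an array column.
score = [
    {"id": 1, "name": "zhangsan", "subject": ["math", "english"]},
    {"id": 2, "name": "lisi", "subject": ["physics"]},
    {"id": 3, "name": "wangwu", "subject": []},
]


def lateral_view_explode(rows, column, alias):
    """Pair each source row with every element of its own array, and with nothing else."""
    out = []
    for row in rows:                    # left side: the original row
        for element in row[column]:     # right side: explode(column) for that same row
            out.append({"id": row["id"], "name": row["name"], alias: element})
    return out


for r in lateral_view_explode(score, "subject", "sub"):
    print(r)
```

A row whose array is empty produces no output rows in this model, which matches a plain lateral view; Hive's lateral view outer keeps such rows with NULL instead.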

6. Collection functions

(1) array_contains(Array<T>, value) returns a boolean. Examples:

select moive_name,array_contains(actors,'吴刚') from t_movie;
select array_contains(array('a','b','c'),'c');

(2) sort_array(Array<T>) returns the sorted array

select sort_array(array('c','b','a')); --> ["a","b","c"]

(3) size(Array<T>) returns an int

select moive_name,size(actors) as actor_number from t_movie;
--> returns the array length for each row

size(Map<K,V>)  returns an int
map_keys(Map<K,V>)  returns an array
map_values(Map<K,V>) returns an array

7. CASE WHEN

Syntax:
CASE   [ expression ]
       WHEN condition1 THEN result1
       WHEN condition2 THEN result2
       ...
       WHEN conditionn THEN resultn
       ELSE result
END

Example:
select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;

8. The if function

select id,if(age>25,'working','worked') from t_user;

select moive_name,if(array_contains(actors,'吴刚'),'好电影','...') from t_movie;

9. JSON parsing functions (producing a new table)

The json_tuple function

select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_rating_json;

10. Analytic functions

The row_number function: partition the rows by sex, order each partition by age descending, and number the rows

select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rank
from t_rownumber;
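What row_number() over(partition by sex order by age desc) computes can be traced with a short Python model. The rows in people are invented sample data for t_rownumber, and the function below is a hypothetical stand-in for the window function, not Hive code.

```python
from itertools import groupby


def row_number(rows, partition_key, order_key, descending=True):
    """Number the rows 1, 2, 3, ... inside each partition, in the requested order."""
    out = []
    ordered = sorted(rows, key=lambda r: r[partition_key])     # bring partitions together
    for _, group in groupby(ordered, key=lambda r: r[partition_key]):
        members = sorted(group, key=lambda r: r[order_key], reverse=descending)
        for n, r in enumerate(members, start=1):               # numbering restarts per partition
            out.append({**r, "rank": n})
    return out


# Hypothetical sample rows.
people = [
    {"id": 1, "age": 18, "name": "a", "sex": "male"},
    {"id": 2, "age": 19, "name": "b", "sex": "male"},
    {"id": 3, "age": 22, "name": "c", "sex": "female"},
    {"id": 4, "age": 16, "name": "d", "sex": "female"},
    {"id": 5, "age": 30, "name": "e", "sex": "male"},
]

for r in row_number(people, "sex", "age"):
    print(r["sex"], r["age"], r["rank"])
```

Filtering the numbered result on rank = 1 in an outer query is a common way to pick the top row of each group.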

辛聪明
