HiveSQL Examples: Common Functions and Basic Syntax

This article uses a few simple cases to walk through Hive's basic syntax and some of its commonly used functions.

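Learning goals:

1. Master basic Hive syntax, the common functions, and how to combine them

2. Learn the analysis approach and implementation techniques for some basic business metrics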

1. Basic syntax:

SELECT …A… FROM …B… WHERE …C…

A: column names

B: table name

C: filter conditions

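Requirement 1:
For a marketing campaign, a merchant launches an "opposite-sex group buy" and wants to promote it to users in one region. Find matching users.
Approach:
One reasonable implementation is to select the user names of 10 female users whose city is Beijing. There are many valid ways to do this; any query that meets the requirement is fine.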

SELECT user_name
FROM user_info
WHERE city='beijing' and sex='female'
limit 10;

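limit 10 returns at most 10 rows. Without ORDER BY, which 10 rows come back is not guaranteed.

Requirement 2:
One day, food products turn out to be selling very well. Can you find a few serious foodies?
Approach:
There are many product categories, and here we only want food. Pick one specific day and look at the quantity and amount each user spent on food: for June 18, 2019, return the user name, purchase quantity, and payment amount for purchases whose category is food.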

SELECT user_name,piece,pay_amount
FROM user_trade
WHERE dt='2019-06-18' and goods_category='food' ;

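Note: if the table is partitioned, the WHERE clause should filter on the partition column. In strict mode (for example hive.mapred.mode=strict) Hive rejects a query on a partitioned table that has no partition filter, and even where such a query is allowed it scans every partition.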
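
As a minimal sketch of the partition note above, and assuming user_trade is partitioned by dt (the original does not show the table definition), the statements below list the table's partitions and then read a single one:

```sql
-- Assumption: user_trade is a partitioned table whose partition column is dt.
-- List the partitions that exist for the table.
SHOW PARTITIONS user_trade;

-- The filter on dt lets Hive read only the 2019-06-18 partition
-- instead of scanning the whole table.
SELECT count(*) AS row_num
FROM user_trade
WHERE dt = '2019-06-18';
```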

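GROUP BY
Grouping clause: groups rows so that aggregates are computed per group

Requirement 3:
Analyze the popularity and value of the company's products in Q1 2019.
Approach:
Start with the filter: Q1 2019 means January through March 2019. How to measure popularity and value is open to interpretation; here we use the number of distinct buyers and the total amount paid for each product category.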

SELECT goods_category,
count(distinct user_name) as user_num,
sum(pay_amount) as total_amount
FROM user_trade
WHERE dt between '2019-01-01' and '2019-03-31'
GROUP BY goods_category;

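GROUP BY is what splits the result by product category here.
distinct removes duplicates, count counts rows, sum adds values up, and between ... and ... matches a range that includes both endpoints.
The commonly used aggregate functions are:

count(): count; count(distinct ...): count of distinct values

sum(): sum

avg(): average

max(): maximum

min(): minimum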
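
The aggregate functions listed above can be combined in one query. The sketch below reuses the user_trade columns from the earlier examples and assumes each row is one payment record, which the original does not state explicitly:

```sql
-- Assumption: one row of user_trade is one payment record.
SELECT goods_category,
       count(*)                  AS order_num,     -- number of payment records
       count(DISTINCT user_name) AS user_num,      -- number of distinct buyers
       sum(pay_amount)           AS total_amount,  -- total paid
       avg(pay_amount)           AS avg_amount,    -- average per payment
       max(pay_amount)           AS max_amount,    -- largest single payment
       min(pay_amount)           AS min_amount     -- smallest single payment
FROM user_trade
WHERE dt = '2019-06-18'
GROUP BY goods_category;
```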

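GROUP BY …… HAVING
HAVING: filters the groups produced by GROUP BY. It filters the aggregated result, not the rows of the original table.

Requirement 4-1:
Find the users whose total payments in April 2019 exceed 50,000 yuan, so that these VIP users can be sent coupons.
Approach:
Filter to April 2019, sum the payment amount per user, and keep the users whose total is greater than 50,000.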

SELECT user_name,
sum(pay_amount) as total_amount
FROM user_trade
WHERE dt between '2019-04-01' and '2019-04-30'
GROUP BY user_name
HAVING sum(pay_amount)>50000;

HAVING filters on the result of an aggregate function.

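ORDER BY
ASC: ascending (the default when no direction is given)
DESC: descending
Sorting by several columns: ORDER BY A ASC , B DESC. The direction applies to each column separately, so state it for every column where it matters.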
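
A short sketch of a multi-column sort, using the user_trade columns from the earlier examples (the date and the limit of 20 are arbitrary choices for illustration):

```sql
-- Sort by category A-Z, and within each category by amount, largest first.
-- Each column carries its own direction.
SELECT user_name,
       goods_category,
       pay_amount
FROM user_trade
WHERE dt = '2019-06-18'
ORDER BY goods_category ASC,
         pay_amount DESC
LIMIT 20;
```

The LIMIT is not only cosmetic: in strict mode Hive requires a LIMIT on queries that use ORDER BY.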

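Requirement 4-2:
The top 5 users by payment amount in April 2019
Approach:
Filter to April 2019, sum the payments per user, and take the five users with the largest totals.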

SELECT user_name,
sum(pay_amount) as total_amount
FROM user_trade
WHERE dt between '2019-04-01' and '2019-04-30'
GROUP BY user_name
ORDER BY total_amount DESC
limit 5;

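Note:
Why does ORDER BY use total_amount instead of sum(pay_amount)?
Older Hive versions may reject ORDER BY sum(pay_amount) DESC here, so the alias is the safer choice.
Reason: execution order. ORDER BY runs after SELECT, so the alias defined in SELECT is already available for sorting.

Execution order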

FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY
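
To make the order concrete, here is a sketch that combines the clauses from requirements 4-1 and 4-2 into one query and annotates each clause with the step at which it is logically evaluated. LIMIT, which is not part of the chain above, is applied last:

```sql
-- The numbers give the logical evaluation order, not the order in which
-- the clauses are written.
SELECT user_name,                                  -- 5. build output columns and aliases
       sum(pay_amount) AS total_amount
FROM user_trade                                    -- 1. pick the source table
WHERE dt BETWEEN '2019-04-01' AND '2019-04-30'     -- 2. filter rows
GROUP BY user_name                                 -- 3. form groups
HAVING sum(pay_amount) > 50000                     -- 4. filter groups
ORDER BY total_amount DESC                         -- 6. sort, the alias is visible here
LIMIT 5;                                           -- 7. cut the sorted result
```

This is also why WHERE cannot reference total_amount: at step 2 the alias does not exist yet.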

2. Common functions:

(1) How do you convert a timestamp to a date?

In the user_trade table, the pay_time column holds Unix timestamps (seconds since 1970-01-01 00:00:00 UTC).
Example:

SELECT pay_time,
from_unixtime(pay_time,'yyyy-MM-dd HH:mm:ss')
FROM user_trade
WHERE dt='2019-04-09';

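from_unixtime(bigint unixtime, string format): converts a Unix timestamp to a date string in the given format. bigint unixtime is the timestamp column or value.
format (HH is the 24-hour clock; lowercase hh is the 12-hour clock, which is ambiguous without an AM/PM marker):

yyyy-MM-dd HH:mm:ss

yyyy-MM-dd HH

yyyy-MM-dd HH:mm

yyyyMMdd

The reverse conversion, from a date string to a timestamp, is done with unix_timestamp.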
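
A minimal sketch of unix_timestamp on literal values. The literals are made up for illustration, and the numeric results depend on the session time zone, so none are shown. Recent Hive versions accept a SELECT without FROM; on older ones, add a FROM clause on any table with LIMIT 1.

```sql
SELECT
  -- With one argument the string must be in the default pattern
  -- yyyy-MM-dd HH:mm:ss.
  unix_timestamp('2019-04-09 08:30:00')  AS ts_default,
  -- With two arguments the second one describes the format of the first.
  unix_timestamp('20190409', 'yyyyMMdd') AS ts_custom,
  -- Round trip: string -> timestamp -> string returns the original text
  -- when both conversions use the same time zone.
  from_unixtime(unix_timestamp('2019-04-09 08:30:00'),
                'yyyy-MM-dd HH:mm:ss')   AS round_trip;
```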

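(2) How do you calculate the interval between dates?

Requirement 5:
Analyze the value of last year's Labor Day new-user promotion, that is, a user-acquisition analysis.
Approach:
To analyze the Labor Day campaign, compute the number of days between each user's first activation time and May 1, 2019.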

select user_name,datediff('2019-05-01',to_date(firstactivetime))
from user_info
limit 10;

Extension: the functions that add or subtract days are date_add and date_sub (the start date is passed as a string such as '2019-05-01')

date_add(string startdate, int days)
date_sub (string startdate, int days)
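
A small sketch of the three date functions on literal dates, chosen here only for illustration. Note the argument order of datediff: the end date comes first.

```sql
SELECT date_add('2019-05-01', 7)            AS one_week_later,    -- 2019-05-08
       date_sub('2019-05-01', 7)            AS one_week_earlier,  -- 2019-04-24
       datediff('2019-05-08', '2019-05-01') AS gap_days;          -- 7
```

Swapping the two arguments of datediff gives -7, which is a common source of sign errors.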

(3) Conditional functions:

`case when`

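Requirement 6:
Analyze users by age group and look at the distribution.
Approach:
Count the users in four age groups: under 20 ('20岁以下'), 20-30 ('20-30岁'), 30-40 ('30-40岁'), and 40 and over ('40岁以上').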

SELECT case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end as age_type,
count(distinct user_id) user_num
FROM user_info
GROUP BY case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end;
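
The query above has to repeat the whole CASE expression in GROUP BY. One possible alternative, sketched here as a design choice rather than as the original author's method, is to compute the label once in a subquery and group by the alias in the outer query:

```sql
-- The inner query assigns each user an age group once;
-- the outer query groups by that alias.
SELECT a.age_type,
       count(DISTINCT a.user_id) AS user_num
FROM (
    SELECT user_id,
           CASE WHEN age < 20 THEN '20岁以下'
                WHEN age >= 20 AND age < 30 THEN '20-30岁'
                WHEN age >= 30 AND age < 40 THEN '30-40岁'
                ELSE '40岁以上'
           END AS age_type
    FROM user_info
) a
GROUP BY a.age_type;
```

Both forms should return the same groups; the subquery version keeps the bucket boundaries in one place, so changing them cannot leave SELECT and GROUP BY out of sync.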

`if`

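Requirement 7:
Last year Wang Sicong's Weibo lottery caused controversy, and we want to see how user level is distributed by sex.
Approach:
For each sex, count how many users are high level and how many are low level (a level greater than 5 counts as high; the query labels the two groups '高' and '低').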

SELECT sex,
if(level>5,'高','低') as level_type,
count(distinct user_id) user_num
FROM user_info
GROUP BY sex,
if(level>5,'高','低');

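(4) String functions:

Requirement 8:
Analyze new-user acquisition month by month, which lets you work back to how well operations performed.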

SELECT substr(firstactivetime,1,7) as month,
count(distinct user_id) user_num
FROM user_info
GROUP BY substr(firstactivetime,1,7);

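substr(string A, int start, int len): extracts a substring; start is 1-based
Note: if the length is omitted, the substring runs from the start position to the end of the string.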
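
A quick sketch of both forms of substr on a literal. The sample value is made up and only assumes that firstactivetime looks like 'yyyy-MM-dd HH:mm:ss':

```sql
SELECT substr('2018-03-15 10:20:30', 1, 7)  AS month_part,  -- 2018-03
       substr('2018-03-15 10:20:30', 1, 10) AS date_part,   -- 2018-03-15
       substr('2018-03-15 10:20:30', 12)    AS time_part;   -- 10:20:30
```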

Requirement 9:
Number of users per phone brand:

extra1(string):
{"systemtype":"ios","education":"master","marriage_status":"1","phonebrand":"iphone X"}
extra2(map<string,string>):
{"systemtype":"ios","education":"master","marriage_status":"1","phonebrand":"iphone X"}
These are two extension columns in the table: extra1 holds the data as a JSON string, and extra2 holds the same data as a map.

Method 1: extract the value from the JSON string

SELECT get_json_object(extra1, '$.phonebrand') as phone_brand,
count(distinct user_id) user_num
FROM user_info
GROUP BY get_json_object(extra1, '$.phonebrand');

get_json_object(string json_string, string path)
param1: the JSON string (column) to parse
param2: the path of the value to return, written as $.key
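
To see what get_json_object returns without touching the table, it can be called on a literal. The JSON below copies the shape of the extra1 sample; the key no_such_key is hypothetical and only shows the behavior for a missing key:

```sql
SELECT
  get_json_object('{"systemtype":"ios","marriage_status":"1","phonebrand":"iphone X"}',
                  '$.phonebrand')      AS phone_brand,      -- iphone X
  get_json_object('{"systemtype":"ios","marriage_status":"1","phonebrand":"iphone X"}',
                  '$.marriage_status') AS marriage_status,  -- 1 (returned as a string)
  get_json_object('{"systemtype":"ios","marriage_status":"1","phonebrand":"iphone X"}',
                  '$.no_such_key')     AS missing_key;      -- NULL
```

Because the return type is a string, comparisons with numbers rely on implicit conversion.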

Method 2: look the value up by key, for map-typed data

SELECT extra2['phonebrand'] as phone_brand,
count(distinct user_id) user_num
FROM user_info
GROUP BY extra2['phonebrand'];

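(5) Aggregate functions:

Exercise:
For the user ELLA, find the average amount per payment in 2018 and the number of days between the latest and the earliest payment date in 2018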

SELECT avg(pay_amount) as avg_amount,
datediff(max(from_unixtime(pay_time,'yyyy-MM-dd')),
min(from_unixtime(pay_time,'yyyy-MM-dd')))
FROM user_trade
WHERE year(dt)='2018' and user_name='ELLA';

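The two forms below give the same date, because formatting preserves the order of the timestamps:
max(from_unixtime(pay_time,'yyyy-MM-dd')) = from_unixtime(max(pay_time),'yyyy-MM-dd')
datediff compares dates, so the timestamps must be converted first; datediff(max(pay_time),min(pay_time)) on the raw timestamps does not give the day interval.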

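`Exercises:`

Requirement 10:
Find the users with VIP potential in 2018 so they can be sent VIP trial coupons.
Approach:
Users who buy from many product categories are more likely to become VIP users, so count the users who bought from more than two product categories in 2018.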

SELECT count(a.user_name)
FROM
(SELECT user_name,
count(distinct goods_category) as category_num
FROM user_trade
WHERE year(dt)='2018'
GROUP BY user_name
HAVING count(distinct goods_category)>2) a;

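How the query works:
Three steps:
Step 1: compute the number of product categories each user bought
Step 2: keep the users who bought more than 2 categories
Step 3: count how many users meet the condition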

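Requirement 11:
Marital-status distribution of users who activated in 2018 and are in the 20-30 or 30-40 age group.
Approach and steps:
Step 1: select the users whose activation time is in 2018, compute each one's age group, and extract the marital status
Step 2: keep the users in the 20-30 and 30-40 age groups and turn the marital-status code into a readable label ('已婚' for married, '未婚' for unmarried)
Step 3: aggregate by age group and marital status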

SELECT a.age_type,
if(a.marriage_status=1,'已婚','未婚'),
count(distinct a.user_id)
FROM
(SELECT case when age<20 then '20岁以下'
when age>=20 and age<30 then '20-30岁'
when age>=30 and age<40 then '30-40岁'
else '40岁以上' end as age_type,
get_json_object(extra1, '$.marriage_status') as
marriage_status,
user_id
FROM user_info
WHERE to_date(firstactivetime) between '2018-01-01'
and '2018-12-31') a
WHERE a.age_type in ('20-30岁','30-40岁')
GROUP BY a.age_type,
if(a.marriage_status=1,'已婚','未婚');

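Requirement 12:
Distribution by sex of users who activated more than 300 days ago (the query uses 2020-03-23 as the current date).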

select sex,count(distinct user_name)
from user_info where datediff('2020-03-23',to_date(firstactivetime)) > 300
group by sex;

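Requirement 13:
Distribution of users by sex and education level. The first query reads the JSON column extra1 and the second reads the map column extra2.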

select sex,get_json_object(extra1,'$.education') as user_edu,count(distinct user_name) as user_num
from user_info
group by sex,get_json_object(extra1,'$.education');

select sex,extra2['education'],count(distinct user_id)
from user_info
group by sex,extra2['education'];

Requirement 14:
Payment amount by product category for each hour of the day, from January 1, 2019 to April 30, 2019.

select substr(from_unixtime(pay_time,'yyyy-MM-dd HH'),12) as user_time,goods_category,sum(pay_amount)
from user_trade
where dt between '2019-01-01' and '2019-04-30'
group by substr(from_unixtime(pay_time,'yyyy-MM-dd HH'),12),goods_category;

Summary:
1. Use group by for aggregate calculations.
2. Use order by for sorting.
3. Keep the SQL execution order in mind.
4. Combine the common functions as needed.

糖潮丽子~辣丽
