【Sqoop】Using Hive and Sqoop to Compute the Basic Website Metrics PV and UV

【Requirements】Compute the PV (page views) and UV (unique visitors) for each hour of a day on a website.
【Approach】
(1) Create a multi-level partitioned table: partition by day at the first level and by hour at the second.
(2) Extract the time fields (day and hour) by cleaning the timestamp format: e.g. from 2015-08-28 18:10:00, extract the date and the hour.
The fields needed for the PV and UV statistics are: id, url, guid, trackTime.
(3) Analyze the data with SELECT queries.
(4) Export the PV and UV results with Sqoop.
【Implementation】
(1) Create the database:

create database track_log;

(2) Create the source table:

create table yhd_source(
id              string,
url             string,
referer         string,
keyword         string,
type            string,
guid            string,
pageId          string,
moduleId        string,
linkId          string,
attachedInfo    string,
sessionId       string,
trackerU        string,
trackerType     string,
ip              string,
trackerSrc      string,
cookie          string,
orderCode       string,
trackTime       string,
endUserId       string,
firstLink       string,
sessionViewNo   string,
productId       string,
curMerchantId   string,
provinceId      string,
cityId          string,
fee             string,
edmActivity     string,
edmEmail        string,
edmJobId        string,
ieVersion       string,
platform        string,
internalKeyword string,
resultSum       string,
currentPage     string,
linkPosition    string,
buttonPosition  string
)
row format delimited fields terminated by '\t'
stored as textfile;

Note: Shift+Alt enables column (block) editing in many editors, which helps when typing long column lists like the one above.
(3) Load the data:

load data local inpath '/opt/datas/2015082818' into table yhd_source;
load data local inpath '/opt/datas/2015082819' into table yhd_source;

(4) Create a partitioned table, using static partitioning:

create table yhd_part1(
id string,
url string,
guid string
)
partitioned by (date string,hour string)
row format delimited fields terminated by '\t';

(5) Load data into the partitioned table from the cleaned table yhd_qingxi (created in step (6) below):

insert into table yhd_part1 partition (date='20150828',hour='18') select id,url,guid from yhd_qingxi where date='28' and hour='18';
insert into table yhd_part1 partition (date='20150828',hour='19') select id,url,guid from yhd_qingxi where date='28' and hour='19';

Query the data in the static partitioned table:

select id,date,hour from yhd_part1 where date='20150828' and hour='18';

(6) Create a cleaning table that holds the date and hour fields extracted from the timestamp:

create table yhd_qingxi(
id string,
url string,
guid string,
date string,
hour string
)
row format delimited fields terminated by '\t';

(7) Extract the day and hour substrings from trackTime:

insert into table yhd_qingxi select id,url,guid,substring(trackTime,9,2) date,substring(trackTime,12,2) hour from yhd_source;
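The substring logic above can be sketched in Python; note that Hive's substring(str, pos, len) is 1-based, so substring(trackTime, 9, 2) corresponds to the 0-based Python slice [8:10]. The sample timestamp is taken from the article:

```python
# Sketch of the substring extraction in step (7), assuming trackTime
# values look like '2015-08-28 18:10:00'.
track_time = "2015-08-28 18:10:00"

date = track_time[8:10]   # Hive substring(trackTime, 9, 2): the day of month
hour = track_time[11:13]  # Hive substring(trackTime, 12, 2): the hour

print(date, hour)  # → 28 18
```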

(8) Dynamic partitioning, which requires the following Hive settings:

<property>
		<name>hive.exec.dynamic.partition</name>
		<value>true</value>
		<description>Whether or not to allow dynamic partitions in DML/DDL.</description>
</property>

The default is true, meaning dynamic partitioning is allowed.

<property>
		<name>hive.exec.dynamic.partition.mode</name>
		<value>strict</value>
		<description>In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.</description>
</property>

Switch to non-strict mode with: set hive.exec.dynamic.partition.mode=nonstrict;
(9) Create the dynamic-partition table:

create table yhd_part2(
id string,
url string,
guid string
)
partitioned by (date string,hour string)
row format delimited fields terminated by '\t';

Run the dynamic-partition insert: insert into table yhd_part2 partition (date,hour) select * from yhd_qingxi;
Note: instead of select * you can list the columns explicitly, but all of them must be listed.
(10) With select *, Hive reads the full rows from yhd_qingxi and matches the trailing date and hour columns against the partition keys. The static equivalent for a single partition is:

insert into table yhd_part2 partition (date='20150828',hour='19') select id,url,guid from yhd_qingxi where date='28' and hour='19';

(11) Compute the PV statistic:

select date,hour,count(url) PV from yhd_part2 group by date,hour;

Grouped by day and hour, the result is:

+-----------+-------+--------+--+
|   date    | hour  |   pv   |
+-----------+-------+--------+--+
| 20150828  | 18    | 64972  |
| 20150828  | 19    | 61162  |
+-----------+-------+--------+--+
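The PV query counts one view per url record in each (date, hour) group. A minimal Python sketch of the same aggregation, over a few hypothetical cleaned rows of the form (id, url, guid, date, hour):

```python
from collections import Counter

# Hypothetical rows mirroring the yhd_qingxi schema: (id, url, guid, date, hour)
rows = [
    ("1", "/home", "g1", "20150828", "18"),
    ("2", "/item", "g1", "20150828", "18"),
    ("3", "/home", "g2", "20150828", "19"),
]

# PV = count(url) grouped by (date, hour): every non-null url is one page view
pv = Counter((date, hour) for _, url, _, date, hour in rows if url)
print(dict(pv))  # → {('20150828', '18'): 2, ('20150828', '19'): 1}
```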

(12) Compute the UV statistic:

select date,hour,count(distinct guid) UV from yhd_part1 group by date,hour; 

The result:

+-----------+-------+--------+--+
|   date    | hour  |   uv   |
+-----------+-------+--------+--+
| 20150828  | 18    | 23938  |
| 20150828  | 19    | 22330  |
+-----------+-------+--------+--+

Site visitors are either guests or registered members. Both guests and members carry a guid, while endUserId exists only for members (users logged in with an account), not for guests. UV therefore counts distinct guid values.
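The distinct-count semantics of count(distinct guid) can be sketched in Python by collecting guids into a set per (date, hour) group; rows here are hypothetical samples in the (id, url, guid, date, hour) shape of the cleaned table:

```python
# UV per (date, hour): the number of distinct guid values, mirroring
# count(distinct guid) from step (12).
rows = [
    ("1", "/home", "g1", "20150828", "18"),
    ("2", "/item", "g1", "20150828", "18"),  # same visitor, same hour: no new UV
    ("3", "/home", "g2", "20150828", "18"),
]

visitors = {}  # (date, hour) -> set of guids seen in that hour
for _, _, guid, date, hour in rows:
    visitors.setdefault((date, hour), set()).add(guid)

uv = {key: len(guids) for key, guids in visitors.items()}
print(uv)  # → {('20150828', '18'): 2}
```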
(13) Compute PV and UV together:

create table if not exists result as select date,hour,count(url) PV ,count(distinct guid) UV from yhd_part1 group by date,hour; 

The result:

+--------------+--------------+------------+------------+--+
| result.date  | result.hour  | result.pv  | result.uv  |
+--------------+--------------+------------+------------+--+
| 20150828     | 18           | 64972      | 23938      |
| 20150828     | 19           | 61162      | 22330      |
+--------------+--------------+------------+------------+--+

(14) Export the results to a MySQL table.
1) First create a table in MySQL to hold the result set:

create table if not exists save(
date varchar(30) not null,
hour varchar(30) not null,
pv varchar(30) not null,
uv varchar(30) not null,
primary key(date,hour)
);

2) Export to MySQL with Sqoop:

bin/sqoop export \
--connect \
jdbc:mysql://bigdata-senior01.ibeifeng.com:3306/sqoop \
--username root \
--password 123456 \
--table save \
--export-dir /user/hive/warehouse/track_log.db/result \
--num-mappers 1 \
--input-fields-terminated-by '\001'

Note: Hive's default field delimiter in text files is \001, which is why the export specifies --input-fields-terminated-by '\001'.
3) The query result in MySQL:
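To see why that delimiter option matters, here is a sketch of what a line of the result table looks like on HDFS and how it splits on the \001 (Ctrl-A) control character; the sample values are taken from the result table above:

```python
# One line of the Hive `result` table as stored in text format on HDFS:
# fields are separated by the control character \001 (Ctrl-A), not tabs.
line = "20150828\x0118\x0164972\x0123938"

date, hour, pv, uv = line.split("\x01")
print(date, hour, pv, uv)  # → 20150828 18 64972 23938
```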

+----------+------+-------+-------+
| date     | hour | pv    | uv    |
+----------+------+-------+-------+
| 20150828 | 18   | 64972 | 23938 |
| 20150828 | 19   | 61162 | 22330 |
+----------+------+-------+-------+


Reposted from blog.csdn.net/gongxifacai_believe/article/details/95241703