1. Filter data early to reduce the intermediate data carried into the join
For example:
select ... from A join B on A.key=B.key
where A.userid > 10 and B.userid < 10 and A.dt='20120417' and B.dt='20120417'
can be rewritten as:
select ... from ( select ... from A where dt='20120417' and userid > 10) a join
( select ... from B where dt='20120417' and userid < 10) b on a.key=b.key
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
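2. Use map join with caution
Put the small table on the left side of the join to avoid heavy disk and memory consumption. In a regular join Hive buffers the rows of the earlier tables and streams the last one, so the largest table should come last.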
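A minimal sketch of the join-order rule, assuming two hypothetical tables: small_dim (a few thousand rows) and big_fact (very large), joined on key. The STREAMTABLE hint is Hive's way of naming the streamed table explicitly when reordering the query is inconvenient.

```sql
-- small_dim and big_fact are hypothetical table names used for illustration.
-- Small table first (buffered), large table last (streamed through the reducers).
select s.key, s.name, b.value
from small_dim s
join big_fact b on s.key = b.key;

-- Alternative: keep any table order and mark the large table as the streamed one.
select /*+ STREAMTABLE(b) */ s.key, s.name, b.value
from big_fact b
join small_dim s on s.key = b.key;
```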
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
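3. Use Cartesian products with caution
A Cartesian product runs with only one reduce task, so it is slow; the job may never finish or may bring the node down.
SQL of the following forms produces a Cartesian product:
select * from gbk, utf8 where gbk.key = utf8.key and gbk.key > 10;
select * from gbk join utf8 where gbk.key = utf8.key and gbk.key > 10;
Correct join form:
select * from gbk join utf8 on gbk.key = utf8.key where gbk.key > 10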
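One way to catch accidental Cartesian products before they run is Hive's strict mode. The sketch below assumes a Hive version that honors hive.mapred.mode; newer releases expose a separate hive.strict.checks.cartesian.product switch instead, so check which one applies to your installation. Strict mode also enforces other rules (for example, a partition filter on partitioned tables), so it assumes gbk and utf8 are not partitioned.

```sql
-- In strict mode Hive refuses to launch a join it plans as a Cartesian product.
set hive.mapred.mode=strict;

-- A join written without an ON clause is the kind of query strict mode is meant to reject:
-- select * from gbk join utf8 where gbk.key = utf8.key and gbk.key > 10;

-- Accepted: the join key is declared in the ON clause.
select * from gbk join utf8 on gbk.key = utf8.key where gbk.key > 10;
```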
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
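4. Column pruning and partition pruning
Column pruning:
Only the columns the query needs are read.
select a,b from t where e < 10
t has five columns (a,b,c,d,e); columns c and d are ignored and only a, b and e are read.
This option is enabled by default: hive.optimize.cp=true
Partition pruning: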
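A minimal sketch of partition pruning, assuming a hypothetical table page_views partitioned by a string column dt: a filter on the partition column lets Hive read only the matching partition directories rather than the whole table.

```sql
-- page_views is a hypothetical table used for illustration.
create table page_views (userid bigint, url string)
partitioned by (dt string);

-- Only the dt='20120417' partition is scanned; other partitions are skipped.
select userid, url
from page_views
where dt = '20120417' and userid > 10;
```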
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
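5. Replace union all with join
When the results of several datasets need to be merged, use join instead of union all wherever possible, because union all usually requires padding the missing columns with large numbers of zeros, which adds load on the system.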
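To make the zero padding concrete, here is a sketch assuming two hypothetical tables, login_stats(uin, login_cnt) and pay_stats(uin, pay_cnt), each holding at most one row per uin. Under that assumption the two queries return the same rows.

```sql
-- union all form: every branch must pad the columns it does not own with 0,
-- and a group by is needed afterwards to fold the rows together.
select uin, sum(login_cnt) as login_cnt, sum(pay_cnt) as pay_cnt
from (
  select uin, login_cnt, 0 as pay_cnt from login_stats
  union all
  select uin, 0 as login_cnt, pay_cnt from pay_stats
) u
group by uin;

-- join form: no padded rows are produced; missing sides are filled with coalesce.
select coalesce(l.uin, p.uin) as uin,
       coalesce(l.login_cnt, 0) as login_cnt,
       coalesce(p.pay_cnt, 0) as pay_cnt
from login_stats l
full outer join pay_stats p on l.uin = p.uin;
```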
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
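6. Faster global count/sum distinct on large data volumes
Statement A:
select count(distinct uin) login_uins
from tabled where ftime >= 20121001 and ftime <= 20121001
Statement B:
select count(uin) login_uins
from (
select distinct uin from tabled where ftime >= 20121001 and ftime <= 20121001 ) subq
Statement A runs as 1 MR job with only 1 reduce task, which has to read and process a large amount of data and is therefore slow; with a large data volume the node running that reduce task may go down.
Statement B deduplicates first and then counts globally. For a count distinct over the whole dataset, prefer Statement B (the subquery form).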
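The deduplication step can also be written with group by, a common alternative to select distinct in the subquery. The sketch reuses tabled, uin and ftime from the statements above; the sum variant assumes a hypothetical numeric column amount.

```sql
-- Same result as the subquery form: deduplicate uin with group by, then count.
-- count(uin) skips a NULL group, matching count(distinct uin).
select count(uin) login_uins
from (
  select uin
  from tabled
  where ftime >= 20121001 and ftime <= 20121001
  group by uin
) subq;

-- sum(distinct amount) rewritten the same way; amount is a hypothetical column.
select sum(amount) total_amount
from (
  select amount
  from tabled
  where ftime >= 20121001 and ftime <= 20121001
  group by amount
) subq;
```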
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
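7. When a small table is involved, check whether mapjoin can be used
a. The small table in the join should preferably have no more than 20,000 rows
b. The join type is inner join, right outer join (the small table must not be the right table), left outer join (the small table must not be the left table), or left semi join
Example:
SELECT /*+ MAPJOIN(smalltable) */ smalltable.key, value
FROM smalltable JOIN bigtable ON smalltable.key = bigtable.key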
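As a sketch of rule b, the left outer join below keeps bigtable on the left so that smalltable can be loaded into memory. The second part shows automatic conversion: hive.auto.convert.join and hive.mapjoin.smalltable.filesize are Hive settings, but the threshold value here is only illustrative. The value column is assumed to live in smalltable, and some Hive versions ignore the MAPJOIN hint unless hive.ignore.mapjoin.hint is set to false.

```sql
-- Left outer join: the big table is on the left, the small table on the right.
-- value is assumed to be a column of smalltable.
SELECT /*+ MAPJOIN(smalltable) */ bigtable.key, smalltable.value
FROM bigtable LEFT OUTER JOIN smalltable ON bigtable.key = smalltable.key;

-- Let Hive decide: joins whose small side is below the size threshold (in bytes)
-- are converted to map joins automatically. The threshold is an illustrative value.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;
SELECT bigtable.key, smalltable.value
FROM bigtable JOIN smalltable ON bigtable.key = smalltable.key;
```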
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
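8. Processing the same data in multiple ways
Hive provides a distinctive syntax that produces several aggregations from a single data source without rescanning the source for each one.
insert overwrite table sales
select * from history where action='purchased';
insert overwrite table credits
select * from history where action='returned';
The queries above are syntactically correct but inefficient. They can be optimized as follows, which scans the history table only once:
from history
insert overwrite table sales select * where action='purchased'
insert overwrite table credits select * where action='returned';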
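The same single-scan form also works when each branch aggregates. A sketch, assuming history has a userid column and that sales_by_user and returns_by_user are hypothetical target tables with columns (userid, cnt):

```sql
-- One scan of history feeds both aggregations.
-- sales_by_user and returns_by_user are hypothetical target tables.
from history
insert overwrite table sales_by_user
  select userid, count(1) where action='purchased' group by userid
insert overwrite table returns_by_user
  select userid, count(1) where action='returned' group by userid;
```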
More tips will be added over time.