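When joining, watch out for accidental many-to-many (cross-join-like) matches; in some scenarios they multiply the number of rows.
Typical scenario: a user clicks a product and then places an order. Apply GROUP BY or DISTINCT to the user's product clicks first; otherwise, if the user acted on the product several times, each order is repeated once per click.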
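The fan-out described above can be reproduced with a small sketch. It uses Python's built-in sqlite3 as a stand-in for the real engine, and the `clicks` and `orders` tables are hypothetical, not the tables used later in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clicks (uid INTEGER, item_id INTEGER);
CREATE TABLE orders (order_id INTEGER, uid INTEGER, item_id INTEGER);
-- user 1 clicked item 10 three times, then ordered it once
INSERT INTO clicks VALUES (1, 10), (1, 10), (1, 10);
INSERT INTO orders VALUES (100, 1, 10);
""")

# Joining the raw clicks repeats the single order once per click.
naive = conn.execute("""
    SELECT o.order_id
    FROM clicks c
    JOIN orders o ON c.uid = o.uid AND c.item_id = o.item_id
""").fetchall()

# Deduplicating the clicks first keeps one row per order.
deduped = conn.execute("""
    SELECT o.order_id
    FROM (SELECT DISTINCT uid, item_id FROM clicks) c
    JOIN orders o ON c.uid = o.uid AND c.item_id = o.item_id
""").fetchall()

print(len(naive), len(deduped))
```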
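How inner join executes
The following two SQL statements have the same execution logic; in other words, the two ways of writing it are fully equivalent.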
SELECT *
,UNIX_TIMESTAMP(t2.create_time)
FROM dwd_t_action t1
JOIN dwd_t_order t2
ON t1.uid = t2.from_uid
AND t1.target_roomid = t2.room_id
AND t1.local_timestamp / 1000 < UNIX_TIMESTAMP(t2.create_time)
WHERE t1.dt = '${bdp.system.bizdate}'
AND t1.action_type = 1
AND t2.dt = '${bdp.system.bizdate}'
AND t2.STATUS = 1
AND t2.money > 0
;
SELECT *
,UNIX_TIMESTAMP(t2.create_time)
FROM dwd_t_action t1
JOIN dwd_t_order t2
ON t1.uid = t2.from_uid
AND t1.target_roomid = t2.room_id
WHERE t1.dt = '${bdp.system.bizdate}'
AND t1.action_type = 1
AND t2.dt = '${bdp.system.bizdate}'
AND t2.STATUS = 1
AND t2.money > 0
AND t1.local_timestamp / 1000 < UNIX_TIMESTAMP(t2.create_time)
;
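Partition pruning comes first. The M1 and M2 tasks
each filter on their own table's columns first,
then the two tables are joined,
and finally the remaining condition is applied: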
t1.local_timestamp / 1000 < UNIX_TIMESTAMP(t2.create_time)
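As a rough check of the claim that the ON and WHERE placements are equivalent for an inner join, here is a sketch using sqlite3 rather than the real engine. The tables are cut-down, hypothetical versions of the ones above, and `create_ts` (epoch seconds) replaces `UNIX_TIMESTAMP(t2.create_time)`, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_action (uid INTEGER, target_roomid INTEGER, local_timestamp INTEGER);
CREATE TABLE t_order (from_uid INTEGER, room_id INTEGER, create_ts INTEGER);
-- local_timestamp is in milliseconds, create_ts in seconds
INSERT INTO t_action VALUES (1, 7, 5000), (2, 7, 9000);
INSERT INTO t_order VALUES (1, 7, 8), (2, 7, 8);
""")

# Time condition written inside ON.
in_on = conn.execute("""
    SELECT t1.uid
    FROM t_action t1
    JOIN t_order t2
      ON t1.uid = t2.from_uid
     AND t1.target_roomid = t2.room_id
     AND t1.local_timestamp / 1000 < t2.create_ts
    ORDER BY t1.uid
""").fetchall()

# Same condition moved to WHERE.
in_where = conn.execute("""
    SELECT t1.uid
    FROM t_action t1
    JOIN t_order t2
      ON t1.uid = t2.from_uid
     AND t1.target_roomid = t2.room_id
    WHERE t1.local_timestamp / 1000 < t2.create_ts
    ORDER BY t1.uid
""").fetchall()

print(in_on == in_where)
```

This only compares results on toy data; it says nothing about how a particular engine schedules its tasks.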
left join / right join
Step 1: change inner to left
SELECT *
,UNIX_TIMESTAMP(t2.create_time)
FROM dwd_t_action t1
LEFT JOIN dwd_t_order t2
ON t1.uid = t2.from_uid
AND t1.target_roomid = t2.room_id
AND t1.local_timestamp / 1000 < UNIX_TIMESTAMP(t2.create_time)
WHERE t1.dt = '${bdp.system.bizdate}'
AND t1.action_type = 1
AND t2.dt = '${bdp.system.bizdate}'
AND t2.STATUS = 1
AND t2.money > 0
;
SELECT *
,UNIX_TIMESTAMP(t2.create_time)
FROM dwd_t_action t1
LEFT JOIN dwd_t_order t2
ON t1.uid = t2.from_uid
AND t1.target_roomid = t2.room_id
WHERE t1.dt = '${bdp.system.bizdate}'
AND t1.action_type = 1
AND t2.dt = '${bdp.system.bizdate}'
AND t2.STATUS = 1
AND t2.money > 0
AND t1.local_timestamp / 1000 < UNIX_TIMESTAMP(t2.create_time)
;
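At this point the execution logic is the same as above: the WHERE conditions on t2 columns discard the NULL-padded rows, so the LEFT JOIN behaves like the inner join.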
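A small sketch of why the results match, again using sqlite3 as a stand-in with hypothetical, simplified tables: once WHERE tests a right-table column, the rows a LEFT JOIN would pad with NULL no longer survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_action (uid INTEGER);
CREATE TABLE t_order (id INTEGER, from_uid INTEGER, status INTEGER);
INSERT INTO t_action VALUES (1), (2), (3);
-- uid 1 has a valid order, uid 2 an invalid one, uid 3 none
INSERT INTO t_order VALUES (100, 1, 1), (101, 2, 0);
""")

inner_rows = conn.execute("""
    SELECT t1.uid, t2.id
    FROM t_action t1
    JOIN t_order t2 ON t1.uid = t2.from_uid
    WHERE t2.status = 1
    ORDER BY t1.uid
""").fetchall()

# NULL = 1 is not true, so the padded row for uid 3 is dropped by WHERE.
left_rows = conn.execute("""
    SELECT t1.uid, t2.id
    FROM t_action t1
    LEFT JOIN t_order t2 ON t1.uid = t2.from_uid
    WHERE t2.status = 1
    ORDER BY t1.uid
""").fetchall()

print(inner_rows, left_rows)
```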
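Step 2: add a filter on a non-join column
So far inner join and left join look much the same. But an inner join never produces NULLs, whereas a left join keeps every left-side row and pads the right side with NULL. So let's add t2.id IS NULL.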
SELECT *
,UNIX_TIMESTAMP(t2.create_time)
FROM dwd_t_action t1
LEFT JOIN dwd_t_order t2
ON t1.uid = t2.from_uid
AND t1.target_roomid = t2.room_id
WHERE t1.dt = '${bdp.system.bizdate}'
AND t1.action_type = 1
AND t2.dt = '${bdp.system.bizdate}'
AND t2.STATUS = 1
AND t2.money > 0
AND t1.local_timestamp / 1000 < UNIX_TIMESTAMP(t2.create_time)
AND t2.id IS NULL
;
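Now look at M1 again.
The t2.id IS NULL filter is applied first, in M1, and the join only happens afterwards, so the result is empty. (The other WHERE conditions on t2 would reject the NULL-padded rows in any case.)
A left join returns every left-side row and pads the right side with NULL where there is no match, so any further step, such as an IS NULL check, has to operate on the joined result table.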
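The empty result can be reproduced on toy data. This sketch uses sqlite3 with hypothetical, simplified tables; it shows only the result, not the task plan of the real engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_action (uid INTEGER);
CREATE TABLE t_order (id INTEGER, from_uid INTEGER, status INTEGER);
INSERT INTO t_action VALUES (1), (2), (3);
INSERT INTO t_order VALUES (100, 1, 1), (101, 2, 0);
""")

# Matched rows have a non-NULL id; unmatched rows have a NULL status.
# No row can satisfy both conditions at once.
rows = conn.execute("""
    SELECT t1.uid
    FROM t_action t1
    LEFT JOIN t_order t2 ON t1.uid = t2.from_uid
    WHERE t2.status = 1
      AND t2.id IS NULL
""").fetchall()

print(rows)
```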
Step 3: switch to filtering on the join columns
Written this way, however, the behavior is different.
SELECT *
,UNIX_TIMESTAMP(t2.create_time)
FROM (
SELECT *
FROM dwd_t_action t1
WHERE t1.dt = '${bdp.system.bizdate}'
AND t1.action_type = 1
) t1
LEFT JOIN (
SELECT *
FROM dwd_t_order t2
WHERE t2.dt = '${bdp.system.bizdate}'
AND t2.STATUS = 1
AND t2.money > 0
) t2
ON t1.uid = t2.from_uid
AND t1.target_roomid = t2.room_id
AND t1.local_timestamp / 1000 < UNIX_TIMESTAMP(t2.create_time)
WHERE t2.to_user_id IS NULL
;
You can see that the M1 task no longer applies the IS NULL filter on t2;
it is applied after the merge.
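The pattern of filtering each side in a subquery, joining, and only then testing for NULL can be sketched as follows. sqlite3 stands in for the real engine and the tables are hypothetical, simplified versions of the ones above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_action (uid INTEGER);
CREATE TABLE t_order (id INTEGER, from_uid INTEGER, status INTEGER);
INSERT INTO t_action VALUES (1), (2), (3);
-- uid 1 has a valid order, uid 2 an invalid one, uid 3 none
INSERT INTO t_order VALUES (100, 1, 1), (101, 2, 0);
""")

# The status filter runs inside the subquery, before the join;
# the IS NULL test runs on the joined result.
no_valid_order = conn.execute("""
    SELECT t1.uid
    FROM t_action t1
    LEFT JOIN (
        SELECT * FROM t_order WHERE status = 1
    ) t2
      ON t1.uid = t2.from_uid
    WHERE t2.id IS NULL
    ORDER BY t1.uid
""").fetchall()

print(no_valid_order)
```

Here uid 2 and uid 3 come back: neither has an order that passes the subquery filter.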
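Step 4: test the usual way of writing it
Now write it like this, with one more condition added to the outer WHERE, and you will find that the condition is applied in M1.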
SELECT *
,UNIX_TIMESTAMP(t2.create_time)
FROM (
SELECT *
FROM dwd_t_action t1
WHERE t1.dt = '${bdp.system.bizdate}'
AND t1.action_type = 1
) t1
LEFT JOIN (
SELECT *
FROM dwd_t_order t2
WHERE t2.dt = '${bdp.system.bizdate}'
AND t2.STATUS = 1
) t2
ON t1.uid = t2.from_uid
AND t1.target_roomid = t2.room_id
AND t1.local_timestamp / 1000 < UNIX_TIMESTAMP(t2.create_time)
WHERE t2.to_user_id IS NULL
AND t2.money > 0
;
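Since no way of writing the join changes how the tasks are optimized, I suggest writing it as in step 3 wherever possible; the code is clearer.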
left semi join
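JOIN ... ON is a common join (shuffle join / reduce join), whereas LEFT SEMI JOIN can be thought of as a variant of map join (broadcast join).
In a reduce-side join, the volume of data moved across machines is very large, which becomes a bottleneck for the join. If the records that will not take part in the join can be filtered out on the map side, network I/O drops substantially and execution gets faster.
The implementation is simple: pick a small table, say File1, extract its join keys, and save them to a file File3. File3 is usually small enough to fit in memory. In the map phase, use DistributedCache to copy File3 to every TaskTracker, then drop the records of File2 whose key is not in File3. The remaining reduce-phase work is the same as in a reduce-side join.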
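The key-filtering idea can be sketched in plain Python. This is an in-process simulation, not Hadoop code: there is no DistributedCache here, and the helper names `extract_join_keys` and `map_side_filter` are made up for illustration.

```python
def extract_join_keys(small_table, key):
    """Build the File3 equivalent: the set of join keys of the small table."""
    return {row[key] for row in small_table}


def map_side_filter(big_table, key, allowed_keys):
    """Map-phase step: keep only records whose key can possibly join."""
    for row in big_table:
        if row[key] in allowed_keys:
            yield row


# File1: the small table. File2: the large table.
file1 = [{"id": 1, "name": "a"}, {"id": 3, "name": "c"}]
file2 = [{"id": 1, "v": 10}, {"id": 2, "v": 20},
         {"id": 3, "v": 30}, {"id": 4, "v": 40}]

file3 = extract_join_keys(file1, "id")
survivors = list(map_side_filter(file2, "id", file3))

# Only these records would be shuffled to the reduce phase.
print(survivors)
```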
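The right-hand table may only be referenced in the ON clause; it cannot be filtered in WHERE, SELECT, or anywhere else, and the result contains only left-table columns.
Also, when the sub-table (tmall_data_fdi_dim_main_auc) contains duplicate rows, JOIN ... ON produces two records from tables A and B, because both rows satisfy the ON condition;
whereas with LEFT SEMI JOIN, a row of table A is returned as soon as it finds a match in table B, and B is not searched any further. So duplicates in B do not produce duplicate output rows.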
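The difference in duplicate handling can be seen in a small sketch. SQLite has no LEFT SEMI JOIN, so the semi-join is emulated here with EXISTS, which has the same "stop at the first match" semantics; tables `a` and `b` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER);
INSERT INTO a VALUES (1), (2);
-- b holds the same key twice
INSERT INTO b VALUES (1), (1);
""")

# A plain join emits one output row per matching pair.
joined = conn.execute("""
    SELECT a.id FROM a JOIN b ON a.id = b.id
""").fetchall()

# Semi-join emulation: each row of a appears at most once.
semi = conn.execute("""
    SELECT a.id FROM a
    WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
""").fetchall()

print(joined, semi)
```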
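Summary
To begin with, SQL logically joins first and then applies the WHERE filter, and there is nothing wrong with that model. The engine then optimizes on top of it, so the two statements below are equivalent, including the execution logic after parsing. Only filters on non-join columns, however, are moved ahead of the join.
That is, the execution order is:
- Table A: filters and other operations on non-join columns
- Table B: filters and other operations on non-join columns
- JOIN
- FROM - WHERE - GROUP BY - HAVING - SELECT - ORDER BY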
SELECT *
FROM (
SELECT *
FROM t_a
WHERE name = 1
) a
JOIN (
SELECT *
FROM t_b
WHERE name = 1
) b
ON a.id = b.id
;
SELECT *
FROM t_a a
JOIN t_b b
ON a.id = b.id
WHERE a.name = 1
AND b.name = 1
;
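As a quick sanity check that the two forms return the same rows for an inner join, here is a sketch on toy data. It uses sqlite3 rather than the real engine, so it compares results only, not execution plans; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_a (id INTEGER, name INTEGER);
CREATE TABLE t_b (id INTEGER, name INTEGER);
INSERT INTO t_a VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO t_b VALUES (1, 1), (2, 2), (3, 1);
""")

# Filter each table in a subquery, then join.
prefiltered = conn.execute("""
    SELECT a.id
    FROM (SELECT * FROM t_a WHERE name = 1) a
    JOIN (SELECT * FROM t_b WHERE name = 1) b
      ON a.id = b.id
    ORDER BY a.id
""").fetchall()

# Join first, then filter in WHERE.
postfiltered = conn.execute("""
    SELECT a.id
    FROM t_a a
    JOIN t_b b ON a.id = b.id
    WHERE a.name = 1
      AND b.name = 1
    ORDER BY a.id
""").fetchall()

print(prefiltered == postfiltered)
```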