窗口函数的基本语法如下:
<窗口函数> over (partition by <用于分组的列名>
order by <用于排序的列名>)
<窗口函数>的位置,可以放以下两种函数:
- 专用窗口函数
- 序号函数:row_number() / rank() / dense_rank()
- 分布函数:percent_rank() / cume_dist()
- 前后函数:lag() / lead()
- 头尾函数:first_val() / last_val()
- 其他函数:nth_value() (用途:返回窗口中第N个expr的值,expr可以是表达式,也可以是列名)/nfile()(用途:将分区中的有序数据分为n个桶,记录桶号)/nfile()
- 聚合函数,如sum(), avg(), count(), max(), min()等
注意事项
- 窗口函数对where和group by子句处理后的结果进行操作,所以原则上窗口函数只能写在select子句中
- 聚合函数是将多条记录聚合为一条;而窗口函数是每条记录都会执行,查询结果并不会改变记录条数,有几条记录执行完还是几条。
- 普通聚合函数也可以用于窗口函数中,赋予它窗口函数的功能。
- 窗口函数的执行顺序(逻辑上的)是在FROM,JOIN,WHERE,GROUP BY,HAVING之后,在ORDER BY,LIMIT,SELECT,DISTINCT之前。它执行时GROUP BY的聚合过程已经完成了,所以不会再产生数据聚合。
建表
create table student (sid char(2), sname char(5), sclass char(2));
create table course (cid char(2), cname char(10));
create table score (sid char(2), cid char(2), score int);
insert into student values('01', '崔健', '01');
insert into student values('02', '李健', '01');
insert into student values('03', '高虎', '01');
insert into student values('04', '子健', '01');
insert into student values('05', '石璐', '01');
insert into student values('06', '亚千', '01');
insert into student values('07', '史立', '01');
insert into student values('08', '窦唯', '01');
insert into student values('09', '华东', '01');
insert into course values('01', '金属');
insert into course values('02', '迷幻');
insert into course values('03', '朋克');
insert into course values('04', '后摇');
insert into score values('01', '01', 60);
insert into score values('02', '01', 85);
insert into score values('03', '01', 57);
insert into score values('04', '01', 34);
insert into score values('05', '01', 78);
insert into score values('06', '01', 90);
insert into score values('07', '01', 76);
insert into score values('08', '01', 90);
insert into score values('09', '01', 85);
insert into score values('01', '02', 78);
insert into score values('02', '02', 59);
insert into score values('03', '02', 59);
insert into score values('04', '02', 79);
insert into score values('05', '02', 88);
insert into score values('01', '03', 65);
insert into score values('03', '03', 89);
insert into score values('05', '03', 46);
insert into score values('06', '03', 85);
insert into score values('07', '03', 89);
insert into score values('08', '03', 79);
insert into score values('03', '04', 99);
insert into score values('04', '04', 95);
insert into score values('07', '04', 68);
insert into score values('08', '04', 59);
insert into score values('09', '04', 80);
1.专用窗口函数
1.1序号函数
- row_number() 、 rank() 、 dense_rank()都是序号函数,一个例子说明三者的区别,对每门课程的成绩排序:
SELECT s.sname, c.cname, sc.score,
ROW_NUMBER() OVER (PARTITION BY c.cname
ORDER BY sc.score DESC) AS row_num,
RANK() OVER (PARTITION BY c.cname
ORDER BY sc.score DESC) AS ranking,
DENSE_RANK() OVER(PARTITION BY c.cname
ORDER BY sc.score DESC) AS dense_ranking
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid
- row_number 相同成绩不会并列,按出现顺序排名
- rank 相同成绩会并列,且下一名为并列排名+并列人数
- dense_rank 相同成绩会并列,并且下一名为并列排名+1
1.2分布函数
- percent_rank()
用途:和之前的RANK()函数相关,每行按照如下公式进行计算:
(rank - 1) / (rows - 1)
其中,rank为RANK()函数产生的序号,rows为当前窗口的记录总行数。
SELECT s.sname, c.cname, sc.score, RANK() OVER(PARTITION BY c.cname ORDER BY sc.score DESC) as ranking,
PERCENT_RANK() OVER (PARTITION BY c.cname
ORDER BY sc.score DESC) as percent
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid
- cume_dist()
用途:分组内大于等于当前rank值的行数/分组内总行数,这个函数比percen_rank使用场景更多。
应用场景:某门课程各同学排名在前百分之几
SELECT s.sname, c.cname, sc.score, RANK() OVER(PARTITION BY c.cname ORDER BY sc.score DESC) as ranking,
CUME_DIST() OVER (PARTITION BY c.cname
ORDER BY sc.score DESC) as cumdist
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid;
亚千和窦唯的金属成绩并列第一,在班级排名前22.22%
1.3前后函数
lag和lead函数可以在同一次查询中取出同一字段的前N行数据(lag)和后N行数据(lead)
语法:
LAG(EXP_STR,OFFSET,DEFVAL)OVER()
LEAD(EXP_STR,OFFSET,DEFVAL)OVER()
EXP_STR
:要取的列
OFFSET
: 取偏移后的第几行数据
DEFVAL
:没有符合条件的默认值
应用场景:求每个用户相邻两次浏览的时间差;求每个同学相邻两门考试的成绩差
SELECT s.sname, c.cname, sc.score,
lead(sc.score,1) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as leadVal,
lag(sc.score,1) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as lagVal,
score - leadVal as diff1,
score - lagVal as diff2
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid;
1.4头尾函数
- first_val()/last_val()
用途:得到分区中的第一个/最后一个指定参数的值。
SELECT s.sname, c.cname, sc.score,
FIRST_VALUE(sc.score) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as firstVal,
LAST_VALUE(sc.score) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as lastVal
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid
1.5其他函数
- nth_value()
SELECT s.sname, s.sclass, c.cname, sc.score,
nth_value(sc.score,1) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as 1th,
nth_value(sc.score,2) OVER (PARTITION BY s.sname
ORDER BY sc.score DESC) as 2th
FROM student s INNER JOIN score sc ON s.sid = sc.sid
INNER JOIN course c ON sc.cid = c.cid
- nfile()
- 用途:将分区中的有序数据分为n个桶,记录桶号。
- 此函数在数据分析中应用较多,比如由于数据量大,需要将数据平均分配到N个并行的进程分别计算,此时就可以用NFILE(N)对数据进行分组,由于记录数不一定被N整除,所以数据不一定完全平均,多出来的部分则依次加给第一组、第二组···直到分配完。