Detailed analysis of Hive window functions / analytic functions

My-sunmy 2021-01-21 23:24:28


Hive window functions / analytic functions

In SQL there is a class of functions called aggregate functions, such as sum(), avg(), and max(). These functions aggregate multiple rows into one row according to some rule, so in general the number of rows after aggregation is smaller than before. Sometimes, however, we want to display both the pre-aggregation rows and the aggregated values at the same time, and this is where window functions come in. Window functions are also called OLAP functions / analytic functions; they combine grouping with sorting.

The most important keywords of a window function are partition by and order by.

The basic syntax is: over (partition by xxx order by xxx)
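For instance (a minimal sketch; the table t and columns k, v are hypothetical, purely for illustration), compare a plain aggregate, which collapses each group to one row, with the same aggregate used as a window function, which keeps every row:

-- Plain aggregate: one output row per group.
select k, sum(v) as total from t group by k;

-- Window function: all input rows are kept, each annotated with its group's running sum.
select k, v,
sum(v) over (partition by k order by v) as running_sum
from t;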

sum, avg, min, max functions

Prepare the data

Create table statement:
create table bigdata_t1(
cookieid string,
createtime string, --day
pv int
) row format delimited
fields terminated by ',';
Load data:
load data local inpath '/root/hivedata/bigdata_t1.dat' into table bigdata_t1;

Contents of bigdata_t1.dat:
cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
Turn on smart local mode
SET hive.exec.mode.local.auto=true;

SUM as a window function: the result depends on ORDER BY, and the default sort order is ascending.

-- pv1
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1
from bigdata_t1;
-- pv2
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from bigdata_t1;
-- pv3
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid) as pv3
from bigdata_t1;
-- pv4
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4
from bigdata_t1;
-- pv5
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5
from bigdata_t1;
-- pv6
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from bigdata_t1;
pv1: the cumulative pv within the group, from the first row up to the current row. For example, pv1 on 04-11 = pv on 04-10 + pv on 04-11; on 04-12 = 04-10 + 04-11 + 04-12.
pv2: same as pv1.
pv3: the sum of all pv within the group (cookie1).
pv4: within the group, the current row plus the 3 preceding rows. For example, 04-11 = 04-10 + 04-11; 04-12 = 04-10 + 04-11 + 04-12;
04-13 = 04-10 + 04-11 + 04-12 + 04-13; 04-14 = 04-11 + 04-12 + 04-13 + 04-14.
pv5: within the group, the 3 preceding rows plus the current row plus the 1 following row. For example, 04-14 = 04-11 + 04-12 + 04-13 + 04-14 + 04-15 = 5+7+3+2+4 = 21.
pv6: within the group, the current row plus all following rows. For example, 04-13 = 04-13 + 04-14 + 04-15 + 04-16 = 3+2+4+4 = 13;
04-14 = 04-14 + 04-15 + 04-16 = 2+4+4 = 10.
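For reference, the six variants can be combined into a single statement (a sketch assembling the queries above); with the sample data, the values below follow directly from the frame definitions and are shown as comments:

select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1,
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2,
sum(pv) over(partition by cookieid) as pv3,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5,
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from bigdata_t1;
-- createtime  pv  pv1  pv2  pv3  pv4  pv5  pv6
-- 2018-04-10   1    1    1   26    1    6   26
-- 2018-04-11   5    6    6   26    6   13   25
-- 2018-04-12   7   13   13   26   13   16   20
-- 2018-04-13   3   16   16   26   16   18   13
-- 2018-04-14   2   18   18   26   17   21   10
-- 2018-04-15   4   22   22   26   16   20    8
-- 2018-04-16   4   26   26   26   13   13    4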

If rows between is not specified, the default frame runs from the start of the partition to the current row (strictly, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which matches the ROWS version as long as the sort key has no duplicates);

If order by is not specified, all rows in the group are accumulated;

The key is to understand rows between, also known as the window clause:

preceding: rows before the current row

following: rows after the current row

current row: the current row

unbounded: no boundary (the partition edge)

unbounded preceding: from the start of the partition

unbounded following: to the end of the partition

AVG, MIN, and MAX are used in exactly the same way as SUM.
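As a quick sketch on the same table, the other aggregates accept the identical over(...) clause; for example, a 4-row moving average, minimum, and maximum:

select cookieid,createtime,pv,
avg(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as avg_pv,
min(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as min_pv,
max(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as max_pv
from bigdata_t1;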

row_number, rank, dense_rank, ntile functions

Prepare the data

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,2
cookie2,2018-04-11,3
cookie2,2018-04-12,5
cookie2,2018-04-13,6
cookie2,2018-04-14,3
cookie2,2018-04-15,9
cookie2,2018-04-16,7
CREATE TABLE bigdata_t2 (
cookieid string,
createtime string, --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
Load data:
load data local inpath '/root/hivedata/bigdata_t2.dat' into table bigdata_t2;
  • ROW_NUMBER() usage

    ROW_NUMBER() numbers the rows within each group sequentially, starting from 1, in the ORDER BY order.

SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM bigdata_t2;
  • RANK and DENSE_RANK usage

    RANK() generates the rank of each row within its group; tied rows receive the same rank, and the ranks that follow are skipped (gaps appear in the sequence).

    DENSE_RANK() also generates ranks within the group, but tied rows leave no gaps in the sequence.

SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM bigdata_t2
WHERE cookieid = 'cookie1';
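With the cookie1 data above (pv sorted descending: 7, 5, 4, 4, 3, 2, 1), the three numbering schemes work out as follows; note how the two tied pv=4 rows behave:

pv   rn1 (RANK)   rn2 (DENSE_RANK)   rn3 (ROW_NUMBER)
7    1            1                  1
5    2            2                  2
4    3            3                  3
4    3            3                  4
3    5            4                  5
2    6            5                  6
1    7            6                  7

Which of the two tied rows receives ROW_NUMBER 3 versus 4 is arbitrary, since the sort key does not distinguish them.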
  • NTILE

    Sometimes there is a requirement like this: split the ordered data into three equal parts, where the business only cares about one of them. How do we take out the middle third of the rows? The NTILE function can do this.

    ntile(num) can be seen as: distribute the ordered rows of a group as evenly as possible into a specified number (num) of buckets, and assign each row its bucket number. If the rows cannot be divided evenly, the lower-numbered buckets are filled first, and the row counts of any two buckets differ by at most 1.

    Based on the bucket number, you can then select the first or last n-th fraction of the data. All rows are still returned; the function merely tags each row. To actually keep just a fraction of the rows, wrap the query in another layer and filter on the tag (see the sketch after the query below).

SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM bigdata_t2
ORDER BY cookieid,createtime;
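As noted above, to keep just the middle third you nest the tagged query and filter on the bucket number; a minimal sketch:

select cookieid,createtime,pv
from (
select cookieid,createtime,pv,
ntile(3) over(partition by cookieid order by createtime) as bucket_no
from bigdata_t2
) t
where bucket_no = 2;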

Other window functions

lag, lead, first_value, last_value functions
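The original post does not reproduce the DDL or sample data for the bigdata_t4 table that the following queries read from; a minimal sketch, assuming the three columns the queries use (the file path mirrors the earlier tables and is an assumption):

create table bigdata_t4(
cookieid string,
createtime string, -- e.g. '2015-04-10 10:00:02'
url string
) row format delimited
fields terminated by ',';

load data local inpath '/root/hivedata/bigdata_t4.dat' into table bigdata_t4;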

  • LAG
    LAG(col, n, DEFAULT) returns the value of col from the n-th row above the current row within the window. The first argument is the column name; the second is how many rows to look back (optional, default 1); the third is the default value to use when the row n positions up does not exist (if not specified, NULL is returned).
 SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM bigdata_t4;
last_1_time: the value 1 row up, with default '1970-01-01 00:00:00'
Row 1 of cookie1: there is no row 1 position up, so the default 1970-01-01 00:00:00 is used
Row 3 of cookie1: the value 1 row up is row 2's value, 2015-04-10 10:00:02
Row 6 of cookie1: the value 1 row up is row 5's value, 2015-04-10 10:50:01
last_2_time: the value 2 rows up, with no default specified
Row 1 of cookie1: there is no row 2 positions up, so NULL
Row 2 of cookie1: there is no row 2 positions up, so NULL
Row 4 of cookie1: the value 2 rows up is row 2's value, 2015-04-10 10:00:02
Row 7 of cookie1: the value 2 rows up is row 5's value, 2015-04-10 10:50:01
  • LEAD

    The opposite of LAG.
    LEAD(col, n, DEFAULT) returns the value of col from the n-th row below the current row within the window.
    The first argument is the column name; the second is how many rows to look ahead (optional, default 1); the third is the default value to use when the row n positions down does not exist (if not specified, NULL is returned).

 SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM bigdata_t4;
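By symmetry with the LAG example: next_1_time on the last row of each cookieid has no row below it, so it falls back to the default '1970-01-01 00:00:00'; next_2_time is NULL on the last two rows of each cookieid, since no default was specified.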
  • FIRST_VALUE

    After sorting within the group, the first value from the start of the group up to the current row.

 SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM bigdata_t4;
  • LAST_VALUE

    After sorting within the group, the last value from the start of the group up to the current row; with the default frame this is simply the current row's own value.

 SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM bigdata_t4;

If you want the last value of the whole group after sorting (not just up to the current row), one trick is to reverse the sort order and take FIRST_VALUE:

 SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM bigdata_t4
ORDER BY cookieid,createtime;
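Alternatively (a sketch using the window-clause syntax introduced earlier), you can keep the original sort order and simply widen the frame to cover the whole partition, so LAST_VALUE really sees the last row of the group:

SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last3
FROM bigdata_t4;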

Pay particular attention to order by.

If ORDER BY is not specified, the ordering of rows within the partition is undefined, so the result can be wrong or nondeterministic:

 SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM bigdata_t4;

cume_dist, percent_rank functions

These two sequence analysis functions are not very commonly used. Note: sequence functions do not support the WINDOW clause (rows between ...).

  • Data preparation
 d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
CREATE EXTERNAL TABLE bigdata_t3 (
dept STRING,
userid string,
sal INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
Load data:
load data local inpath '/root/hivedata/bigdata_t3.dat' into table bigdata_t3;
  • CUME_DIST (the result depends on the ORDER BY sort order)

    CUME_DIST = (number of rows less than or equal to the current value) / (total number of rows in the group). ORDER BY defaults to ascending order.
    For example: what proportion of the group earns a salary less than or equal to the current row's?

 SELECT
dept,
userid,
sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM bigdata_t3;
rn1: no partition, so all rows form a single group with 5 rows in total.
Row 1: 1 row has sal <= 1000, so 1/5 = 0.2
Row 3: 3 rows have sal <= 3000, so 3/5 = 0.6
rn2: grouped by department; dept=d1 has 3 rows.
Row 2: 2 rows have sal <= 2000, so 2/3 = 0.6666666666666666
  • PERCENT_RANK

    PERCENT_RANK = (rank of the current row within the group - 1) / (total number of rows in the group - 1)

 SELECT
dept,
userid,
sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1, -- percent_rank over all rows
RANK() OVER(ORDER BY sal) AS rn11, -- RANK value over all rows
SUM(1) OVER(PARTITION BY NULL) AS rn12, -- total number of rows
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM bigdata_t3;
rn1: rn1 = (rn11 - 1) / (rn12 - 1)
Row 1: (1-1)/(5-1) = 0/4 = 0
Row 2: (2-1)/(5-1) = 1/4 = 0.25
Row 4: (4-1)/(5-1) = 3/4 = 0.75
rn2: grouped by dept;
dept=d1 has 3 rows in total.
Row 1: (1-1)/(3-1) = 0
Row 3: (3-1)/(3-1) = 1
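As a sanity check of the formula at the department level (a sketch; Hive's / operator returns a double, so no cast is needed), the pr and pr_check columns below should match row for row:

SELECT
dept,
userid,
sal,
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS pr,
(RANK() OVER(PARTITION BY dept ORDER BY sal) - 1)
/ (COUNT(1) OVER(PARTITION BY dept) - 1) AS pr_check
FROM bigdata_t3;
-- Caveat: for a single-row partition, pr_check divides by zero and yields NULL,
-- while PERCENT_RANK returns 0. Both departments here have at least 2 rows.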

grouping sets, grouping__id, cube, rollup functions

These clauses are typically used in OLAP scenarios, where the metric cannot simply be added up across dimensions and must instead be computed separately for each drill-up / drill-down combination; for example, UV (unique visitor) counts by hour, by day, and by month.

  • Data preparation
 2018-03,2018-03-10,cookie1
2018-03,2018-03-10,cookie5
2018-03,2018-03-12,cookie7
2018-04,2018-04-12,cookie3
2018-04,2018-04-13,cookie2
2018-04,2018-04-13,cookie4
2018-04,2018-04-16,cookie4
2018-03,2018-03-10,cookie2
2018-03,2018-03-10,cookie3
2018-04,2018-04-12,cookie5
2018-04,2018-04-13,cookie6
2018-04,2018-04-15,cookie3
2018-04,2018-04-15,cookie2
2018-04,2018-04-16,cookie1
CREATE TABLE bigdata_t5 (
month STRING,
day STRING,
cookieid STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;
Load data:
load data local inpath '/root/hivedata/bigdata_t5.dat' into table bigdata_t5;
  • GROUPING SETS

    grouping sets is a convenient way to write several GROUP BY clauses in a single SQL statement.

    It is equivalent to a UNION ALL of the GROUP BY result sets for the different dimensions.

    GROUPING__ID indicates which grouping set a result row belongs to.

 SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
GROUPING SETS (month,day)
ORDER BY GROUPING__ID;
GROUPING__ID indicates which grouping set each result row belongs to;
according to the grouping sets (month, day) here, 1 stands for month and 2 stands for day.
Equivalent to
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month UNION ALL
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day;

Another example:

 SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING__ID;
Equivalent to
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM bigdata_t5 GROUP BY month,day;
  • CUBE

    Aggregates over every combination of the GROUP BY dimensions.

 SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;
Equivalent to
SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM bigdata_t5
UNION ALL
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM bigdata_t5 GROUP BY month,day;
  • ROLLUP

    A subset of CUBE: starting from the leftmost dimension, it aggregates hierarchically along that dimension.

 For example, hierarchical aggregation with month as the leading dimension:
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
WITH ROLLUP
ORDER BY GROUPING__ID;
-- Swap the order of month and day, to aggregate hierarchically with day as the leading dimension:
SELECT
day,
month,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY day,month
WITH ROLLUP
ORDER BY GROUPING__ID;
(Here, aggregating by (day, month) gives the same result as aggregating by day alone, because day and month have a parent-child relationship; with other dimension combinations the results would differ.)
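For reference (a sketch of the standard expansion), GROUP BY month, day WITH ROLLUP is just shorthand for the grouping sets (month, day), (month), and the empty (grand-total) set:

SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
GROUPING SETS ((month,day),(month),())
ORDER BY GROUPING__ID;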


Copyright notice
This article was created by [My-sunmy]; when reposting, please include a link to the original. Thanks.
https://javamana.com/2021/01/20210121214852816n.html
