A Detailed Look at Hive Window Functions / Analytic Functions

itread01 2021-01-22 01:15:13


# Hive Window Functions / Analytic Functions

In SQL there is a class of functions called aggregate functions, such as sum(), avg(), and max(), which collapse multiple rows into a single row according to some rule; in general, the number of rows after aggregation is smaller than before. Sometimes, however, we want to display both the original detail rows and the aggregated values at the same time, and for that we use window functions. Window functions, also called OLAP functions or analytic functions, combine grouping with ordering. Their most important keywords are **partition by** and **order by**, with the following syntax: **over (partition by xxx order by xxx)**

## The sum, avg, min, max functions

Prepare the data:

```
-- DDL:
create table bigdata_t1(
cookieid string,
createtime string,   --day
pv int
) row format delimited
fields terminated by ',';

-- Load the data:
load data local inpath '/root/hivedata/bigdata_t1.dat' into table bigdata_t1;

cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4

-- Enable auto local mode:
SET hive.exec.mode.local.auto=true;
```

Using SUM together with the window clause. The result depends on ORDER BY, which defaults to ascending:

```
--pv1
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime) as pv1
from bigdata_t1;

--pv2
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2
from bigdata_t1;

--pv3
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid) as pv3
from bigdata_t1;

--pv4
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4
from bigdata_t1;

--pv5
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5
from bigdata_t1;

--pv6
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6
from bigdata_t1;

pv1: running total within the group, from the first row to the current row;
     e.g. pv1 for the 11th = pv for the 10th + pv for the 11th; the 12th = 10th + 11th + 12th
pv2: same as pv1
pv3: sum of all pv within the group (cookie1)
pv4: current row plus the preceding 3 rows within the group;
     e.g. the 11th = 10th + 11th; the 12th = 10th + 11th + 12th;
     the 13th = 10th + 11th + 12th + 13th; the 14th = 11th + 12th + 13th + 14th
pv5: current row plus the preceding 3 rows plus the following 1 row;
     e.g. the 14th = 11th + 12th + 13th + 14th + 15th = 5+7+3+2+4 = 21
pv6: current row plus all following rows;
     e.g. the 13th = 13th + 14th + 15th + 16th = 3+2+4+4 = 13;
     the 14th = 14th + 15th + 16th = 2+4+4 = 10
```

If `rows between` is not specified, the frame defaults to running from the first row to the current row; if `order by` is not specified either, all values in the group are accumulated. The key is understanding `rows between`, also known as the **window clause**:

- preceding: rows before the current row
- following: rows after the current row
- current row: the current row
- unbounded: the partition boundary
- unbounded preceding: from the first row of the partition
- unbounded following: to the last row of the partition

AVG, MIN, and MAX are used exactly like SUM.

## The row_number, rank, dense_rank, ntile functions

Prepare the data:

```
cookie1,2018-04-10,1
cookie1,2018-04-11,5
cookie1,2018-04-12,7
cookie1,2018-04-13,3
cookie1,2018-04-14,2
cookie1,2018-04-15,4
cookie1,2018-04-16,4
cookie2,2018-04-10,2
cookie2,2018-04-11,3
cookie2,2018-04-12,5
cookie2,2018-04-13,6
cookie2,2018-04-14,3
cookie2,2018-04-15,9
cookie2,2018-04-16,7

CREATE TABLE bigdata_t2 (
cookieid string,
createtime string,   --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

-- Load the data:
load data local inpath '/root/hivedata/bigdata_t2.dat' into table bigdata_t2;
```

- ROW_NUMBER()

ROW_NUMBER() generates a sequence number for the rows within each group, starting from 1 in the specified order.

```
SELECT
cookieid,
createtime,
pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM bigdata_t2;
```

- RANK and DENSE_RANK

RANK() generates the rank of each row within its group; tied rows receive the same rank and leave gaps in the sequence. DENSE_RANK() also gives tied rows the same rank, but leaves no gaps.

```
SELECT
cookieid,
createtime,
pv,
RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1,
DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM bigdata_t2
WHERE cookieid = 'cookie1';
```

- NTILE

Sometimes the requirement looks like this: the sorted data is to be split into three parts, and the business only cares about one of them. How do we pull out, say, the middle third? The NTILE function answers exactly this need. NTILE can be thought of as evenly distributing an ordered data set into the specified number (num) of buckets and assigning each row its bucket number. If the rows cannot be distributed evenly, the lower-numbered buckets are filled first, and the row counts of any two buckets differ by at most 1. You can then select the first or last n-th of the data by bucket number. All rows are still returned in full; they are merely tagged with a bucket number. To actually extract a given fraction, wrap the query in an outer layer that filters on the tag, as in the sketch after this example.

```
SELECT
cookieid,
createtime,
pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM bigdata_t2
ORDER BY cookieid,createtime;
```
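As a minimal sketch of that outer filtering layer (an addition here, not part of the original article), the following keeps only the earliest third of each cookie's rows; it reuses the bigdata_t2 table from above:

```
-- Sketch: extract the earliest third of each cookie's rows by filtering on
-- the NTILE(3) bucket number assigned in the inner query.
SELECT cookieid, createtime, pv
FROM (
  SELECT
    cookieid,
    createtime,
    pv,
    NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS bucket
  FROM bigdata_t2
) t
WHERE t.bucket = 1;   -- bucket 1 holds the earliest third in createtime order
```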
# Some Other Window Functions

## The lag, lead, first_value, last_value functions

- LAG

**LAG(col,n,DEFAULT) returns the value of the n-th row above the current row within the window.** The first argument is the column name; the second is n, the number of rows to look back (optional, default 1); the third is the default value (used when the n-th row above is NULL; if not specified, NULL is returned).

```
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM bigdata_t4;

last_1_time: the value of the 1st row above, with default '1970-01-01 00:00:00'
  cookie1 row 1: the row above is NULL, so the default 1970-01-01 00:00:00 is used
  cookie1 row 3: the row above is row 2, 2015-04-10 10:00:02
  cookie1 row 6: the row above is row 5, 2015-04-10 10:50:01
last_2_time: the value of the 2nd row above, with no default specified
  cookie1 row 1: 2 rows above is NULL
  cookie1 row 2: 2 rows above is NULL
  cookie1 row 4: 2 rows above is row 2, 2015-04-10 10:00:02
  cookie1 row 7: 2 rows above is row 5, 2015-04-10 10:50:01
```

- LEAD

The opposite of LAG. **LEAD(col,n,DEFAULT) returns the value of the n-th row below the current row within the window.** The first argument is the column name; the second is n, the number of rows to look ahead (optional, default 1); the third is the default value (used when the n-th row below is NULL; if not specified, NULL is returned).

```
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM bigdata_t4;
```

- FIRST_VALUE

Returns the first value within the group after ordering, up to the current row.

```
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM bigdata_t4;
```

- LAST_VALUE

Returns the last value within the group after ordering, up to the current row.

```
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM bigdata_t4;
```

If you want the overall last value of the group after ordering, a workaround is needed:

```
SELECT cookieid,
createtime,
url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM bigdata_t4
ORDER BY cookieid,createtime;
```

**Pay special attention to ORDER BY.** If ORDER BY is not specified, the ordering is undefined and wrong results can appear:

```
SELECT cookieid,
createtime,
url,
FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM bigdata_t4;
```
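Another way to get the group's true last value, added here as an alternative rather than taken from the original article, is to widen the window frame explicitly so that LAST_VALUE sees the whole partition instead of stopping at the current row:

```
-- Sketch: extend the frame to the end of the partition so LAST_VALUE
-- returns the partition's final url rather than the current row's url.
SELECT cookieid,
createtime,
url,
LAST_VALUE(url) OVER(
  PARTITION BY cookieid
  ORDER BY createtime
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_url
FROM bigdata_t4;
```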
## The cume_dist, percent_rank functions

These two sequence analysis functions are not used very often. **Note: sequence functions do not support the WINDOW (rows between) clause.**

- Prepare the data

```
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000

CREATE EXTERNAL TABLE bigdata_t3 (
dept STRING,
userid string,
sal INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

-- Load the data:
load data local inpath '/root/hivedata/bigdata_t3.dat' into table bigdata_t3;
```

- CUME_DIST

The result depends on the ORDER BY sort order (ascending by default). CUME_DIST = (number of rows less than or equal to the current value) / (total rows in the group). For example, to compute the proportion of people whose salary is less than or equal to the current row's salary:

```
SELECT
dept,
userid,
sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM bigdata_t3;

rn1: no PARTITION BY, so all rows form one group of 5.
  Row 1: 1 row has sal <= 1000, so 1/5 = 0.2
  Row 3: 3 rows have sal <= 3000, so 3/5 = 0.6
rn2: grouped by department; dept=d1 has 3 rows.
  Row 2: 2 rows have sal <= 2000, so 2/3 = 0.6666666666666666
```

- PERCENT_RANK

PERCENT_RANK = (RANK of the current row within its group - 1) / (total rows in the group - 1).

```
SELECT
dept,
userid,
sal,
PERCENT_RANK() OVER(ORDER BY sal) AS rn1,               -- value within the group
RANK() OVER(ORDER BY sal) AS rn11,                      -- RANK within the group
SUM(1) OVER(PARTITION BY NULL) AS rn12,                 -- total rows in the group
PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM bigdata_t3;

rn1: rn1 = (rn11-1) / (rn12-1)
  Row 1: (1-1)/(5-1) = 0/4 = 0
  Row 2: (2-1)/(5-1) = 1/4 = 0.25
  Row 4: (4-1)/(5-1) = 3/4 = 0.75
rn2: grouped by dept; dept=d1 has 3 rows in total.
  Row 1: (1-1)/(3-1) = 0
  Row 3: (3-1)/(3-1) = 1
```

## The grouping sets, grouping__id, cube, rollup functions

These analytic functions are typically used in OLAP, where the metrics are not simply additive and have to be computed by rolling up and drilling down across different dimensions, for example UV counts by hour, day, and month.

- Prepare the data

```
2018-03,2018-03-10,cookie1
2018-03,2018-03-10,cookie5
2018-03,2018-03-12,cookie7
2018-04,2018-04-12,cookie3
2018-04,2018-04-13,cookie2
2018-04,2018-04-13,cookie4
2018-04,2018-04-16,cookie4
2018-03,2018-03-10,cookie2
2018-03,2018-03-10,cookie3
2018-04,2018-04-12,cookie5
2018-04,2018-04-13,cookie6
2018-04,2018-04-15,cookie3
2018-04,2018-04-15,cookie2
2018-04,2018-04-16,cookie1

CREATE TABLE bigdata_t5 (
month STRING,
day STRING,
cookieid STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

-- Load the data:
load data local inpath '/root/hivedata/bigdata_t5.dat' into table bigdata_t5;
```

- GROUPING SETS

GROUPING SETS is a convenient way to express several GROUP BY clauses in a single SQL statement; it is equivalent to a UNION ALL over the GROUP BY result sets of the individual dimensions. **GROUPING__ID** indicates which grouping set each result row belongs to.

```
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
GROUPING SETS (month,day)
ORDER BY GROUPING__ID;

-- GROUPING__ID indicates which grouping set the row belongs to;
-- for the grouping conditions month and day, 1 stands for month, 2 stands for day.

-- Equivalent to:
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day;
```

Another example:

```
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING__ID;

-- Equivalent to:
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM bigdata_t5 GROUP BY month,day;
```

- CUBE

Aggregates over all combinations of the GROUP BY dimensions.

```
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING__ID;

-- Equivalent to:
SELECT NULL,NULL,COUNT(DISTINCT cookieid) AS uv,0 AS GROUPING__ID FROM bigdata_t5
UNION ALL
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM bigdata_t5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM bigdata_t5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM bigdata_t5 GROUP BY month,day;
```

- ROLLUP

A subset of CUBE: it aggregates hierarchically, anchored on the leftmost dimension.

```
-- For example, hierarchical aggregation starting from the month dimension:
SELECT
month,
day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY month,day
WITH ROLLUP
ORDER BY GROUPING__ID;

-- Swapping month and day aggregates hierarchically from the day dimension:
SELECT
day,
month,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM bigdata_t5
GROUP BY day,month
WITH ROLLUP
ORDER BY GROUPING__ID;
-- (Here, aggregating by day and month gives the same result as aggregating by
-- day alone, because day and month have a parent-child relationship; with other
-- dimension combinations the results would differ.)
```
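To pull a single aggregation level out of a CUBE result, for instance the per-month rows, instead of running a separate GROUP BY, you can filter on GROUPING__ID in an outer query. A minimal sketch, added here rather than taken from the original, assuming the GROUPING__ID numbering shown in the equivalents above; the exact values differ between Hive versions, so verify them on your cluster first:

```
-- Sketch: select only the month-level aggregation out of the CUBE result.
-- GROUPING__ID = 1 matches the month-only grouping in the examples above;
-- check the value on your Hive version before relying on it.
SELECT month, uv
FROM (
  SELECT
    month,
    day,
    COUNT(DISTINCT cookieid) AS uv,
    GROUPING__ID AS gid
  FROM bigdata_t5
  GROUP BY month, day
  WITH CUBE
) t
WHERE t.gid = 1;
```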
Copyright notice
This article was created by [itread01]. Please include a link to the original when reposting. Thanks.
https://www.itread01.com/content/1611248644.html
