Hive performance tuning (1): Fetch

Homo sapiens 2021-01-22 18:06:50


When we first start learning Hive, we are told that it lowers the learning and development cost for programmers, mainly because it can convert SQL statements into MapReduce programs and run them.

However, in some cases Hive can answer a query without running MapReduce at all. For example, for SELECT * FROM employees; Hive can simply read the files in the storage directory of the employees table and print the query results to the console.

In the hive-default.xml.template file, hive.fetch.task.conversion defaults to more (older Hive versions defaulted to minimal). Once the property is set to more, simple whole-table scans, single-column lookups, LIMIT queries and so on no longer trigger MapReduce.

<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
<description>
Expects one of [none, minimal, more].
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
0. none : disable hive.fetch.task.conversion
</description>
</property>
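Before changing anything, you can print the value your own session is currently using from the Hive CLI (a quick sanity check; the exact output format may vary between Hive versions):

hive (default)> set hive.fetch.task.conversion;
hive.fetch.task.conversion=more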

Let's go through a few practical exercises and see the effect.

Case practice
<1> Set hive.fetch.task.conversion to none and then execute the queries below; every one of them runs a MapReduce job.

hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from score;
hive (default)> select s_score from score;
hive (default)> select s_score from score limit 3;

<2> Set hive.fetch.task.conversion to more and execute the same queries; none of them runs a MapReduce job.

hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from score;
hive (default)> select s_score from score;
hive (default)> select s_score from score limit 3;

We can clearly see that when hive.fetch.task.conversion is set to none, every query goes through MapReduce and takes noticeably longer. But even when it is set to more, only some SQL statements avoid the MapReduce program. Is there any way to optimize the rest?

Local mode

Most Hadoop jobs need the full scalability that Hadoop provides in order to handle large data sets. Sometimes, however, the input to a Hive query is very small. In those cases, the overhead of launching a distributed job for the query can far exceed the actual execution time. For most such cases, Hive can run the whole job on a single machine in local mode. For small data sets, local mode can significantly reduce execution time.

Users can set hive.exec.mode.local.auto to true to let Hive start this optimization automatically when appropriate:
set hive.exec.mode.local.auto=true; -- enable local MR

Set the maximum total input size for local MR; local mode is used only when the input data size is below this value (default 134217728, i.e. 128 MB):
set hive.exec.mode.local.auto.inputbytes.max=51234560;

Set the maximum number of input files for local MR; local mode is used only when the number of input files is below this value (default 4):
set hive.exec.mode.local.auto.input.files.max=10;
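Putting the three settings together, a minimal sketch of a session for a small data set (the thresholds are the example values above; count(*) on the score table is just an illustrative query):

hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> set hive.exec.mode.local.auto.inputbytes.max=51234560;
hive (default)> set hive.exec.mode.local.auto.input.files.max=10;
hive (default)> select count(*) from score;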

Case practice :

<1> Enable local mode and execute the query:

hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> select * from score cluster by s_id;

18 rows selected (1.568 seconds)

<2> Disable local mode and execute the same query:

hive (default)> set hive.exec.mode.local.auto=false;
hive (default)> select * from score cluster by s_id;

18 rows selected (11.865 seconds)

Group By

By default, rows with the same key produced in the map stage are all sent to one reducer; when a single key has too much data, the data becomes skewed.

Not all aggregations have to be completed on the reduce side. Many aggregations can first be partially aggregated on the map side, with the final result computed on the reduce side.

<1> Enable map-side aggregation:
set hive.map.aggr = true;

<2> Number of rows to aggregate on the map side:
set hive.groupby.mapaggr.checkinterval = 100000;

<3> Enable load balancing when the data is skewed (default false):
set hive.groupby.skewindata = true;

When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is randomly distributed to the reducers; each reducer performs a partial aggregation and writes out its result, so rows with the same Group By key may end up in different reducers, which balances the load. The second MR job then distributes the pre-aggregated results to the reducers by the Group By key (this time rows with the same Group By key are guaranteed to reach the same reducer) and completes the final aggregation.
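As a concrete sketch, the settings above can be combined with an ordinary GROUP BY on the score table from the earlier examples (s_id is simply the grouping key used here for illustration):

hive (default)> set hive.map.aggr = true;
hive (default)> set hive.groupby.mapaggr.checkinterval = 100000;
hive (default)> set hive.groupby.skewindata = true;
hive (default)> select s_id, count(*) from score group by s_id;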

Count(distinct)

This does not matter when the data is small, but when the data volume is large, COUNT DISTINCT has to be completed by a single reduce task. That one reducer then has too much data to process, which makes the whole job hard to finish. COUNT DISTINCT is therefore usually replaced by a GROUP BY followed by COUNT:

Environment preparation:

create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
load data local inpath '/home/admin/softwares/data/100 Ten thousand big table data (id Divide 10 integer )/bigtable' into table bigtable;
set hive.exec.reducers.bytes.per.reducer=32123456;

Test:
SELECT count(DISTINCT id) FROM bigtable;

Result:
_c0
10000
Time taken: 35.49 seconds, Fetched: 1 row(s)

It can be rewritten as:

set hive.exec.reducers.bytes.per.reducer=32123456;
SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) a;

Result:
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 13.07 sec   HDFS Read: 120749896  HDFS Write: 464  SUCCESS
Stage-Stage-2: Map: 3  Reduce: 1   Cumulative CPU: 5.14 sec    HDFS Read: 8987       HDFS Write: 7    SUCCESS
_c0
10000
Time taken: 51.202 seconds, Fetched: 1 row(s)

Although this takes one extra job to complete, it is definitely worth it when the data volume is large.

Cartesian product

Try to avoid Cartesian products, i.e. joins without an ON condition or with an invalid ON condition; Hive can only use a single reducer to compute a Cartesian product.
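For illustration, a sketch using the bigtable and ori tables that are created in the next section; the CROSS JOIN is equivalent to a join with no ON condition, and the only difference between the two queries is whether rows are matched on id:

-- Cartesian product: no join condition, all the work falls on a single reducer
SELECT a.id AS a_id, b.id AS b_id FROM bigtable a CROSS JOIN ori b;

-- Equi-join: rows are matched on id and the work can be spread across reducers
SELECT a.id AS a_id, b.id AS b_id FROM bigtable a JOIN ori b ON a.id = b.id;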

Use partition pruning and column pruning

In SELECT, take only the columns you need; if the table is partitioned, filter on the partition columns whenever possible, and use SELECT * as little as possible.
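As a small sketch, using the ori_partitioned table created later in this article (its partition column is p_time): read only the columns you need and touch only one partition instead of running SELECT * over the whole table.

SELECT id, keyword
FROM ori_partitioned
WHERE p_time = 20111230000010;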

When partition pruning is combined with outer joins, if the filter condition on the secondary table is written in the WHERE clause, the full-table join is performed first and the rows are filtered afterwards. For example:

Environment preparation:

create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
load data local inpath '/home/admin/softwares/data/ Increment by increment id Raw data /ori' into table ori;
load data local inpath '/home/admin/softwares/data/100 Ten thousand big table data (id Divide 10 integer )/bigtable' into table bigtable;

Join first, then filter with WHERE:

SELECT a.id
FROM bigtable a
LEFT JOIN ori b ON a.id = b.id
WHERE b.id <= 10;

The correct way is to write the condition after ON, filtering first and then joining:

SELECT a.id
FROM ori a
LEFT JOIN bigtable b ON (b.id <= 10 AND a.id = b.id);

Or write it as a subquery:

SELECT a.id
FROM bigtable a
RIGHT JOIN (SELECT id
FROM ori
WHERE id <= 10
) b ON a.id = b.id;

Dynamic partition adjustment

When data is inserted into a partitioned table in a relational database, the database automatically decides which partition each row belongs to based on the value of the partition column and inserts it there. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but it has to be configured before use. With dynamic partitioning, the partition layout of the first table can be mapped onto the second table: all partitions of the first table are copied to the second table, and when loading data into the second table there is no need to specify a partition explicitly.

<1> Dynamic partition parameter settings
① Enable dynamic partitioning:
set hive.exec.dynamic.partition=true;

② Set non-strict mode (the dynamic partitioning mode; the default strict requires at least one partition column to be specified as a static partition, while nonstrict allows all partition columns to be dynamic):
set hive.exec.dynamic.partition.mode=nonstrict;

③ Maximum total number of dynamic partitions that can be created across all nodes executing MR:
set hive.exec.max.dynamic.partitions=1000;

④ Maximum number of dynamic partitions that can be created on each node executing MR. This parameter must be set according to the actual data. For example, if the source data contains a whole year of data, i.e. the day field has 365 distinct values, this parameter needs to be greater than 365; with the default value of 100 the job may fail:
set hive.exec.max.dynamic.partitions.pernode=100;

⑤ Maximum number of HDFS files that can be created in the whole MR job. On a Linux system each user can open at most 1024 processes by default, and each process can open at most 2048 files, i.e. hold 2048 file handles; the larger this value, the more file handles are required:
set hive.exec.max.created.files=100000;

⑥ Whether to throw an exception when an empty partition is generated. You usually do not need to set this:
set hive.error.on.empty.partition=false;

<2> Case practice. Requirement: take the source data and, based on its time value (e.g. 20111231234568), insert it into the corresponding partitions of the target table ori_partitioned_target.

① Prepare the source partitioned table

create table ori_partitioned(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string)
PARTITIONED BY (p_time bigint)
row format delimited fields terminated by '\t';
load data local inpath '/export/servers/hivedatas/small_data' into table ori_partitioned partition (p_time='20111230000010');
load data local inpath '/export/servers/hivedatas/small_data' into table ori_partitioned partition (p_time='20111230000011');

② Create the target partitioned table

create table ori_partitioned_target(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) PARTITIONED BY (p_time STRING) row format delimited fields terminated by '\t';

③ Analysis: if we had to specify a static partition for each INSERT as described before, this requirement would be hard to satisfy. This is exactly where dynamic partitioning helps.

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions = 1000;
set hive.exec.max.dynamic.partitions.pernode = 100;
set hive.exec.max.created.files = 100000;
set hive.error.on.empty.partition = false;
INSERT overwrite TABLE ori_partitioned_target PARTITION (p_time)
SELECT id, time, uid, keyword, url_rank, click_num, click_url, p_time
FROM ori_partitioned;

Note: specify the partition field names inside PARTITION (...), for example PARTITION (month, day); the last fields of the SELECT clause must correspond to the partition fields specified in PARTITION (...), in the same order.
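To make the note concrete, here is a purely hypothetical sketch with two partition columns; target_table, source_table, and their columns are illustrative only and not part of the case above:

INSERT OVERWRITE TABLE target_table PARTITION (month, day)
SELECT id, keyword, month, day  -- the partition columns month and day come last, in PARTITION (...) order
FROM source_table;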

Check the partitions:

hive> show partitions ori_partitioned_target;
OK
p_time=20111230000010
p_time=20111230000011

That's all for this post. More Hive performance tuning topics in this series are coming soon! If you benefited from it, please remember to like it ٩(๑*ᴗ*๑)۶


