Common Hive tuning methods, plus two interview SQL problems

Hoult (Wu Xie) 2021-02-23 17:48:24
Tags: hive, tuning, interview


As a common data-warehouse component in the big data field, Hive deserves attention to efficiency during both design and development. Hive's efficiency is affected by more than just data volume: data skew, data redundancy, too many jobs or too much I/O, and unreasonable MapReduce allocation all have an impact. Tuning Hive therefore covers both optimizing the HiveQL statements themselves and adjusting Hive configuration items and MapReduce settings.

This article covers tuning from three aspects:
Architecture optimization
Parameter optimization
SQL optimization

1. Architecture optimization

For the execution engine, choose the fastest engine that the platform's resources allow, such as MR, Tez, or Spark.

If Tez is chosen, you can enable the vectorized optimizer, and additionally the cost-based optimizer (CBO). The configuration is as follows:

SET hive.vectorized.execution.enabled = true;        -- default false
SET hive.vectorized.execution.reduce.enabled = true; -- default false
SET hive.cbo.enable = true;                  -- default true since v0.14.0
SET hive.compute.query.using.stats = true;   -- default false
SET hive.stats.fetch.column.stats = true;    -- default false
SET hive.stats.fetch.partition.stats = true; -- default true

Optimize table design: choose partitioned tables and bucketed tables where appropriate, and pick a suitable storage format. To reduce data transfer you can enable compression. Some relevant parameters follow (more can be found on the official site):

-- Compress intermediate results
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress final output
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
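As a sketch of the table-design advice above, a partitioned, bucketed table stored as compressed ORC could be declared like this (the table and column names are hypothetical):

```sql
-- Hypothetical table: daily user events, partitioned by date so queries
-- prune whole directories, and bucketed by user_id so joins and sampling
-- on user_id touch fewer files.
CREATE TABLE user_events (
  user_id BIGINT,
  event   STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```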

2. Parameter optimization

The second part is parameter optimization. As the architecture section above shows, some of those behaviors are also controlled by parameters. This section mainly covers the following:

Local mode, strict mode, JVM reuse, parallel execution, speculative execution, merging small files, and fetch mode.

2.1 Local mode

When the data volume is small, launching distributed processing is wasteful: the startup overhead alone can take longer than running locally. Use the following parameters to enable local mode:

SET hive.exec.mode.local.auto=true; -- default false
SET hive.exec.mode.local.auto.inputbytes.max=50000000;
-- run locally when the total input size is below this value (default 134217728, i.e. 128MB)
SET hive.exec.mode.local.auto.input.files.max=5;
-- run locally when the number of map tasks is below this value (default 4)

2.2 Strict mode

Strict mode is a switch: when it is off, the three kinds of statements below execute normally; when it is on, they fail automatically.

SET hive.mapred.mode=strict; -- default nonstrict
-- querying a partitioned table without a filter on the partition column;
-- a join between two tables that produces a Cartesian product;
-- using order by without specifying limit
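For example, with strict mode on, an unbounded global sort is rejected (the table and column names are hypothetical):

```sql
-- Fails under strict mode: ORDER BY with no LIMIT would funnel the
-- entire table through a single reducer
SELECT * FROM orders ORDER BY amount DESC;

-- Passes: the LIMIT bounds the single reducer's output
SELECT * FROM orders ORDER BY amount DESC LIMIT 100;
```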

2.3 JVM reuse

In MapReduce, each task runs in a process, and each process is a JVM. For jobs made of many short tasks, reusing JVMs avoids paying the startup cost over and over. The drawback is that a reused JVM holds its task slot until the whole job finishes, which is especially noticeable when data is skewed. Turn this on with the following parameter:

SET mapreduce.job.jvm.numtasks=5;

2.4 Parallel execution

Hive converts a query into one or more stages. Stages that do not depend on each other can run in parallel. Use the following parameters:

SET hive.exec.parallel=true;             -- default false
SET hive.exec.parallel.thread.number=16; -- default 8

2.5 Speculative execution

These parameters trade extra resources for time. When some tasks run especially slowly, for example because of network problems or uneven resources, a backup task is launched to process the same data, and the result of whichever copy finishes first is taken as the final result.

SET mapreduce.map.speculative=true;
SET mapreduce.reduce.speculative=true;
SET hive.mapred.reduce.tasks.speculative.execution=true;

2.6 Merge small files

Before the map phase, merge small files to reduce the number of maps:

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

At the end of the job, merge small output files:

-- merge small files at the end of a map-only job, default true
SET hive.merge.mapfiles = true;
-- merge small files at the end of a map-reduce job, default false
SET hive.merge.mapredfiles = true;
-- target size of merged files, default 256MB
SET hive.merge.size.per.task = 268435456;
-- when the average output file size is below this value, start a separate
-- map-reduce job to merge the files, default 16MB
SET hive.merge.smallfiles.avgsize = 16777216;

2.7 Fetch mode

Finally, fetch mode lets Hive avoid launching MapReduce for simple cases: selecting a few columns, simple filters, and LIMIT queries can be answered by reading the files directly.

SET hive.fetch.task.conversion=more; -- default more since v0.14.0

3. SQL optimization

This part is more involved and may touch on data skew. Data skew has always been an unavoidable problem in big data processing, and there are many ways to deal with it.

3.1 SQL optimization

SQL optimization is the part developers can control most directly, and it is largely driven by experience. The common techniques can be summarized as:

Column pruning and partition pruning; sort by instead of order by; group by instead of count(distinct); map-side pre-aggregation for group by (controlled by parameters); the skew-handling configuration items; map join; filtering null values separately; tuning the number of maps and reduces appropriately. Almost all of these come up in daily work, and applying them wherever possible is the job.
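As an illustration of one of these techniques (the table and column names are hypothetical): COUNT(DISTINCT) funnels all rows through a single reducer, while the GROUP BY rewrite spreads the deduplication across many reducers:

```sql
-- Skew-prone: one reducer must deduplicate every uid
SELECT COUNT(DISTINCT uid) FROM access_log;

-- Scales better: GROUP BY deduplicates in parallel,
-- then the outer query only counts the distinct groups
SELECT COUNT(1)
FROM (SELECT uid FROM access_log GROUP BY uid) t;
```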

3.2 Tilt balance configuration item

This configuration and group by The tilt equilibrium configuration terms of are the same , adopt hive.optimize.skewjoin To configure the , Default false. If it's on , stay join In the process Hive The count will exceed the threshold hive.skewjoin.key ( Default 100000) Inclination key The corresponding line is temporarily written into the file , And then start another job do map join Generate results . adopt hive.skewjoin.mapjoin.map.tasks Parameters can also control the second job Of mapper Number , Default 1000
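Collected in one place, the settings described above (values shown are the defaults):

```sql
SET hive.optimize.skewjoin=true;           -- default false
SET hive.skewjoin.key=100000;              -- rows per key before it counts as skewed
SET hive.skewjoin.mapjoin.map.tasks=10000; -- mappers for the follow-up map join
```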

3.3 Handle the tilt separately key

If it's tilted key It has practical significance , Generally speaking, inclined key Very few , At this point, they can be extracted separately , The corresponding row is stored separately in the temporary table , Then prefix it with a smaller random number ( such as 0~9), Finally, we can aggregate . Don't have one Select In the sentence , Write too much Join. Be sure to understand the business , Understand the data .(A0-A9) Split into multiple statements , Step by step ;(A0-A4; A5-A9); First perform the association of large table and small table ;
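A minimal sketch of the random-prefix (salting) idea for a skewed aggregation; the table name, column name, and the assumption that keys contain no underscore are all hypothetical:

```sql
-- Stage 2: strip the salt and combine the partial counts
SELECT split(salted_key, '_')[1] AS key, SUM(cnt) AS cnt
FROM (
  -- Stage 1: partial aggregation on the salted key, spreading one hot
  -- key over ~10 reducers
  SELECT salted_key, COUNT(1) AS cnt
  FROM (
    -- attach a random 0-9 prefix to each row's key
    SELECT concat(cast(floor(rand() * 10) AS string), '_', key) AS salted_key
    FROM skewed_table
  ) s
  GROUP BY salted_key
) t
GROUP BY split(salted_key, '_')[1];
```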

4. Two interview SQL problems

4.1 Find all teams with 3 or more consecutive championship years

team,year
Pistons,1990
Bulls,1991
Bulls,1992


-- Step 1: rank the years within each team
select team, year,
       row_number() over (partition by team order by year) as rn
from t1;
-- Step 2: derive a group id -- consecutive years share the same (year - rn)
select team, year,
       row_number() over (partition by team order by year) as rn,
       (year - row_number() over (partition by team order by year)) as groupid
from t1;
-- Step 3: group by team and group id, keep runs of 3 or more
select team, count(1) as years
from (select team,
             (year - row_number() over (partition by team order by year)) as groupid
      from t1
     ) tmp
group by team, groupid
having count(1) >= 3;

4.2 For each id, find all the peaks and troughs within a day

Peak:
the value at this moment > the value at the previous moment
the value at this moment > the value at the next moment
Trough:
the value at this moment < the value at the previous moment
the value at this moment < the value at the next moment
id, time, price, previous value (lag), next value (lead)
sh66688, 9:35, 29.48 null 28.72
sh66688, 9:40, 28.72 29.48 27.74
sh66688, 9:45, 27.74
sh66688, 9:50, 26.75
sh66688, 9:55, 27.13
sh66688, 10:00, 26.30
sh66688, 10:05, 27.09
sh66688, 10:10, 26.46
sh66688, 10:15, 26.11
sh66688, 10:20, 26.88
sh66688, 10:25, 27.49
sh66688, 10:30, 26.70
sh66688, 10:35, 27.57
sh66688, 10:40, 28.26
sh66688, 10:45, 28.03

-- Idea: the key is identifying the features of peaks and troughs
-- A peak is greater than both the previous and the next value
-- A trough is less than both the previous and the next value
-- Once the feature is clear, the SQL is easy to write
select id, time, price,
       case when price > beforeprice and price > afterprice then 'peak'
            when price < beforeprice and price < afterprice then 'trough'
       end as feature
from (select id, time, price,
             lag(price)  over (partition by id order by time) beforeprice,
             lead(price) over (partition by id order by time) afterprice
      from t2
     ) tmp
where (price > beforeprice and price > afterprice)
   or (price < beforeprice and price < afterprice);

Wu Xie, "Third Master", a newcomer working in backend, big data, and artificial intelligence.
Copyright notice
This article was written by Hoult (Wu Xie). Please include the original link when reposting. Thanks.
https://javamana.com/2021/02/20210223174634706K.html
