Hive common tuning methods & two interview SQL

Hoult_ Wu Xie 2021-02-23 18:15:25
hive common tuning methods interview

As a common data warehouse component in the big data field, Hive requires attention to efficiency during both design and development. Hive's efficiency is affected by more than just data volume: data skew, data redundancy, too many jobs or too much I/O, and unreasonable MapReduce allocation all have an impact. Tuning Hive therefore covers both optimization of the HiveQL statements themselves and adjustment of Hive configuration items and MapReduce settings.

This article covers three aspects: architecture optimization, parameter optimization, and SQL optimization.

1. Architecture optimization

For the execution engine, choose a better and faster engine according to the resources available on your platform, such as MR, Tez, or Spark.

If you choose the Tez engine, you can enable the vectorized optimizer, and you can also enable the cost-based optimizer (CBO). The configuration is as follows:

SET hive.vectorized.execution.enabled = true;        -- default false
SET hive.vectorized.execution.reduce.enabled = true; -- default false
SET hive.cbo.enable = true;                          -- default true since v0.14.0
SET hive.compute.query.using.stats = true;           -- default false
SET hive.stats.fetch.column.stats = true;            -- default false
SET hive.stats.fetch.partition.stats = true;         -- default true

Optimize the design of tables: for example, use partitioned tables, bucketed tables, and an appropriate storage format. To reduce data transfer you can enable compression; here are some of the relevant parameters (more can be found on the official website):

-- Compress intermediate results
SET hive.exec.compress.intermediate=true;
-- Compress the final output
SET hive.exec.compress.output=true;
-- The codec is configurable; Snappy here is only an example
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

2. Parameter optimization

The second part is parameter optimization. Some of the architecture items above are in fact also controlled by parameters. Parameter tuning mainly covers the following areas:

local mode, strict mode, JVM reuse, parallel execution, speculative execution, merging small files, and Fetch mode.

2.1 Local mode

When the amount of data is small, starting a distributed job is slow: the startup overhead takes a long time and ends up slower than local mode. Use the following parameters to adjust this:

SET hive.exec.mode.local.auto=true;                     -- default false
SET hive.exec.mode.local.auto.inputbytes.max=134217728; -- input size must be below this (default 128MB)
SET hive.exec.mode.local.auto.input.files.max=4;        -- number of map tasks must be at most this (default 4)

2.2 Strict mode

Strict mode is essentially a switch. When a statement matches one of the following three cases it will fail; without strict mode such statements execute normally, and with it enabled they fail automatically:

-- queries on a partitioned table that do not restrict the partition column;
-- joins between two tables that produce a Cartesian product;
-- statements that use order by without specifying a limit.
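A minimal sketch of enabling strict mode; the table name `orders` and partition column `dt` are hypothetical. (Note that in Hive 2.x this single switch was split into separate `hive.strict.checks.*` properties.)

```sql
SET hive.mapred.mode=strict;   -- default nonstrict in older versions

-- Fails in strict mode: no filter on the partition column
-- SELECT * FROM orders;

-- OK: the partition column dt is restricted
SELECT * FROM orders WHERE dt = '2021-02-23' LIMIT 100;
```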

2.3 JVM reuse

In MapReduce, each task runs in its own process, and a process is a JVM. For short jobs, reusing these JVM processes saves repeated startup cost; the drawback is that a reused task slot is held until the entire job finishes, which is more noticeable when the data is skewed. Turn this on with the following parameter:

SET mapreduce.job.jvm.numtasks=5;

2.4 Parallel execution

Hive converts a query into one or more stages. Stages that do not depend on each other can be executed in parallel. Use the following parameters:

SET hive.exec.parallel=true; -- Default false
SET hive.exec.parallel.thread.number=16; -- Default 8

2.5 Speculative execution

The purpose of this setting is to trade extra resources for time. When some tasks run especially slowly (because of network issues or uneven resource allocation), a backup attempt is started to process the same data, and the result of whichever attempt finishes first is taken as the final result.

SET mapreduce.reduce.speculative=true;
SET hive.mapred.reduce.tasks.speculative.execution=true;

2.6 Merge small files

Before the map phase, merge small input files to reduce the number of map tasks:

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- the default in current Hive versions

At the end of the job, merge small output files:

-- Merge small files at the end of a map-only task, default true
SET hive.merge.mapfiles = true;
-- Merge small files at the end of a map-reduce task, default false
SET hive.merge.mapredfiles = true;
-- Target size of merged files, default 256MB
SET hive.merge.size.per.task = 268435456;
-- When the average output file size is smaller than this value, start an independent map-reduce task to merge files
SET hive.merge.smallfiles.avgsize = 16777216;

2.7 Fetch Pattern

The last one is Fetch mode: in some cases Hive can avoid running MapReduce altogether, for example when querying just a few columns, doing a simple full-table select, looking up individual columns, or using limit.
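A sketch of the relevant switch; the table `t` and its columns are hypothetical. `hive.fetch.task.conversion` accepts none / minimal / more, and `more` has been the default since Hive 0.14:

```sql
SET hive.fetch.task.conversion=more;

-- With 'more', simple selects and limits are served by a direct fetch,
-- without launching a MapReduce job:
SELECT col1, col2 FROM t LIMIT 10;
```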


3. SQL optimization

This part is more involved, and it may touch on data skew. Data skew has always been an unavoidable problem in big data processing; there are many ways to deal with it.

3.1 sql Optimize

sql Optimization is the easiest part for developers to control , It's often experience that makes it , To sum up, the following way

Column , Partition dismantling ,sort by Instead of order by, group by Instead of count(distinct) ,group by Prepolymerization of ( To control by parameters ), Tilt configuration items ,map join, Filter null values separately , To adjust properly map and reduces Count , These are almost all met in the work , To optimize them as much as possible is what you need to do
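As one example from the list above, replacing count(distinct) with group by distributes the deduplication across reducers instead of funneling every row through a single reducer. A sketch, assuming a hypothetical table `logs` with a user id column `uid`:

```sql
-- count(distinct): all rows are deduplicated by one reducer
SELECT count(DISTINCT uid) FROM logs;

-- group by: deduplication is spread across reducers, then counted
SELECT count(1)
FROM (SELECT uid FROM logs GROUP BY uid) tmp;
```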

3.2 Skew-balancing configuration item

This works like the skew-balancing configuration for group by and is enabled via hive.optimize.skewjoin (default false). When it is on, if during a join the row count for a key exceeds the threshold hive.skewjoin.key (default 100000), Hive temporarily writes the corresponding rows to files and then starts another job that map-joins them to produce the result. The number of mappers of that second job can be controlled with hive.skewjoin.mapjoin.map.tasks (default 10000).
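The skew-join settings discussed above can be put together as follows (the two thresholds are shown at their defaults):

```sql
SET hive.optimize.skewjoin=true;            -- default false
SET hive.skewjoin.key=100000;               -- rows per key above this are treated as skewed
SET hive.skewjoin.mapjoin.map.tasks=10000;  -- mappers for the follow-up map-join job
```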

3.3 Handle skewed keys separately

If the skewed keys are meaningful, they are usually few in number. In that case they can be extracted separately and their rows stored in a temporary table; then prefix the key with a small random number (say 0~9), aggregate, and finally merge the results. Do not write too many joins in a single select statement; be sure to understand the business and the data. Split the work (A0-A9) into multiple statements and execute it step by step (e.g. A0-A4, then A5-A9), performing the join between the large table and the small table first.
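A minimal sketch of the random-prefix (salting) idea for a skewed aggregation; the table `big` and its column `key` are hypothetical:

```sql
-- Step 1: attach a random salt (0~9) to each row, so one hot key
--         is spread over up to 10 groups.
-- Step 2: pre-aggregate by (key, salt).
-- Step 3: aggregate again by key alone to merge the partial results.
SELECT key, sum(cnt) AS cnt
FROM (
  SELECT key, salt, count(1) AS cnt
  FROM (SELECT key, floor(rand() * 10) AS salt FROM big) s
  GROUP BY key, salt
) pre
GROUP BY key;
```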

4. Two interview SQL questions

4.1 Find all teams that won in 3 or more consecutive years

Sample data:

team    year
piston  1990
Bull    1991
Bull    1992

-- 1. Rank years within each team
select team, year,
       row_number() over (partition by team order by year) as rn
from t1;
-- 2. Derive a group id: consecutive years share the same (year - rn)
select team, year,
       row_number() over (partition by team order by year) as rn,
       (year - row_number() over (partition by team order by year)) as groupid
from t1;
-- 3. Group by (team, groupid) and keep runs of at least 3 years
select team, count(1) as years
from (select team,
             (year - row_number() over (partition by team order by year)) as groupid
      from t1
     ) tmp
group by team, groupid
having count(1) >= 3;

4.2 Find all peaks and troughs for each id within a day

Peak: the value at this moment is greater than both the previous moment's value and the next moment's value.
Trough: the value at this moment is less than both the previous moment's value and the next moment's value.

Sample data (previous value = lag, next value = lead, shown for the first rows):

id       time   price  lag    lead
sh66688  9:35   29.48  null   28.72
sh66688  9:40   28.72  29.48  27.74
sh66688  9:45   27.74
sh66688  9:50   26.75
sh66688  9:55   27.13
sh66688  10:00  26.30
sh66688  10:05  27.09
sh66688  10:10  26.46
sh66688  10:15  26.11
sh66688  10:20  26.88
sh66688  10:25  27.49
sh66688  10:30  26.70
sh66688  10:35  27.57
sh66688  10:40  28.26
sh66688  10:45  28.03

-- Idea: the key is to express the features of peaks and troughs
-- Peak: greater than both the previous and the next value
-- Trough: less than both the previous and the next value
-- Once the features are expressed, the SQL is easy to write
select id, time, price,
       case when price > beforeprice and price > afterprice then 'peak'
            when price < beforeprice and price < afterprice then 'trough'
       end as feature
from (select id, time, price,
             lag(price)  over (partition by id order by time) as beforeprice,
             lead(price) over (partition by id order by time) as afterprice
      from t2
     ) tmp
where (price > beforeprice and price > afterprice)
   or (price < beforeprice and price < afterprice);


This article was written by [Hoult_ Wu Xie]. When reposting, please include a link to the original. Thanks.
