Hive is a common data warehouse component in the big data field, so efficiency deserves attention in both the design and development phases. Data volume is not the only thing that affects Hive's efficiency: data skew, data redundancy, too many jobs or too much I/O, and poorly distributed MapReduce work all have an impact. Tuning Hive covers both optimizing the HiveQL statements themselves and adjusting Hive configuration items and MR settings.

This article covers tuning from three aspects:

Architecture optimization

Parameter optimization

SQL optimization

1. Architecture optimization

For the execution engine, choose the better and faster engine that fits the resources of the company's platform, such as MR, TEZ, or Spark.

If you choose the TEZ engine, you can enable vectorized execution, and you can also enable the cost-based optimizer (CBO). The configuration is as follows:

SET hive.vectorized.execution.enabled=true; -- Default false
SET hive.vectorized.execution.reduce.enabled=true; -- Default false
SET hive.cbo.enable=true; -- Available since v0.14.0; default true in recent versions
SET hive.compute.query.using.stats=true; -- Default false
SET hive.stats.fetch.column.stats=true; -- Default false
SET hive.stats.fetch.partition.stats=true; -- Default true

Optimize the table design, for example by choosing partitioned tables, bucketed tables, and a suitable storage format. To reduce data transfer, you can use compression. Here are some parameters (more can be found on the official website):

-- Compress intermediate results
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress the final output
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
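As a rough sketch of the table-design advice above, the following DDL combines partitioning, bucketing, and a compressed columnar storage format. The table name orders_demo, its columns, and the bucket count of 32 are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table: name, columns, and bucket count are illustrative only
CREATE TABLE IF NOT EXISTS orders_demo (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (dt STRING)                 -- one partition per day
CLUSTERED BY (user_id) INTO 32 BUCKETS     -- bucket on a frequent join key
STORED AS ORC                              -- columnar storage format
TBLPROPERTIES ('orc.compress'='SNAPPY');   -- compress the stored data

-- Filtering on the partition column lets Hive read only the matching partition
SELECT user_id, SUM(amount) AS total_amount
FROM orders_demo
WHERE dt = '2020-01-01'
GROUP BY user_id;
```

Which column to partition or bucket on depends on how the table is actually queried, so treat the choices here as placeholders.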

2. Parameter optimization

The second part is parameter optimization. Some of the architecture choices above are also controlled by parameters. The parameters in this part mainly cover the following areas:

Local mode, strict mode, JVM reuse, parallel execution, speculative execution, merging small files, and Fetch mode

2.1 Local mode

When the amount of data is small, starting a distributed job takes a long time and ends up slower than running locally. Use the following parameters to enable local mode:

SET hive.exec.mode.local.auto=true; -- Default false
SET hive.exec.mode.local.auto.inputbytes.max=50000000; -- Local mode is used only when the total input size is below this value
SET hive.exec.mode.local.auto.input.files.max=5; -- Default 4; local mode is used only when the number of map tasks is below this value

2.2 Strict mode

Strict mode is a switch. When it is on, the following three kinds of statements fail automatically; when it is off, they run normally.

SET hive.mapred.mode=strict; -- nonstrict turns the checks off
-- A query on a partitioned table that does not filter on the partition column;
-- A join of two tables that produces a Cartesian product;
-- An ORDER BY without a LIMIT.
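To make the three cases concrete, here is a minimal sketch of statements that strict mode rejects, each followed by a form it accepts. The tables logs_demo (assumed to be partitioned by dt) and users_demo, and all their columns, are hypothetical.

```sql
-- Hypothetical tables: logs_demo (partitioned by dt) and users_demo
SET hive.mapred.mode=strict;

-- Rejected: no filter on the partition column
-- SELECT * FROM logs_demo;
-- Accepted: the partition column dt is restricted
SELECT * FROM logs_demo WHERE dt = '2020-01-01';

-- Rejected: join without a join condition (Cartesian product)
-- SELECT * FROM logs_demo l JOIN users_demo u;
-- Accepted: explicit join condition
SELECT l.user_id, u.name
FROM logs_demo l
JOIN users_demo u ON l.user_id = u.user_id
WHERE l.dt = '2020-01-01';

-- Rejected: ORDER BY without LIMIT
-- SELECT user_id, ts FROM logs_demo WHERE dt = '2020-01-01' ORDER BY ts;
-- Accepted: ORDER BY with LIMIT
SELECT user_id, ts
FROM logs_demo
WHERE dt = '2020-01-01'
ORDER BY ts DESC
LIMIT 100;
```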

2.3 JVM reuse

In MR, each task runs in its own process, and each process is a JVM. For short jobs, reusing these JVMs lets tasks start much faster. The drawback is that a reused JVM holds its task slot until the whole job finishes, which is more noticeable when the data is skewed. Turn it on with the following parameter:

SET mapreduce.job.jvm.numtasks=5;

2.4 Parallel execution

Hive translates a query into stages. Stages that do not depend on each other can run in parallel. Use the following parameters:

SET hive.exec.parallel=true; -- Default false
SET hive.exec.parallel.thread.number=16; -- Default 8

2.5 Speculative execution

Speculative execution trades extra resources for time. When some tasks run particularly slowly, for example because of network problems or uneven resources, a backup task is started to process the same data, and the result of whichever attempt finishes successfully first is used as the final result.

SET mapreduce.map.speculative=true;
SET mapreduce.reduce.speculative=true;
SET hive.mapred.reduce.tasks.speculative.execution=true;

2.6 Merge small files

Before the map phase, merge small files first to reduce the number of map tasks:

SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

At the end of the job, merge small output files:

-- Merge small files at the end of a map-only job, default true
SET hive.merge.mapfiles=true;
-- Merge small files at the end of a map-reduce job, default false
SET hive.merge.mapredfiles=true;
-- Target size of the merged files, default 256M
SET hive.merge.size.per.task=268435456;
-- When the average output file size is below this value, start a separate map-reduce job to merge the files
SET hive.merge.smallfiles.avgsize=16777216;

2.7 Fetch mode

The last one is Fetch mode: in some cases Hive can skip MR entirely, for example when selecting a few columns, running SELECT *, filtering on columns, or using LIMIT.

SET hive.fetch.task.conversion=more;

3. SQL optimization

This part is more complicated and may involve data skew, which has always been an unavoidable problem in big data processing. There are many ways to deal with it.

3.1 SQL optimization

SQL optimization is the part developers can control most easily, and it mostly comes from experience. The common techniques are:

Column pruning, partition pruning, SORT BY instead of ORDER BY, GROUP BY instead of COUNT(DISTINCT), map-side pre-aggregation for GROUP BY (controlled by parameters), skew configuration items, map join, filtering null values separately, and adjusting the number of map and reduce tasks appropriately. Almost all of these come up in day-to-day work, and the goal is to apply them wherever possible.
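A few of the techniques listed above can be sketched as follows. The table logs_demo and its columns (user_id, ts, dt) are hypothetical; the SET property is a standard Hive configuration item, but whether each rewrite helps depends on the data and the Hive version.

```sql
-- Hypothetical table: logs_demo(user_id, ts) partitioned by dt

-- COUNT(DISTINCT) sends every row to a single reducer:
-- SELECT COUNT(DISTINCT user_id) FROM logs_demo WHERE dt = '2020-01-01';

-- GROUP BY spreads the deduplication across reducers, then counts the groups.
-- Selecting only user_id and filtering on dt also applies column and partition pruning.
SELECT COUNT(1) AS uv
FROM (
  SELECT user_id
  FROM logs_demo
  WHERE dt = '2020-01-01'
  GROUP BY user_id
) t;

-- Map-side pre-aggregation for GROUP BY
SET hive.map.aggr=true;

-- SORT BY orders rows within each reducer; DISTRIBUTE BY decides which reducer gets a row
SELECT user_id, ts
FROM logs_demo
WHERE dt = '2020-01-01'
DISTRIBUTE BY user_id
SORT BY user_id, ts;
```

Note that SORT BY does not produce a global order, so it only replaces ORDER BY when per-reducer ordering is enough.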

3.2 Skew balancing configuration items

This works the same way as the skew balancing configuration for GROUP BY. It is configured through hive.optimize.skewjoin, which defaults to false. When it is on, during a join Hive temporarily writes the rows of any skewed key whose count exceeds the threshold hive.skewjoin.key (default 100000) to a file, and then starts another job that performs a map join to produce the result. The hive.skewjoin.mapjoin.map.tasks parameter controls the number of mappers in that second job (default 10000).
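Put together, the settings described above might look like the following sketch. The values are illustrative rather than recommendations, and hive.groupby.skewindata is included on the assumption that it is the GROUP BY counterpart the text refers to.

```sql
-- Skewed join keys are written aside and joined in a follow-up map join job
SET hive.optimize.skewjoin=true;
-- Row count per key above which the key is treated as skewed
SET hive.skewjoin.key=100000;
-- Number of mappers for the follow-up job (illustrative value)
SET hive.skewjoin.mapjoin.map.tasks=10000;
-- Assumed GROUP BY counterpart: two-stage aggregation for skewed group keys
SET hive.groupby.skewindata=true;
```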

3.3 Handling skewed keys separately

If the skewed keys carry real business meaning, there are usually only a few of them. In that case they can be pulled out separately: store the corresponding rows in a temporary table, add a small random prefix to the key (for example 0~9), and aggregate at the end. Do not write too many joins in a single SELECT statement, and be sure to understand the business and the data. A statement that joins tables A0-A9 can be split into several statements executed step by step (A0-A4, then A5-A9), joining large tables with small tables first.
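The random-prefix idea can be sketched as a two-stage aggregation. This is one possible shape under stated assumptions, not the only way to do it: the temporary table skewed_rows_tmp and its column skew_key are hypothetical, and the salt range 0~9 follows the example in the text.

```sql
-- Hypothetical temp table skewed_rows_tmp(skew_key) holding only the rows of the skewed keys
SELECT skew_key, SUM(partial_cnt) AS cnt
FROM (
  -- Stage 1: aggregate per (key, salt) so one hot key is spread over up to 10 groups
  SELECT skew_key, salt, COUNT(1) AS partial_cnt
  FROM (
    SELECT skew_key, CAST(FLOOR(RAND() * 10) AS INT) AS salt   -- random salt in 0~9
    FROM skewed_rows_tmp
  ) salted
  GROUP BY skew_key, salt
) staged
-- Stage 2: combine the partial results, at most 10 rows per key
GROUP BY skew_key;
```

The result for the skewed keys can then be combined with the result for the remaining keys, for example with UNION ALL.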

4. Two SQL exercises

4.1 Find all teams that won in 3 consecutive years

team,year

Pistons,1990

Bulls,1991

Bulls,1992


-- 1. Rank each team's years
select team, year,
       row_number() over (partition by team order by year) as rn
from t1;

-- 2. Compute a group id: consecutive years share the same (year - rn)
select team, year,
       row_number() over (partition by team order by year) as rn,
       (year - row_number() over (partition by team order by year)) as groupid
from t1;

-- 3. Group and keep the runs of at least 3 years
select team, count(1) as years
from (select team,
             (year - row_number() over (partition by team order by year)) as groupid
      from t1
     ) tmp
group by team, groupid
having count(1) >= 3;

4.2 Find all peaks and troughs for each id within a day

Peak:

The value at this moment > the value at the previous moment

The value at this moment > the value at the next moment

Trough:

The value at this moment < the value at the previous moment

The value at this moment < the value at the next moment

id       time   price  previous value (lag)  next value (lead)
sh66688  9:35   29.48  null                  28.72
sh66688  9:40   28.72  29.48                 27.74
sh66688  9:45   27.74
sh66688  9:50   26.75
sh66688  9:55   27.13
sh66688  10:00  26.30
sh66688  10:05  27.09
sh66688  10:10  26.46
sh66688  10:15  26.11
sh66688  10:20  26.88
sh66688  10:25  27.49
sh66688  10:30  26.70
sh66688  10:35  27.57
sh66688  10:40  28.26
sh66688  10:45  28.03

-- Idea: the key is to identify what characterizes a peak or a trough
-- A peak is greater than both the previous and the next value
-- A trough is less than both the previous and the next value
-- Once this is clear, the SQL is easy to write
select id, time, price,
       case when price > beforeprice and price > afterprice then 'peak'
            when price < beforeprice and price < afterprice then 'trough' end as feature
from (select id, time, price,
             lag(price) over (partition by id order by time) beforeprice,
             lead(price) over (partition by id order by time) afterprice
      from t2
     ) tmp
where (price > beforeprice and price > afterprice) or
      (price < beforeprice and price < afterprice);

Wu Xie (Third Master), a rookie in backend development, big data, and artificial intelligence.

Thanks for following.
