In Hive tuning we often need to join a very small table with a large table. How can this be optimized?
This is where MAPJOIN comes in.
When a large table is joined with one or more small tables, MAPJOIN is the best choice: it is much faster than an ordinary JOIN. MAPJOIN can also mitigate data skew. The basic principle: when the data volume is small, all of the small tables named by the user in the SQL are loaded into the memory of the program that executes the JOIN, which speeds up the JOIN.
select /*+MAPJOIN(b)*/ a.a1,a.a2,b.b2 from tablea a JOIN tableb b ON a.a1=b.b1
Caching multiple small tables:
select /*+MAPJOIN(b,c)*/ a.a1,a.a2,b.b2 from tablea a JOIN tableb b ON a.a1=b.b1 JOIN tablec c on a.a1=c.c1
A mapjoin performs the join in the map stage, whereas an ordinary join performs it in the reduce stage, so a mapjoin is more efficient.
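The principle can be illustrated outside Hive with a minimal Python sketch. It is not Hive's implementation: the function name `map_join` and the sample rows are hypothetical. It only shows the idea of building an in-memory hash table from the small table and streaming the large table past it, so that no reduce step is needed.

```python
from collections import defaultdict


def map_join(big_rows, small_rows, big_key, small_key):
    """Inner join that keeps the small table in memory (illustrative only)."""
    # Build phase: hash the small table by its join key.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[small_key]].append(row)
    # Probe phase: stream the big table and emit matches as they are found.
    for row in big_rows:
        for match in lookup.get(row[big_key], []):
            yield {**row, **match}


# Hypothetical sample data mirroring tablea / tableb from the queries above.
tablea = [{"a1": 1, "a2": "x"}, {"a1": 2, "a2": "y"}, {"a1": 3, "a2": "z"}]
tableb = [{"b1": 1, "b2": "one"}, {"b1": 3, "b2": "three"}]
joined = list(map_join(tablea, tableb, "a1", "b1"))
```

Only the small table is held in memory; the large table is read once, row by row, which is why the size of the small table is the limiting factor.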
Before Hive 0.11, you had to write the MAPJOIN hint explicitly to trigger this optimization. Because the small table is loaded into memory, pay attention to its size.
SELECT /*+ MAPJOIN(smalltable)*/ smalltable.key, value FROM smalltable JOIN bigtable ON smalltable.key = bigtable.key
Since Hive 0.11, the optimization is enabled by default. You no longer need to write the MAPJOIN hint explicitly: Hive converts an ordinary JOIN into a MapJoin when appropriate. Two properties control when the optimization is triggered:
hive.auto.convert.join: the default value is true, which enables the automatic MAPJOIN optimization.
hive.mapjoin.smalltable.filesize: the default value is 25000000 bytes (about 25 MB). This property determines which tables qualify for the optimization: a table smaller than this value is loaded into memory.
Note: if the default automatic conversion misbehaves (for example, MAPJOIN does not take effect), set the following two properties to false and use the MAPJOIN hint manually to trigger the optimization.
hive.auto.convert.join=false (turn off automatic MAPJOIN conversion)
hive.ignore.mapjoin.hint=false (do not ignore the MAPJOIN hint)
For queries like the following, the optimization cannot be triggered with the second approach (the explicit MAPJOIN hint):
select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM ( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne) ) firstjoin JOIN smallTableTwo ON (firstjoin.idTwo = smallTableTwo.idTwo)
However, with the first approach (automatic conversion, no MAPJOIN hint), the query above is executed as two map joins (MJ). Furthermore, if you know in advance that the tables are small enough to be loaded into memory, the following properties let Hive merge the two MJs into one.
hive.auto.convert.join.noconditionaltask: whether Hive converts an ordinary JOIN into a MapJoin based on the size of the input files, and whether multiple MJs are merged into one.
hive.auto.convert.join.noconditionaltask.size: when multiple MJs are merged into one, the total size of the tables must be less than this value, and hive.auto.convert.join.noconditionaltask must be true.
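The merge rule described by these two properties can be approximated with a small Python sketch. This is a simplification under stated assumptions, not Hive's planner: the function name `count_map_join_steps` is hypothetical, and the byte values are illustrative rather than Hive defaults.

```python
def count_map_join_steps(small_table_bytes, noconditionaltask, size_limit_bytes):
    """Estimate how many map-join steps a chain of small-table joins needs.

    small_table_bytes: sizes of the small tables in the join chain.
    noconditionaltask: stands in for hive.auto.convert.join.noconditionaltask.
    size_limit_bytes: stands in for hive.auto.convert.join.noconditionaltask.size.
    """
    if not small_table_bytes:
        return 0
    # All small tables fit under the limit together: one merged map join.
    if noconditionaltask and sum(small_table_bytes) < size_limit_bytes:
        return 1
    # Otherwise each small table gets its own map join.
    return len(small_table_bytes)


# Illustrative sizes only (3 MB + 4 MB fits under 10 MB; 3 MB + 9 MB does not).
steps_merged = count_map_join_steps([3_000_000, 4_000_000], True, 10_000_000)
steps_separate = count_map_join_steps([3_000_000, 9_000_000], True, 10_000_000)
```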
When using MAPJOIN, keep the following in mind:
* In a LEFT OUTER JOIN, the left table must be the large table;
* In a RIGHT OUTER JOIN, the right table must be the large table;
* In an INNER JOIN, either the left or the right table can be the large table;
* FULL OUTER JOIN cannot use MAPJOIN;
* MAPJOIN supports a subquery as the small table;
* When the MAPJOIN hint refers to a small table or a subquery, it must refer to the alias;
* In a MAPJOIN you can use non-equi join conditions, or combine multiple conditions with OR;
* ODPS currently supports at most 6 small tables in a MAPJOIN; exceeding this raises a syntax error;
* With MAPJOIN, the total memory occupied by all small tables must not exceed 512M (logical data size after decompression).
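Several of these rules can be expressed as a pre-flight check. The Python sketch below is only an illustration: the function `check_mapjoin`, its parameters, and the message texts are hypothetical, and only the limits (6 small tables, 512M in total) come from the list above.

```python
MAX_SMALL_TABLES = 6   # limit on small tables stated above
MAX_TOTAL_MB = 512     # total memory limit stated above


def check_mapjoin(join_type, small_side, small_table_sizes_mb):
    """Return a list of rule violations; an empty list means no rule is broken.

    join_type: "inner", "left outer", "right outer" or "full outer".
    small_side: which side the small table(s) are on, "left" or "right".
    """
    problems = []
    kind = join_type.lower()
    if kind == "full outer":
        problems.append("FULL OUTER JOIN cannot use MAPJOIN")
    elif kind == "left outer" and small_side == "left":
        problems.append("LEFT OUTER JOIN: the left table must be the large table")
    elif kind == "right outer" and small_side == "right":
        problems.append("RIGHT OUTER JOIN: the right table must be the large table")
    if len(small_table_sizes_mb) > MAX_SMALL_TABLES:
        problems.append("more than 6 small tables")
    if sum(small_table_sizes_mb) > MAX_TOTAL_MB:
        problems.append("small tables exceed 512M in total")
    return problems


ok = check_mapjoin("left outer", "right", [100, 200])
bad = check_mapjoin("left outer", "left", [300, 300])
```

The check deliberately covers only the join-type, table-count, and memory rules; the alias and subquery rules concern SQL syntax and are not modeled here.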
MAPJOIN decision logic
Both of the following conditions must be met:
1) In the join stage, max(join instance run time) > 10 minutes && max(join instance run time) > 2 * avg(join instance run time)
2) The smallest table participating in the join is less than 100M (logical data size before decompression)
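The two conditions translate directly into a predicate. In the Python sketch below, the function name `meets_mapjoin_conditions` and the sample numbers are hypothetical; the thresholds (10 minutes, twice the average, 100M) are the ones stated above.

```python
def meets_mapjoin_conditions(instance_minutes, table_sizes_mb):
    """Check both conditions for a join stage.

    instance_minutes: run time of each join instance, in minutes.
    table_sizes_mb: logical size of each table in the join, in M.
    """
    longest = max(instance_minutes)
    average = sum(instance_minutes) / len(instance_minutes)
    # Condition 1: the slowest instance is slow in absolute terms and
    # more than twice as slow as the average instance (a sign of skew).
    skewed = longest > 10 and longest > 2 * average
    # Condition 2: the smallest table is under 100M.
    small_enough = min(table_sizes_mb) < 100
    return skewed and small_enough


# Hypothetical measurements: one straggler at 30 minutes, one 50M table.
skewed_case = meets_mapjoin_conditions([3, 4, 30], [50, 80_000])
even_case = meets_mapjoin_conditions([12, 12, 13], [50, 80_000])
```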
Custom MAPJOIN memory settings
set odps.sql.mapjoin.memory.max=512 sets the maximum memory for the small tables in a mapjoin. The default is 512, the unit is M, and the value can be adjusted within [128, 2048].
The following example is fairly comprehensive: it involves data skew, and it also shows how to use mapjoin when the "small table" is not that small (>512M).
select * from log a left outer join users b on a.user_id = b.user_id;
The log table (log) usually has a very large number of records, but the user table (users) is not small either: it has more than 6 million records. Distributing users to every map task is a significant cost, and map join does not support a "small table" of this size. With an ordinary join, however, we run into data skew. A workaround is to first reduce users to the rows whose user_id actually appears in log, and then map-join that reduced result:
select /*+mapjoin(b)*/ * from log a left outer join ( select /*+mapjoin(c)*/ d.* from ( select distinct user_id from log ) c join users d on c.user_id = d.user_id ) b on a.user_id = b.user_id;
This solution assumes that the daily member UV is not too large, that is, count(distinct user_id) in the log table is not too big.
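The shape of that query can be mimicked in plain Python to show why it helps. The sketch below is an illustration with hypothetical names (`left_outer_join_active_users`) and made-up rows, not a model of how ODPS or Hive executes the SQL.

```python
def left_outer_join_active_users(log_rows, users_rows):
    """Mimic the nested mapjoin query on in-memory lists of dicts."""
    # Subquery c: distinct user_id values in log (small if daily UV is small).
    active_ids = {row["user_id"] for row in log_rows}
    # Subquery b: users restricted to those ids, now small enough to cache.
    active_users = {
        user["user_id"]: user
        for user in users_rows
        if user["user_id"] in active_ids
    }
    # Outer query: stream log and look each row up in the cached users;
    # rows without a matching user are kept with None (left outer join).
    joined = [(row, active_users.get(row["user_id"])) for row in log_rows]
    return joined, len(active_users)


# Made-up data: 1000 users, but only users 7 and 42 appear in log.
users = [{"user_id": i, "name": f"user{i}"} for i in range(1000)]
log = [
    {"user_id": 7, "page": "a"},
    {"user_id": 42, "page": "b"},
    {"user_id": 7, "page": "c"},
    {"user_id": 5000, "page": "d"},  # no matching user
]
joined, cached_users = left_outer_join_active_users(log, users)
```

Only two user rows end up cached instead of the full users table, which is the effect the premise above depends on: the saving disappears if most users appear in log on a given day.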
This article comes from the WeChat official account Big data is fun (havefun_bigdata).
Originally published: 2021-01-17