MySQL's join function is weak?

Programmer Li Xiaobing 2020-11-11 22:42:38
mysql join function weak

Hello everyone , I'm Li Xiaobing , Today we will learn and make complaints about it MySQL Of Join function .

About MySQL Of join, You must have known a lot about it “ Anecdotes and anecdotes ”, Like two watches join We need small tables to drive big ones , Alibaba developer specifications prohibit more than three tables join operation ,MySQL Of join It's weak, it's exploding, and so on . These norms or statements are true or false , The time is right and the time is wrong , It's up to you to do it yourself join You can understand clearly only when you have a deep understanding .

below , Let's have a comprehensive understanding of MySQL Of join operation .


In daily database queries , We often need to join multiple tables to obtain the merged data of multiple tables at one time , This is about to use the database join grammar .join It is very common in the data field to merge two data sets , If you know more , Will find MySQL,Oracle,PostgreSQL and Spark All support this operation . The protagonist of this article is MySQL, If not specified below , That is to say MySQL Of join As a subject . and Oracle ,PostgreSQL and Spark It can be regarded as the big hanging boss, the join The optimization and implementation of the algorithm are better than MySQL.

MySQL Of join There are many rules , Maybe a little careless , Maybe a bad one join Statement will not only result in a full table query for a table , also May affect the cache of the database , As a result, most of the hot data has been replaced , Drag down the entire database performance .

therefore , The industry is targeting MySQL Of join Summed up a lot of norms or principles , For example, small tables drive large tables and prohibit more than three tables join operation . Next we will introduce MySQL join The algorithm of , and Oracle and Spark Of join To achieve contrast , And interspersed with the answer why the above-mentioned norms or principles are formed .

about join Implementation of operations , There are about Nested Loop Join ( Loop Nested connections ),Hash Join( Hash connection ) and Sort Merge Join( Sort merge join ) Three common algorithms , They have their own advantages and disadvantages and applicable conditions , Next, we will introduce .

MySQL Medium Nested Loop Join Realization

Nested Loop Join It's a scan driver table , Every time you read a record , According to join The index on the associated field of is driven to query the corresponding data in the driven table . It is suitable for scenarios with smaller subsets of connected data , So is it MySQL join The only algorithm to implement , We'll talk about it in detail next .

MySQL There are two Nested Loop Join Variations of the algorithm , Namely Index Nested-Loop Join and Block Nested-Loop Join.

Index Nested-Loop Join Algorithm

below , Let's initialize the relevant table structure and data

`id` int(11) NOT NULL,
`a` int(11) DEFAULT NULL,
`b` int(11) DEFAULT NULL,
KEY `a` (`a`)
delimiter ;;
# Define stored procedures to initialize t1
create procedure init_data()
declare i int;
set i=1;
insert into t1 values(i, i, i);
set i=i+1;
end while;
delimiter ;
# Call the store to initialize t1
call init_data();
# Create and initialize t2
create table t2 like t1;
insert into t2 (select * from t1 where id<=500)

According to the above order , Both tables have a primary key index id And an index a, Field b No index on . stored procedure init_data To the table t1 There's something in it 10000 Row data , In the table t2 It's inserted in 500 Row data .

for fear of MySQL The optimizer chooses the table as the driver table itself , The impact analysis SQL Statement execution , We use it directly straight_join To make the MySQL Use a fixed join table order to query , In the following sentence ,t1 It's the drive meter ,t2 It's driven watch .

select * from t2 straight_join t1 on (t2.a=t1.a);

Before using us article To introduce the explain Command to see the execution plan of the statement .

You can see from the above picture that ,t1 On the table a Fields are indexed by ,join The index is used in the procedure , So it's time to SQL The execution flow of the statement is as follows :

  • from t2 Read a row of data from the table L1;
  • Use L1 Of a Field , Go to t1 Table as a condition to query ;
  • Take out t1 The line that satisfies the condition in , Follow L1 Form the corresponding lines , Become part of the result set ;
  • repeat , Until the scan is done t2 surface .

We call this process Index Nested-Loop Join, abbreviation NLJ, Its corresponding flow chart is shown below .

It should be noted that , In the second step , according to a Field to table t1 When querying in , Index is used , So each scan only scans one line ( from explain It turns out that , It varies according to different case scenarios ).

Suppose the number of rows driving the table is N, The number of rows in the driven table is M. Because here join Statement execution , The driving table is to scan the whole table , The driven table uses the index , And each row in the table is driven by the query , So the whole thing join The approximate complexity of the process is N2log2M. obviously ,N It has a greater impact on the number of scan lines , Therefore, in this case, the small table should be used as the driving table .

Of course , The premise of all this is join The associated field of is a, also t1 Tabular a There is an index on the field .

If there is no index , When using the execution process shown in the figure above , Every time I come to t1 To match , It's going to be a full table scan . This also leads to the time complexity of the whole process programming N * M, This is unacceptable . therefore , When there is no index ,MySQL Use Block Nested-Loop Join Algorithm .

Block Nested-Loop Join

Block Nested-Loop Join The algorithm of , abbreviation BNL, It is MySQL Used when no index is available on the driven table join Algorithm , The specific process is as follows :

  • Keep watch t2 Of the current thread join_buffer in , Examples in this article SQL Not in the t2 Do any conditional filtering on , So that means t2 The whole table Put it in memory ;
  • Scan table t1, Each row of data is taken out , Just follow join_buffer Compare the data in , Satisfy join Conditions of the , Put it in the result set .

For example, the following one SQL

select * from t2 straight_join t1 on (t2.b=t1.b);

In this sentence explain The results are shown below . It can be seen that

It can be seen that , This time, join Process right t1 and t2 We did a full scan , And will table t2 Medium 500 All the data are put into memory join_buffer in , And for tables t1 Every row of data in , All going join_buffer Go through it again , Do it all 500 Second comparison , So it's a total of 500 * 10000 Secondary memory comparison operation , The specific process is shown in the figure below .

The main thing is , In the first step , It's not going to watch t2 Put all the data in join_buffer, It's based on specific SQL sentence , And put in different rows of data and different fields . For example, the following one join Statement will only change the table t2 In line with b >= 100 Data. b Fields are stored in join_buffer.

select t2.b,t1.b from t2 straight_join t1 on (t2.b=t1.b) where t2.b >= 100;

join_buffer It's not infinite , from join_buffer_size control , The default value is 256K. When the data to be stored is too large , It's just segmented storage , The whole execution process becomes :

  • Scan table t2, Save the eligible data lines into join_buffer, Because it's limited in size , Deposit in 100 The line is full of , Then go to step two ;
  • Scan table t1, Each row of data is taken out , Just follow join_buffer Compare the data in , Satisfy join Conditions of the , Put it in the result set ;
  • Empty join_buffer;
  • Take the first step again , Until all the data is scanned , because t2 Table has 500 Row data , So it's repeated 5 Time

This process reflects the name of the algorithm Block The origin of , Do it in chunks join operation . Because of the watch t2 The data is divided into 5 Deposit in join_buffer, Cause table t1 To be scanned by the full table 5 Time .

All deposited in branch 5 Deposit in
Memory operations 10000 * 500 10000 * (100 + 100 + 100 + 100 + 100)
Number of scanning lines 10000 + 500 10000 * 5 + 500

As shown above , And table data can be stored in join_buffer comparison , The number of memory judgments does not change , It's the product of the number of rows in two tables , That is to say 10000 * 500, But the driven table is scanned many times , Every time you deposit , The driven table needs to be scanned once , It affects the final efficiency of execution .

Based on the above two algorithms , We can draw the following conclusion , It's also most online about MySQL join The specification of the statement .

  • There is an index on the driven table , That is to say, you can use Index Nested-Loop Join When the algorithm , have access to join operation .

  • Whether it's Index Nested-Loop Join Algorithm or Block Nested-Loop Join We should use the small table as the driving table .

Because of the above two join The time complexity of the algorithm At least It also has a first-order relationship with the number of rows involved in the table , And it costs a lot of memory space , Therefore, the Alibaba developer specification says that it is strictly forbidden to use more than three tables join The operation is understandable .

But these two algorithms are just join One of the algorithms of , also More efficient join Algorithm , such as Hash Join and Sorted Merged join. Unfortunately, these two algorithms MySQL Currently, none of the mainstream versions of , and Oracle ,PostgreSQL and Spark They all support , Make complaints about online Tucao MySQL The reason for the weak explosion (MySQL 8.0 Version supports Hash join, however 8.0 It's not the mainstream version yet ).

In fact, Alibaba developer specifications are also from Oracle Migrate to MySQL when , because MySQL Of join Operation performance is too poor and decided to prohibit more than three tables join Operating regulations .

Hash Join Algorithm

Hash Join It's a scan driver table , utilize join Creates a hash table in memory , Then scan the driven table , Each read line of data , And find the corresponding data from the hash table . It is a common way to connect large data sets , Small amount of data is suitable for driving table , Scenes that can be put into memory , For it Large tables without indexes And parallel query scenarios can provide the best performance . Unfortunately, it is only applicable to the scenario of equivalent connection , such as on = where b.a_id.

Again, the above two tables join The sentence of , The implementation process is as follows

  • Will drive the table t2 Get the data that meets the conditions in , For each line join The field value is carried out hash operation , It is then stored in a hash table in memory ;
  • Traversing the driven table t1, Each row of eligible data is taken out , It's also about it join The field value is carried out hash operation , Take the results to a hash table in memory to find a match , If you find , The result set becomes part of .

It can be seen that , The algorithm and Block Nested-Loop Join There are similarities , It's just going to be disorderly Join Buffer It's a hash table hash table, So that data matching doesn't need to change join buffer Go through all the data in , But directly through hash, To approach O(1) The time complexity of getting the matching line , These two tables are greatly improved join Speed .

But due to the hash Characteristics of , The algorithm can only be applied to the scene of equivalent connection , Other scenarios cannot be connected with this algorithm .

Sorted Merge Join Algorithm

Sort Merge Join It is based on join The associated fields of the two tables sort the two tables ( If it's already sorted , For example, if there is an index on a field, you don't need to sort again ), Then merge the two tables once . If the two tables have been sorted , There is no need to sort when performing sort merge connections , At this time Merge Join The performance will be better than Hash Join.Merge Join It can be applied to non equivalence Join(>,<,>=,<=, But it doesn't include !=, That is to say <>).

It should be noted that , If the linked field already has an index , That is to say, if it has been ordered , You can merge directly , But if the linked field has no index , The execution process is shown in the figure below .

  • Traversal table t2, Read the data that meets the conditions , According to the connection field a Sort by ;
  • Traversal table t1, Read the data that meets the conditions , Also follow the join field a Sort by ;
  • Merge two sorted data , Get the result set .

Sorted Merge Join The main time consumption of the algorithm lies in sorting the two tables , So if the two tables have already been sorted by join field , The algorithm is even better than Hash Join The algorithm is faster . In one case , The algorithm is better than Nested Loop Join The algorithm should be fast .

below , Let's summarize the differences, advantages and disadvantages of the above three algorithms .

Nested Loop Join Hash Join Sorted Merge Join
Connection condition Apply to any condition Only for equivalent connections (=) Equivalent or non equivalent connection (>,<,=,>=,<=),‘<>’ With the exception of
The main consumption of resources CPU、 disk I/O Memory 、 Temporary space Memory 、 Temporary space
characteristic When there is a highly selective index or restricted search, it is more efficient , Can quickly return to the first search results When there is no index or index condition is fuzzy ,Hash Join Than Nested Loop It works . Often than Merge Join fast . In the data warehouse environment , If there are many records in the table , Efficient When there is no index or index condition is fuzzy ,Sort Merge Join Than Nested Loop It works . When fields are indexed ahead of time , Than hash join fast , And support more connection conditions
shortcoming No index or table records are inefficient Building a hash table requires a lot of memory , The first time the results return slowly All tables need to be sorted . It's designed for optimal throughput , And do not return data until all results are found
Need index yes ( It's inefficient without indexing ) no no

about Join Understanding of operation

Finished. Join Related algorithms , Let's also talk about join Operational business understanding .

When the business is not complex , majority join It's not irreplaceable . For example, in order records, only the order user's user_id, You need to get the user's name when returning information , There are several possible implementations :

  1. A database operation , Use join operation , Order table and user table join, Return with user name ;
  2. Two database operations , Two queries , Get order information for the first time and user_id, The second is based on user_id Take your name , Using code programs to merge information ;
  3. Use redundant user names or from ES Reading relational databases in China and Africa .

The above solutions can solve the problem of data aggregation , And based on the program code to deal with , Than the database join Easier to debug and optimize , For example, the user name is not taken from the database , Instead, look for it in the cache first .

Of course , join The operation is not worthless , So every technology has its own usage scenarios , The above solutions or rules are summed up by the Internet development team , For high concurrency 、 Write lightly and read more 、 Distributed 、 Simple business logic , These scenarios generally do not require high data consistency , Even dirty reading is allowed .

however , In financial bank or financial and other enterprise application scenarios ,join Operation is indispensable , These applications are generally low concurrency 、 Frequent complex data writes 、CPU Dense, not IO Concentrated , The main business logic is processed through the database and even contains a large number of stored procedures 、 Systems with high requirements for consistency and integrity .

Personal blog , Welcome to play

本文为[Programmer Li Xiaobing]所创,转载请带上原文链接,感谢

  1. 【计算机网络 12(1),尚学堂马士兵Java视频教程
  2. 【程序猿历程,史上最全的Java面试题集锦在这里
  3. 【程序猿历程(1),Javaweb视频教程百度云
  4. Notes on MySQL 45 lectures (1-7)
  5. [computer network 12 (1), Shang Xuetang Ma soldier java video tutorial
  6. The most complete collection of Java interview questions in history is here
  7. [process of program ape (1), JavaWeb video tutorial, baidu cloud
  8. Notes on MySQL 45 lectures (1-7)
  9. 精进 Spring Boot 03:Spring Boot 的配置文件和配置管理,以及用三种方式读取配置文件
  10. Refined spring boot 03: spring boot configuration files and configuration management, and reading configuration files in three ways
  11. 精进 Spring Boot 03:Spring Boot 的配置文件和配置管理,以及用三种方式读取配置文件
  12. Refined spring boot 03: spring boot configuration files and configuration management, and reading configuration files in three ways
  13. 【递归,Java传智播客笔记
  14. [recursion, Java intelligence podcast notes
  15. [adhere to painting for 386 days] the beginning of spring of 24 solar terms
  16. K8S系列第八篇(Service、EndPoints以及高可用kubeadm部署)
  17. K8s Series Part 8 (service, endpoints and high availability kubeadm deployment)
  18. 【重识 HTML (3),350道Java面试真题分享
  19. 【重识 HTML (2),Java并发编程必会的多线程你竟然还不会
  20. 【重识 HTML (1),二本Java小菜鸟4面字节跳动被秒成渣渣
  21. [re recognize HTML (3) and share 350 real Java interview questions
  22. [re recognize HTML (2). Multithreading is a must for Java Concurrent Programming. How dare you not
  23. [re recognize HTML (1), two Java rookies' 4-sided bytes beat and become slag in seconds
  24. 造轮子系列之RPC 1:如何从零开始开发RPC框架
  25. RPC 1: how to develop RPC framework from scratch
  26. 造轮子系列之RPC 1:如何从零开始开发RPC框架
  27. RPC 1: how to develop RPC framework from scratch
  28. 一次性捋清楚吧,对乱糟糟的,Spring事务扩展机制
  29. 一文彻底弄懂如何选择抽象类还是接口,连续四年百度Java岗必问面试题
  30. Redis常用命令
  31. 一双拖鞋引发的血案,狂神说Java系列笔记
  32. 一、mysql基础安装
  33. 一位程序员的独白:尽管我一生坎坷,Java框架面试基础
  34. Clear it all at once. For the messy, spring transaction extension mechanism
  35. A thorough understanding of how to choose abstract classes or interfaces, baidu Java post must ask interview questions for four consecutive years
  36. Redis common commands
  37. A pair of slippers triggered the murder, crazy God said java series notes
  38. 1、 MySQL basic installation
  39. Monologue of a programmer: despite my ups and downs in my life, Java framework is the foundation of interview
  40. 【大厂面试】三面三问Spring循环依赖,请一定要把这篇看完(建议收藏)
  41. 一线互联网企业中,springboot入门项目
  42. 一篇文带你入门SSM框架Spring开发,帮你快速拿Offer
  43. 【面试资料】Java全集、微服务、大数据、数据结构与算法、机器学习知识最全总结,283页pdf
  44. 【leetcode刷题】24.数组中重复的数字——Java版
  45. 【leetcode刷题】23.对称二叉树——Java版
  46. 【leetcode刷题】22.二叉树的中序遍历——Java版
  47. 【leetcode刷题】21.三数之和——Java版
  48. 【leetcode刷题】20.最长回文子串——Java版
  49. 【leetcode刷题】19.回文链表——Java版
  50. 【leetcode刷题】18.反转链表——Java版
  51. 【leetcode刷题】17.相交链表——Java&python版
  52. 【leetcode刷题】16.环形链表——Java版
  53. 【leetcode刷题】15.汉明距离——Java版
  54. 【leetcode刷题】14.找到所有数组中消失的数字——Java版
  55. 【leetcode刷题】13.比特位计数——Java版
  56. oracle控制用户权限命令
  57. 三年Java开发,继阿里,鲁班二期Java架构师
  58. Oracle必须要启动的服务
  59. 万字长文!深入剖析HashMap,Java基础笔试题大全带答案
  60. 一问Kafka就心慌?我却凭着这份,图灵学院vip课程百度云