Java Concurrency: False Sharing

Five words floating in the sky 2020-11-10 15:18:44


False sharing: the silent performance killer of concurrent programming

 

In concurrent programming, most of our attention goes to controlling access to shared variables at the code level, while the underlying hardware and JVM factors receive far less attention. Some time ago I studied an impressive high-performance asynchronous processing framework, Disruptor, billed as "the fastest messaging framework"; its LMAX architecture can process 6 million orders per second on a single thread. While digging into why Disruptor is so fast, I came across a concept: false sharing. As its documentation notes, write contention on cache lines is the single most important limiting factor on the scalability of parallel threads running on an SMP system. Because it is hard to tell from the code whether false sharing will occur, some people describe it as a silent performance killer.

This article merely consolidates what I have studied so far; I have not yet done deep research or practice. I hope it helps you understand false sharing from scratch.

An informal definition of false sharing: a cache system stores data in units of cache lines. When multiple threads modify variables that are logically independent but happen to share the same cache line, they unintentionally degrade each other's performance. That is false sharing.

Next we analyze the causes and consequences of false sharing in detail. First, we need to understand what the cache system is.

 

1. CPU cache

Baidu Encyclopedia defines the CPU cache as follows:

The CPU cache (cache memory) is a temporary store located between the CPU and main memory; its capacity is much smaller than memory, but its access speed is much higher.
The cache exists mainly to resolve the mismatch between CPU execution speed and memory read/write speed: because the CPU runs much faster than memory can be read or written, the CPU would otherwise spend long stretches waiting for data to arrive or to be written back.
The data in the cache is a small portion of main memory, but it is the portion the CPU is about to access in the near future. When the CPU needs a large amount of data, it can often read it from the cache instead of calling out to main memory, which speeds up reads considerably.

There are several layers of cache between the CPU and main memory, because even direct access to main memory is very slow. If you perform the same operation on a piece of data many times, it makes sense to load that data close to the CPU while the operations run.

Ordered by data access order and how tightly they are coupled to the CPU, caches are divided into level 1 (L1) and level 2 (L2), and some high-end CPUs also have a level 3 (L3) cache. All data held at each level is a subset of the next level, and the closer a cache sits to the CPU, the faster and smaller it is. So L1 is small but fast and sits right next to the CPU core. L2 is bigger and slower, and still private to a single core. L3, common on modern multi-core machines, is bigger and slower again, and is shared by all cores. Finally there is main memory, shared by every core on every socket. A CPU with three cache levels can achieve a hit rate of around 95% across those levels, leaving less than 5% of accesses to go to main memory.

The memory hierarchy of a multi-core machine is shown in the figure below:

When the CPU performs an operation, it first looks for the data it needs in L1, then in L2, then in L3; if none of the caches has it, the data must be fetched from main memory. The farther out it has to go, the longer the operation takes. So if you do something very frequently, make sure the data stays in the L1 cache.
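To get a feel for these costs, below is a minimal, machine-dependent sketch (my addition, not from the original article): it performs the same number of element accesses over a small array that typically stays resident in the near caches and over a large array that has to stream from main memory. Hardware prefetching softens the gap for sequential scans, so treat it as an illustration rather than a rigorous benchmark.

public class CacheLevelDemo {

    // Sum the array 'passes' times; total work = passes * a.length element accesses.
    static long sum(long[] a, int passes) {
        long s = 0;
        for (int p = 0; p < passes; p++) {
            for (int i = 0; i < a.length; i++) {
                s += a[i];
            }
        }
        return s;
    }

    public static void main(String[] args) {
        long[] small = new long[4 * 1024];        // 32 KB: typically L1/L2-resident
        long[] big   = new long[8 * 1024 * 1024]; // 64 MB: larger than most L3 caches

        sum(big, 1); // warm-up: let the JIT compile sum() and touch every page of big

        long t0 = System.nanoTime();
        long s1 = sum(small, 16_384);             // ~67M accesses, served from cache
        long t1 = System.nanoTime();
        long s2 = sum(big, 8);                    // ~67M accesses, streamed from memory
        long t2 = System.nanoTime();

        System.out.printf("small: %d ms, big: %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}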

Martin Thompson has published figures for the cost of a miss at each cache level, shown below:

2. The MESI protocol and RFO requests

As the previous section showed, each core has its own private L1 and L2 caches. So in multithreaded programming, what happens when a thread on another core wants to access cache line data sitting in the current core's L1 or L2?

One suggestion is for core 2 to go directly into core 1's cache line. That is feasible, but not fast enough: cross-core access has to go through the memory controller (the component inside the computer that manages memory and moves data between memory and the CPU). In the typical case where core 2 keeps accessing that data of core 1, every access pays the cross-core cost. Worse still, core 2 and core 1 may not even be on the same socket, and the memory controller's bus bandwidth is limited and cannot carry that much traffic. So CPU designers prefer a different approach: if core 2 needs the data, core 1 sends it over directly, and the data only has to travel once.

So when does this cache line transfer happen? The answer is simple: when one core needs to read another core's dirty cache line. But how does the former know that the latter's cache line is dirty (i.e., has been written)?

The answer lies in a protocol: the MESI protocol, which mainstream processors now use to keep caches coherent with memory. M, E, S and I stand for the four states a cache line can be in under MESI:

M (Modified): the local processor has modified the cache line. The line is dirty: its contents differ from memory, and this cache holds the only copy (it is exclusive).
E (Exclusive): the cache line's contents match memory, and no other processor holds this line of data.
S (Shared): the cache line's contents match memory, and other processors may also hold copies of this line.
I (Invalid): the cache line is invalid and cannot be used.

The transitions between the four states work as follows:

Initial: at the start, the cache line holds no data, so it is in the I state.
Local Write: if the local processor writes data to a cache line in the I state, the line's state becomes M.
Local Read: if the local processor reads a cache line in the I state, the cache obviously holds no data for it. There are two cases: (1) no other processor's cache has this data either, so the data is loaded from memory into this cache line and its state is set to E, meaning "only I hold this data; no other processor does"; (2) another processor's cache does hold this line of data, so this cache line's state is set to S. (Note: for a line in the M state, local reads and writes do not change its state.)
Remote Read: suppose there are two processors, c1 and c2. If c2 needs to read the contents of c1's cache line, c1 sends the line's contents to c2 through the memory controller, and c2 sets its corresponding cache line to S. Before that happens, memory also picks the data up from the bus and stores it.
Remote Write: this is not really a remote write. After c2 obtains c1's data it wants to write it, not read it, so the write itself is local; the problem is that c1 also holds a copy. What to do? c2 sends out an RFO (Request For Ownership), which grants it ownership of this line of data: the corresponding cache line in every other processor is set to I, and nobody except c2 may touch the line. This keeps the data safe, but processing the RFO and performing the invalidations makes the write operation very expensive.

The figure below summarizes the state transitions:
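To make the transition rules concrete, here is a small illustrative Java model of the state machine just described (my addition; in reality these transitions happen in cache hardware, not software):

enum MesiState {
    MODIFIED, EXCLUSIVE, SHARED, INVALID;

    // Local Write: writing always leaves the local copy dirty. (A write to a
    // SHARED line first needs an RFO so that all other copies become INVALID.)
    MesiState onLocalWrite() {
        return MODIFIED;
    }

    // Local Read: an INVALID line must be (re)loaded; otherCachesHoldIt says
    // whether some other processor also has a copy of the line.
    MesiState onLocalRead(boolean otherCachesHoldIt) {
        if (this == INVALID) {
            return otherCachesHoldIt ? SHARED : EXCLUSIVE;
        }
        return this; // M/E/S lines satisfy the read without a state change
    }

    // Remote Read: another processor reads this line, so our copy is demoted
    // to SHARED (a MODIFIED copy is written back to memory on the way).
    MesiState onRemoteRead() {
        return SHARED;
    }

    // Remote Write: another processor issued an RFO for this line, so our
    // copy is invalidated.
    MesiState onRemoteWrite() {
        return INVALID;
    }
}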

As the previous section showed, writes are expensive, especially when an RFO message must be sent. So when we write programs, when do RFO requests occur? There are two cases:

1. A thread's work migrates from one processor to another, so every cache line it operates on must move to the new processor. If a cache line is written afterwards while copies of it still exist elsewhere, an RFO request has to be sent.
2. Two different processors genuinely need to operate on the same cache line.

Next, we need to understand what a cache line is.

 

3. Cache lines

As mentioned at the beginning of the article, a cache system stores data in units of cache lines. A cache line is usually 64 bytes (this article assumes 64 bytes; other sizes such as 32 bytes do not affect the argument), and it effectively maps a block of addresses in main memory. A Java long is 8 bytes, so one cache line can hold 8 long variables. Therefore, if you access a long array and one of its values is loaded into the cache, 7 more are loaded along with it, which lets you traverse the array very quickly. In fact, you can quickly traverse any data structure allocated in a contiguous block of memory. But if the items of your data structure are not adjacent to each other in memory (a linked list, say), you get none of this free cache loading, and each item may incur a cache miss.
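As a rough illustration of this "free cache loading" (my addition, not from the original; boxing and pointer chasing both contribute to the linked-list cost, so this is a sketch rather than a clean benchmark), compare summing the same values stored in a contiguous long[] and in a LinkedList whose nodes are scattered across the heap:

import java.util.LinkedList;
import java.util.List;

public class TraversalDemo {
    public static void main(String[] args) {
        int n = 2_000_000;
        long[] array = new long[n];
        List<Long> list = new LinkedList<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            list.add((long) i);
        }

        long t0 = System.nanoTime();
        long s1 = 0;
        for (int i = 0; i < n; i++) {
            s1 += array[i]; // sequential: 8 longs per 64-byte line, prefetch-friendly
        }
        long t1 = System.nanoTime();

        long s2 = 0;
        for (long v : list) {
            s2 += v;        // pointer chasing: potentially a cache miss per node
        }
        long t2 = System.nanoTime();

        System.out.printf("array: %d ms, linked list: %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}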

Now consider this scenario: multiple threads operate on different member variables that happen to live in the same cache line. What happens then? Exactly: the false sharing problem. There is a classic example from the Disruptor project, shown below:

In the figure above, a thread running on core1 wants to update the value of variable X while a thread running on core2 wants to update the value of variable Y. Unfortunately, these two frequently modified variables sit in the same cache line. The two threads take turns sending RFO messages to claim ownership of the line. When core1 gains ownership and starts updating X, core2's copy of the line must be set to the I state. When core2 gains ownership and starts updating Y, core1's copy must be set to I (invalidated). Claiming ownership back and forth not only generates a flood of RFO messages; whenever a thread needs to read the line, its L1 and L2 hold only invalid data and only L3 has an up-to-date copy. As we saw earlier, reading from L3 hurts performance badly. A cross-socket read is worse still: if L3 also misses, the line can only be loaded from memory.

On the surface, X and Y are operated on by independent threads, and the two operations have nothing to do with each other. They merely share one cache line, but all of the contention stems from that sharing.

 

4. Encountering false sharing

Now let us corroborate this with a code experiment.

public class FalseShareTest implements Runnable {
    public static int NUM_THREADS = 4;
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;
    private static VolatileLong[] longs;
    public static long SUM_TIME = 0L;

    public FalseShareTest(final int arrayIndex) {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception {
        Thread.sleep(10000);
        for (int j = 0; j < 10; j++) {
            System.out.println(j);
            if (args.length == 1) {
                NUM_THREADS = Integer.parseInt(args[0]);
            }
            longs = new VolatileLong[NUM_THREADS];
            for (int i = 0; i < longs.length; i++) {
                longs[i] = new VolatileLong();
            }
            final long start = System.nanoTime();
            runTest();
            final long end = System.nanoTime();
            SUM_TIME += end - start;
        }
        System.out.println("Average elapsed time: " + SUM_TIME / 10);
    }

    private static void runTest() throws InterruptedException {
        Thread[] threads = new Thread[NUM_THREADS];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new FalseShareTest(i));
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public void run() {
        long i = ITERATIONS + 1;
        while (0 != --i) {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // cache line padding: comment out this line for the unpadded run
    }
}
 

The logic of the code is simple: four threads modify different elements of the same array. Each element is a VolatileLong, which has a single long member, value, plus 6 unused long members. value is declared volatile so that modifications to it are visible to all threads. The program is run in two configurations: one with the padding line kept (see the "comment out this line" marker), and one with it commented out. To "guarantee" reasonably reliable numbers, each configuration is run 10 times and the execution times averaged. The results below come from a 32-bit Windows machine with a quad-core CPU and 8 GB of RAM:

         

(Screenshots: run with the padding line kept vs. run with it commented out)

The two programs have identical logic, yet the unpadded run takes about 2.5 times as long as the padded one. Unbelievable! Let us analyze it in terms of false sharing. In the unpadded run, the longs array has 4 elements, and since each VolatileLong then contains only one long, the entire array is loaded into a single cache line (on this 32-bit JVM, an unpadded VolatileLong is roughly an 8-byte object header plus an 8-byte value, 16 bytes in all, so four adjacent instances fit exactly into one 64-byte line). With 4 threads operating on that cache line at once, false sharing quietly takes hold.

Based on this, we have reason to believe that, within a certain range of thread counts (think about why this emphasis matters), the more threads there are, the more frequently false sharing occurs and, intuitively, the longer the execution time. To verify this, I ran the test on the same machine with 1, 2, 4 and 8 threads, both with and without padding, again averaging the execution time over 10 runs. The results are shown below:

 

5. How to avoid false sharing?

One solution is to arrange for the objects operated on by different threads to live in different cache lines.

How do we do that? The answer is already in the line of code we marked: cache line padding. Analyzing the example above: we know a cache line holds 64 bytes, and a Java object header takes a fixed 8 bytes (on a 32-bit JVM) or 12 bytes (on a 64-bit JVM with compressed oops, the default; 16 bytes with compression disabled). So we just pad with 6 unused long fields, 6 * 8 = 48 bytes, which pushes different VolatileLong objects into different cache lines and avoids false sharing. (On a 64-bit JVM it does not matter that the object then exceeds 64 bytes; what matters is that different threads never operate on the same cache line.)
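Putting that arithmetic into a sketch (my addition; exact field layout is JVM-specific, and as noted below the JIT may eliminate unused fields):

// Assumes 64-byte cache lines and a 64-bit HotSpot JVM with compressed
// oops enabled, i.e. a 12-byte object header.
public final class PaddedVolatileLong {
    public volatile long value = 0L;    // 12-byte header + 8-byte value = 20 bytes
    public long p1, p2, p3, p4, p5, p6; // + 6 * 8 = 48 bytes of padding = 68 bytes
    // 68 >= 64, so two adjacent instances can never share a cache line
    // (provided the JVM actually keeps the padding fields; see below).
}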

False sharing is easy to trigger in multi-core programming, and it is well hidden. For example, JDK's LinkedBlockingQueue holds a reference to the queue head, head, and a reference to the queue tail, tail. This queue is often used in asynchronous programming, and the two references are usually modified by different threads, yet they may well sit in the same cache line, producing false sharing. The more threads and the more cores, the greater the negative impact on performance.

Because of Java compiler and JIT optimization strategies, unused fields may be eliminated during compilation. We can add some code to the program to keep them from being optimized away, as follows:

public static long preventFromOptimization(VolatileLong v) {
    return v.p1 + v.p2 + v.p3 + v.p4 + v.p5 + v.p6;
}

Another technique is to use compiler directives to force each variable to be aligned.

The following code uses __declspec(align(n)), with n = 64, to tell the compiler explicitly to align each variable on a cache line boundary:

__declspec (align(64)) int thread1_global_variable;
__declspec (align(64)) int thread2_global_variable;

When working with arrays, pad at the end of each element so that every data element starts on a cache line boundary. If you cannot guarantee that the array itself is aligned on a cache line boundary, pad the data structure (the array element) to twice the cache line size. The code below explicitly pads the data structure to match the cache line, and uses __declspec(align(n)) to ensure the array is aligned as well. If the array is allocated dynamically, you can over-allocate and round the pointer up to a cache line boundary.

 
struct ThreadParams
{
// For the following 4 variables: 4*4 = 16 bytes
unsigned long thread_id;
unsigned long v; // Frequent read/write access variable
unsigned long start;
unsigned long end;
// expand to 64 bytes to avoid false-sharing 
// (4 unsigned long variables + 12 padding)*4 = 64
int padding[12];
};
 

Beyond this, there is plenty of research on false sharing online, including some proposed approaches based on data fusion; interested readers can look into them.

6. What should we do about false sharing in real development?

After the lengthy introduction above, we now know how false sharing affects programs. So, in real production development, must we always solve potential false sharing problems by padding cache lines?

Not necessarily.

First, as emphasized repeatedly, false sharing is well hidden: we cannot detect false sharing events at the system level with tools. Second, different machines have different microarchitectures (for example, Java object sizes differ between 32-bit and 64-bit systems), so if the design must be cross-platform, the right padding is even harder to pin down; an exact padding scheme only suits one specific platform. Furthermore, cache capacity is limited; padding wastes precious cache space, so it is not suitable for broad application. Finally, the L1 caches of today's mainstream Intel microarchitectures already achieve hit rates above 80%.

In summary, not every system is worth spending a great deal of effort on to fix potential false sharing.

Copyright notice
This article was written by [Five words floating in the sky]. When reprinting, please include a link to the original. Thank you.
