Don't just download any JDK: Elasticsearch's first big pitfall in a domestic ARM environment

Tencent cloud middleware team 2021-03-30 07:36:21

Introduction: In this article, the author works through a JDK performance problem, starting from stack analysis, moving on to GC analysis, and then to Safepoint cause analysis, finally tracing the root cause to the JDK build in use. The experience has been written up to share with all Java developers.


The problem appears

I recently ran into the following problem at work: a customer brought a new Elasticsearch application online, but after running for a while it became very slow, with queries even timing out. Restarting the service restored it, but the problem recurred every 3–4 hours.

A colleague helped with a quick first-pass analysis of this problem and found a large number of Merge threads. So what next? Based on my past experience locating problems, analyzing Thread Dump logs usually points in the right direction for the cause, and the problem recurs on a regular cycle, so it should be catchable. Analyzing it, then, should not be complicated. But then came the twists and turns.


Confusing stack logs

Because of network isolation, Thread Dump logs could only be obtained with the customer's cooperation. I stressed the capture technique to the customer: on each node, take a dump every few seconds and write each one to a separate file. The cluster involves three nodes; let's call them 39, 158, and 211. After the problem recurred, we got the first batch of Thread Dump files:
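As a rough sketch of that capture procedure (the node label, PID, count, and interval below are placeholders, not the customer's actual values), a loop like the following writes each `jstack` output to its own timestamped file:

```python
import subprocess
import time
from datetime import datetime

def dump_filename(node, ts=None):
    """One file per capture, e.g. 'node39_20200910-135943.dump'."""
    ts = ts or datetime.now()
    return f"node{node}_{ts.strftime('%Y%m%d-%H%M%S')}.dump"

def collect_thread_dumps(pid, node, count=10, interval_s=5):
    """Capture `count` thread dumps of JVM process `pid`, a few seconds apart."""
    for _ in range(count):
        result = subprocess.run(["jstack", str(pid)],
                                capture_output=True, text=True, check=True)
        with open(dump_filename(node), "w") as f:
            f.write(result.stdout)
        time.sleep(interval_s)
```

Writing each snapshot to a separate file, rather than appending, is what later lets the files be compared by size and by timestamp.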

From the file sizes alone, it is easy to guess that node 39 is the problem node: its Thread Dump log is noticeably larger. When queries slow down or hang, it usually shows up as a large number of busy worker threads; in other words, the number of active threads rises sharply. And under the default query behavior of ES (Elasticsearch, hereafter ES), a problem on even one node drags down the entire query.

So we picked one Thread Dump file from each of the three nodes and compared the overall thread picture:

[Table: total number of threads per node — the table's contents were not preserved in this copy.]

Statistics grouped by thread pool:

[Table: per-node thread counts broken down by thread pool, including Lucene Merge Thread — the table's contents were not preserved in this copy.]

We can see that node 39 has noticeably more Lucene Merge Threads, while the other two nodes have no Merge threads at all.

A deeper analysis of node 39's Thread Dump file turned up the following anomalies:

1. There are 77 Lucene Merge Threads; the call stack of one of them looks like this:

2. Eight threads are competing for the lock on an ExpiringCache:

3. Eight threads are all busy in HashMap#hash computation:

Phenomenon 1 shows 77 Merge threads running at the same time, but we cannot tell whether these Merge tasks were triggered simultaneously or gradually piled up because the system was too slow to keep up. Either way, it looks like an important clue. Since this is a newly launched application, studying the environment is just as important as studying the usage pattern:

  • The cluster has 3 nodes and currently 500+ indices. The number of write-active shards on each node is around 70.
  • One index is created per tenant, and each tenant creates 3 indices. The service only recently went online, so write throughput is low: every minute, each index flushes segments of between a few KB and a few MB.

I began to suspect this particular usage pattern: the cluster has many write-active indices, but each receives very little data per minute, on the order of KB to MB. That means flushes are triggered periodically rather than by exceeding a preset size threshold, and this style of writing produces a large number of small files. Sampling segment files from several new indices confirmed that each newly generated file is indeed very small.

Regarding phenomenon 2, I read the source code carefully:

  • UnixFileSystem needs to canonicalize the path of each new file according to the operating system's rules.
  • The canonicalization results are cached in an ExpiringCache object.

Multiple threads scrambling to call ExpiringCache#put reflects high-frequency churn in the file list, which in turn indicates frequent Flush and Merge operations in the system.

This reinforced my suspicion about the usage pattern: with "continuous drizzle"-style writes, flushes are triggered passively, and if the flush periods line up, the probability of many shards flushing, and then merging, at the same time goes up.

So I started to simulate this usage in a test environment: create a similar number of shards and control the write frequency, planning to let the test program run for at least a day to see whether the problem could be reproduced. While it ran, I continued digging into the Thread Dump logs.

In phenomenon 3, a single hash computation, which should be trivial, appears very slow. Putting the three phenomena together, every call site is busy doing CPU computation; logically, the CPUs should be extremely busy.

When the problem recurred on site, the customer helped capture CPU usage and load information, and the result showed the CPUs were very idle. A colleague had earlier checked IO resources, which were also idle. That rules out system resources. We also noticed that the recurrence happens on random nodes and is not tied to a particular machine.

A day later, the problem had not reproduced in my local test environment. The usage pattern may be inelegant, but it did not seem to be the crux of the problem.


Strange STW pauses

When the jstack command captures a Thread Dump log, it needs the JVM process to enter a Safepoint, which means the whole process is paused first. The Thread Dump log obtained is therefore a snapshot of every thread at the instant the process was paused.

All the busy threads happen to be doing CPU computation, yet the CPUs are not busy. This suggested that the GC logs needed further investigation.

The field deployment did not have GC logging enabled. Since the problem had not yet recurred, checking GC counts and GC times with the jstat tool would not tell us much. I asked the field staff to manually add the following parameters to jvm.options to enable GC logging:
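The original screenshot of the parameters is not preserved here; on JDK 8, a typical jvm.options fragment for this purpose (a sketch with a placeholder log path, not necessarily the exact set used on site) would look like:

```
-Xloggc:/path/to/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
```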


Adding PrintGCApplicationStoppedTime records every STW (Stop-The-World) pause of the JVM process in the GC log. Usually such STW pauses are caused by GC, but they may also be related to biased locking.

Right after the restart, the field staff tailed the GC log and sent back the result, to confirm the configuration had taken effect. Strangely, the freshly restarted process was printing STW log lines non-stop:

A brief explanation of these STW log lines ("…Total time for which application threads were stopped…") is in order:

The JVM sometimes needs to perform global operations, typically GC or biased-lock revocation, which require pausing all running threads. This relies on the JVM's Safepoint mechanism; a Safepoint is like a red light on a highway. Every time the JVM enters an STW (Stop-The-World) phase, it prints a log line like this:

2020-09-10T13:59:43.210+0800: 73032.559: Total time for which application threads were stopped: 0.0002853 seconds, Stopping threads took: 0.0000217 seconds

This line says the STW phase lasted 0.0002853 seconds, of which stopping all the threads (Stopping threads) took 0.0000217 seconds; the former includes the latter. Usually the Stopping threads portion is a tiny fraction of the total; if it is too long, that may be related to code implementation details, but we won't expand on that here.

Back to the problem: a flood of STW log lines naturally makes one think of biased-lock revocation. We waited until the problem recurred and collected complete GC logs from all 3 nodes, and found that both YGC and FGC trigger at very low frequency. That rules out GC itself.

The sheer volume of STW logs made me realize how abnormal this phenomenon was. Some colleagues raised the question: isn't each pause very short? So I wrote a simple tool to count, per minute, the STW pause frequency and total pause time:
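The counting tool itself is not included in the article; a minimal reimplementation of the idea, matched against the JDK 8 log line format shown above, could look like this:

```python
import re
from collections import defaultdict

# Matches lines like:
# 2020-09-10T13:59:43.210+0800: 73032.559: Total time for which application
# threads were stopped: 0.0002853 seconds, ...
STW_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+\+\d{4}: [\d.]+: "
    r"Total time for which application threads were stopped: ([\d.]+) seconds"
)

def stw_per_minute(lines):
    """Return {minute: (pause_count, total_pause_seconds)} from GC log lines."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        m = STW_RE.match(line)
        if m:
            minute, secs = m.group(1), float(m.group(2))
            stats[minute][0] += 1
            stats[minute][1] += secs
    return {minute: (count, total) for minute, (count, total) in stats.items()}
```

Bucketing by the wall-clock minute prefix of the timestamp is enough to spot the kind of per-minute spike described below.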

The statistics show:

  • In normal periods there are about 5 seconds of pauses per minute.
  • Between 11:29 and 11:30, the pause frequency spikes sharply; this is exactly when the symptoms began to appear. The total pause time per minute reaches as high as 20–30 seconds.

To use an analogy: on a 1-kilometer stretch of road that normally has no traffic lights at all, dozens of traffic lights suddenly appear, and traffic grinds to a halt. These pauses neatly explain the phenomenon of "all threads are doing CPU computation, yet CPU resources are very idle".


Investigating Safepoints

There are many types of Safepoint. To identify the specific type, I again asked the on-site colleagues for help, adding the following parameters to jvm.options to enable the JVM log:
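Here too, the original screenshot of the parameters is missing; on JDK 8, safepoint types are typically exposed with flags along these lines (a sketch, not necessarily the exact set used on site):

```
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
```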


While waiting for the problem to recur, I counted the per-minute read and write request rates from the ES run logs:

The read and write request rates were fairly stable, which rules out a sharp increase in user requests as the cause.

When we got the JVM log, we saw a large number of Safepoints of type ForceSafepoint:

whereas the Safepoints related to biased-lock revocation should look like this:

That puzzled me. I started searching the Internet for what triggers ForceSafepoint and found nothing.

Looking at the HotSpot source code, I found at least 5 related scenarios:

So ForceSafepoint is a "mixed bag", rather like the "Others" bucket in a chart. From this point on, the JDK went onto my list of prime suspects.

Continuing with the JVM log analysis: each Safepoint log line also records the total number of threads at that moment (the threads: total column):

Looking back: when the problem occurs, the total thread counts of the three nodes differ significantly, and once the problem starts, the total keeps growing, especially the Lucene Merge Threads.

"Many concurrent Lucene Merge tasks" versus "a surge of ForceSafepoint/STW pauses": which is the cause and which is the effect?

From the JVM log I counted the per-minute ForceSafepoint frequency together with the change in total thread count, and overlaid the two curves to get the chart below:
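The per-minute counts behind those curves can be extracted from the safepoint statistics lines. Assuming the JDK 8 format, where each line looks roughly like `73032.559: ForceSafepoint [ 673 0 0 ] [ 0 0 0 0 0 ] 0` (uptime in seconds, VM operation name, then the `threads: total` value as the first bracketed number), a minimal parser might be:

```python
import re
from collections import Counter

# Assumed, simplified JDK 8 -XX:+PrintSafepointStatistics line shape:
#   <seconds>: <vm-operation> [ <total-threads> ... ] [ ... ] <trap-count>
SAFEPOINT_RE = re.compile(r"^\s*([\d.]+):\s+(\w+)\s+\[\s*(\d+)")

def safepoint_curves(lines):
    """Per-minute count of each safepoint type, plus (time, total-threads) samples."""
    per_minute = Counter()
    thread_samples = []
    for line in lines:
        m = SAFEPOINT_RE.match(line)
        if not m:
            continue
        ts, vmop, total_threads = float(m.group(1)), m.group(2), int(m.group(3))
        per_minute[(int(ts // 60), vmop)] += 1    # bucket by minute of uptime
        thread_samples.append((ts, total_threads))
    return per_minute, thread_samples
```

The two returned series are exactly what gets plotted: safepoint frequency per minute and the total-thread curve over time.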

From the chart, the rise in ForceSafepoints seems to come first, with the thread count climbing afterwards. In other words, the extra STW pauses come first, and only then do Merge task threads start to pile up, as if a batch of traffic lights were suddenly added to a road, and the road then became congested.

At this point I began consulting a colleague on the Kona JDK team. I shared the GC logs and thread dump logs with him and told him what I had found so far; my biggest suspicion was these abnormal ForceSafepoints, and I needed to understand why they occurred.

After a while, he replied: this is an ARM build of the JDK. Lacking an ARM build environment for the moment, he could not immediately provide me with a debug build of an ARM JDK.

It suddenly dawned on me that I had fallen into the trap of preconceptions, even though we had realized from the very beginning that the environment needed investigating.

No wonder it could not be reproduced in my local x86 environment. No wonder searching the web for ForceSafepoint turned up nothing.

In a follow-up phone call with the customer, we learned:

  1. A similar workload running in another x86 environment shows no such problem.
  2. In this ARM environment there is another Elasticsearch cluster, with fewer requests, where no such problem has been found either.
  3. The ARM JDK used in this environment was downloaded from the Internet, and its provenance is unknown.

Regarding the other Elasticsearch cluster mentioned in point 2, I was more interested in whether its GC logs showed a similar pattern. The absence of visible problems there is easy to understand, because this kind of issue is often systemic, emerging only when workload characteristics and environment factors stack up. The on-site colleagues enabled GC and JVM logging for that cluster, and the phenomenon was indeed present there too, only with a lower ForceSafepoint frequency than in the problem cluster.

At this point, the cause of the problem clearly pointed at the ARM environment and the JDK build.

Later, Zang Lin of the microservice platform TSF team stepped in and provided a Kona JDK build with debug information added. Although this build is much slower than the normal build, after swapping it in we found that the problem's recurrence interval grew significantly longer.

With the newest JVM logs, Zang Lin determined that these ForceSafepoints were all related to Inline Cache Buffers. Of course, this may be a problem common to all ARM JDK builds, or it may be specific to the previously downloaded JDK build.

Next, we replaced the environment's JDK with the officially released Kona JDK. After that, the problem never recurred. In other words, switching to Kona JDK solved the problem.

I compared the STW pause counts and pause times after switching to Kona JDK and found an order-of-magnitude improvement:

  • Original JDK build: 5,000–18,000 STW pauses per minute, with a total pause time per minute reaching 10–30 seconds.
  • Kona JDK: single-digit STW pauses per minute, with a total pause time per minute between 100 and 200 ms.

So Kona JDK delivers an order-of-magnitude performance improvement over the original problem JDK build.


Summary and review

Let's retrace the line of analysis for the whole problem:

1. Stack analysis found a large number of threads doing CPU computation while CPU resources sat fairly idle. The large number of Merge Threads was puzzling, but it turned out to be an effect of the problem rather than its cause.

2. GC log analysis found low GC frequency and low GC time, but many STW-related entries in the GC log, so the associated Safepoint type had to be identified.

3. VM/Safepoint log analysis confirmed the Safepoint type was ForceSafepoint. Comparing the phenomenon across different environments and clusters cast suspicion on the JDK build.

4. After switching to Kona JDK, the ForceSafepoint frequency dropped from about 5,000 per minute to single digits, and the problem was solved.

This investigation taught a lesson: survey the cluster environment carefully at the very start of any problem analysis. And of course the biggest lesson of all: don't just download a random JDK!

This article is from the WeChat official account Tencent cloud middleware (gh_6ea1bc2dd5fd).

Source and reprint details are given in the original text; in case of infringement, please contact us for deletion.

Original publication time: 2020-10-21


This article was created by the Tencent cloud middleware team; when reprinting, please include a link to the original. Thank you.