Case study: a production outage caused by a Dubbo 2.7.12 bug

Insect catching master 2021-10-14 06:58:27


Late one night recently, just after I had taken a bath, I got a call from a business team: their Dubbo service was down, and they wanted me to help investigate.

On the phone, I asked them a few questions first:

  • Is this fault affecting production? Yes.
  • Has the damage been contained? Yes.
  • Was the scene of the failure preserved? No.

So I turned on my computer, connected to the VPN, and started looking into it. To make things easier to follow, the architecture is simplified as follows:


Only three services matter here, A, B, and C, and all calls between them are Dubbo calls.

During the failure, several machines of service B hung completely and could not handle any requests. The remaining healthy machines saw a surge in requests and a rise in latency, as shown in the figures (Figure 1: request volume; Figure 2: latency).


Troubleshooting

Since the scene had already been destroyed, all I could do at first was look at the monitoring and the logs.

  • Monitoring

Besides the monitoring above, I looked at the CPU and memory of service B. On the machines that failed, memory usage had grown considerably, reaching the 80% line on all of them, and CPU consumption had risen as well.


At this point I suspected a memory problem, so I looked at the JVM full GC monitoring.


Sure enough, full GC time had gone up a lot. It was fairly safe to conclude that a memory leak had made the service unavailable. Why there was a memory leak, though, I had no clue yet.

  • Logs

I applied for access to the machines, checked the logs, and found a very strange WARN log:

[dubbo-future-timeout-thread-1] WARN org.apache.dubbo.common.timer.HashedWheelTimer$HashedWheelTimeout
-  [DUBBO] An exception was thrown by TimerTask., dubbo version: 2.7.12, current host:
rejected from java.util.concurrent.ThreadPoolExecutor@7a9f0e84[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 21]

From this we can see that the business team was using Dubbo 2.7.12.

I searched Dubbo's GitHub repository for this log and found the following issue:


That issue was quickly ruled out, because the fix is already included in 2.7.12.

I kept looking and found these two issues:

Both the error and the version matched exactly, but neither issue mentioned a memory problem. I set the memory problem aside for the moment and tried to reproduce the failure by following issue #8188.


The issue describes clearly how to reproduce it, so I set up the three services and tried. It did not reproduce at first, so I worked backwards from the code of the fix.


The problem is in the deleted part of the code, but that branch is hard to reach. How do we get into it?

Here a future represents a request, and the branch is only entered when the request has not completed. That makes it easy: if the provider never returns, the condition is guaranteed to hold. So I added test code on the provider side.
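
The test code itself is not reproduced in this text. A minimal sketch of a provider method that never returns could look like the following; `BlockingDemoService`, `sayHello`, and `BlockingDemo` are hypothetical names chosen for illustration and are not taken from the original services.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical provider-side implementation, used only to reproduce the issue.
class BlockingDemoService {
    // Never counted down, so await() blocks until the thread is interrupted.
    private final CountDownLatch never = new CountDownLatch(1);

    String sayHello(String name) throws InterruptedException {
        never.await();
        return "hello " + name; // only reached if the latch were ever counted down
    }
}

class BlockingDemo {
    // Returns true if the call is still pending after waitMillis.
    static boolean stillBlockedAfter(long waitMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> call = pool.submit(() -> new BlockingDemoService().sayHello("dubbo"));
            try {
                call.get(waitMillis, TimeUnit.MILLISECONDS);
                return false; // the call returned, so it was not blocked
            } catch (TimeoutException e) {
                return true;  // no result within the wait: the provider is blocked
            } catch (Exception e) {
                return false;
            }
        } finally {
            pool.shutdownNow(); // interrupts the blocked thread so the JVM can exit
        }
    }
}
```

From the consumer's point of view, such a provider keeps every request pending, which is what the branch in question requires.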


With that test code in place, it reproduced. As the issue says, when the first provider is killed with kill -9, the consumer's global ExecutorService is shut down; when the second provider is killed with kill -9, SHARED_EXECUTOR is shut down as well.

So what is this thread pool used for?

It is used in HashedWheelTimer to detect whether a consumer request has timed out.

HashedWheelTimer is Dubbo's implementation of a time wheel, used to detect whether requests have timed out. I won't go into the details here; the time wheel algorithm in Dubbo may get a detailed article of its own another day.

When a request is sent and returns normally, all is well. But if it has not returned once the configured timeout has passed, a task in this thread pool has to detect that and interrupt the timed-out request.

The following code is where the timer task is run. Once the thread pool has been shut down, submitting work to it throws an exception, which is caught and logged here, so the timeout is never detected.

public void expire() {
    if (!compareAndSetState(ST_INIT, ST_EXPIRED)) {
        return;
    }
    try {
        task.run(this);
    } catch (Throwable t) {
        if (logger.isWarnEnabled()) {
            logger.warn("An exception was thrown by " + TimerTask.class.getSimpleName() + '.', t);
        }
    }
}
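
The effect can be shown with the JDK alone. The sketch below is a simplified model, not Dubbo's code: `TimeoutCheckSketch` and its methods are hypothetical names, and a `CompletableFuture` stands in for the pending request. It shows that an executor that has been shut down rejects new work, so the pending request is never marked as timed out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutCheckSketch {
    // Models a timeout check that hands the "fail this request" work to an executor.
    // Returns true if the work was accepted, false if the executor rejected it.
    static boolean notifyTimeout(ExecutorService executor, CompletableFuture<String> pending) {
        try {
            executor.execute(() ->
                    pending.completeExceptionally(new TimeoutException("request timed out")));
            return true;
        } catch (RejectedExecutionException e) {
            // Comparable to the WARN log: the exception is reported, the request stays pending.
            System.out.println("WARN rejected: " + e.getMessage());
            return false;
        }
    }

    // Runs the check against a live or an already shut down executor and
    // reports whether the pending request ended up completed.
    static boolean completedAfterCheck(boolean shutDownFirst) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CompletableFuture<String> pending = new CompletableFuture<>();
        if (shutDownFirst) {
            executor.shutdown();
        }
        notifyTimeout(executor, pending);
        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pending.isDone();
    }
}
```

With a live executor the request is completed exceptionally; with a shut down executor it remains pending indefinitely.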

At this point it dawned on me: if requests keep being sent and never time out, could that blow up the memory? So I ran the simulation again, this time with 3 threads continuously calling the provider. Sure enough, the memory blow-up reproduced, whereas without triggering the bug, memory stayed stable at a low level.
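
Why memory grows can be illustrated with a simplified model. It assumes, as a sketch only, that each in-flight request is tracked in a map until a response or a timeout removes it; `PendingRequestsSketch` and its methods are hypothetical and do not reproduce Dubbo's internals.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class PendingRequestsSketch {
    // One entry per in-flight request, keyed by request id.
    private final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    // Registers a new in-flight request and returns its id.
    long send() {
        long id = nextId.incrementAndGet();
        pending.put(id, new CompletableFuture<>());
        return id;
    }

    // Called when a response arrives or the timeout fires; this is the only cleanup path.
    void finish(long id, String result) {
        CompletableFuture<String> future = pending.remove(id);
        if (future != null) {
            future.complete(result);
        }
    }

    int pendingCount() {
        return pending.size();
    }
}
```

If the provider never responds and the timeout check can no longer run, `finish` is never called, so every new request adds an entry that is never released.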


I used arthas to watch the memory changes, which was very convenient.

Conclusion

After reproducing it locally, I went to confirm with the business team. The conditions for triggering this problem are fairly strict: first, the call must be asynchronous; second, a provider must go offline abnormally; and finally, a provider must be blocked, that is, its requests never return.

The business team confirmed the asynchronous calls. A provider going offline abnormally is fairly common; it happens when a container drifts because its physical machine fails. The blocked provider was also confirmed by the business team: a machine of service C did freeze around that time and could not process requests, although its process was still alive.

So the problem was caused by a bug in Dubbo 2.7.12. The bug was introduced in 2.7.10 and fixed in 2.7.13.


It took almost 1 day to locate and reproduce the problem. It went fairly smoothly, with good luck and no detours, but a few points deserve attention.

  • It is best to preserve the scene while containing the damage. Had we dumped the memory before restarting, or taken a machine out of traffic to keep it as evidence, the problem might have been located faster. Configuring the JVM to dump memory automatically on OOM is another such measure. This was a shortcoming in how this incident was handled.
  • Service observability is very important. Logs, monitoring, and everything else should be complete: at a minimum, logs, monitoring of outbound and inbound requests, machine metrics (memory, CPU, network, etc.), and JVM monitoring (thread pools, GC, etc.). This part was fine; basically everything that should have been there was there.
  • For open source products, search the web using key log lines. Chances are that the problem you hit has been hit by others too. This was the lucky part this time and saved a lot of detours.
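
On the first point, HotSpot JVMs can write a heap dump automatically on OOM when started with `-XX:+HeapDumpOnOutOfMemoryError` (optionally with `-XX:HeapDumpPath=<path>`). A dump can also be taken on demand before a restart. The sketch below assumes a HotSpot JVM and uses the JDK's `HotSpotDiagnosticMXBean`; the class name `HeapDumpSketch` and the file name are illustrative choices.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

class HeapDumpSketch {
    // Writes a heap dump of the current JVM into a fresh temp directory
    // and returns the path of the .hprof file.
    static Path dumpHeap() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            Path file = dir.resolve("before-restart.hprof"); // must not exist yet
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(file.toString(), true); // true = dump only live objects
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting file can be opened in a heap analysis tool after the machine has been restarted.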

WeChat official account "Master bug catcher": back-end technology sharing, architecture design, performance optimization, source code reading, troubleshooting, and lessons learned from pitfalls.
