Preface

Before you know it, the Technology Life series "The Story of Me and the Data Center" has reached its fourth installment. Xiao Y is back again!

When a stress test of a business system produces a wavy TPS curve like the one below, what would you do?

[Figure: wavy TPS curve during the stress test]

Today Xiao Y (ZhongYi Technology) wants to share the analysis and resolution of exactly this kind of stress-test performance problem. The issue had troubled the customer for quite a long time; fortunately, Xiao Y identified the cause remotely within 10 minutes and helped the customer solve it for good. Note that in this CASE, adjusting database parameters alone was not enough; further analysis and other changes were needed to resolve the problem completely.

To preserve the original flavor and keep the article interesting, Xiao Y will stick to the style of previous analysis reports while also describing some of the thought process behind solving the problem, so that readers can see what Xiao Y's real working state looks like.

What can we learn

Ø An important performance-related parameter that must be adjusted on Oracle Database 11.2.0.3 and above!

Ø How to stay the course, or quickly change direction, after an initial diagnosis fails!

Ø Some experience in keeping the initiative on problems that span multiple teams/departments!

Reminder

If your high-concurrency, transactional OLTP core business system often suffers performance jitter, for example transaction response time suddenly spiking, accompanied by a rising number of AP server ports, active processes and JDBC connections, plus a higher database DB TIME per second, and your Oracle version is 11.2.0.3 or above, then it is very likely related to an important parameter covered in this article! If adjusting the parameter does not solve it, you can contact Xiao Y for a diagnosis (free of charge).

Your likes and shares are what keep Xiao Y motivated to share!

More first-hand practical experience with Oracle databases is shared on the "Zhongyi Antu" official WeChat account. You are welcome to follow it.

Part 1

The problem arrives

At 10 o'clock in the morning, QQ suddenly flashed. A new job had arrived!

"Xiao Y, do you have a second? Can you help me look at an AWR report?"

"I'll call you in a moment to explain the situation."

The QQ message came from a DBA at a large domestic airline. He can solve routine problems himself, so this time it looked like he was really in trouble. They were launching a new key business system this year, and during the pre-launch stress test the application could not meet the required concurrency and response-time targets. The TPS curve during the stress test looked like this:

[Figure: wavy TPS curve during the stress test]

As you can see, TPS during the stress test was wavy and extremely unstable.

The customer had already done a lot of analysis himself. At the resource level, CPU, memory usage and IO all looked normal. He had also found that during the stress test the backend Oracle database showed a large number of abnormal waits, mainly gc-type waits, so he suspected a problem with the private interconnect switch. Unfortunately, the network team could not find any anomaly.

The customer also asked the incumbent Oracle service provider to analyze the problem, but it remained unsolved. Meanwhile the required go-live date was getting closer and closer, and the pressure on the customer kept mounting!

Xiao Y had been in touch with this customer recently and sincerely hoped for a chance to serve them, so when the opportunity came, Xiao Y was naturally fully charged and ready to fight.

Environment overview:

Operating system: Red Hat Linux 64-bit, 64 cores, 128 GB memory

Database: Oracle 11.2.0.3, 2-node RAC

Part 2

The analysis process

2.1 Analyzing the database's DB TIME per second

[Figure: AWR report header showing DB Time and Elapsed time]

Dividing DB Time by Elapsed time shows that DB TIME per second reaches 75, which is extremely abnormal.
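For readers who want to reproduce this number outside the AWR report itself, the sketch below computes DB time per second (that is, average active sessions) from the AWR repository. The snapshot IDs are hypothetical, the interval is assumed to be 3600 seconds, and DB time in dba_hist_sys_time_model is stored in microseconds; querying AWR views also requires the Diagnostics Pack license.

-- Minimal sketch: average active sessions = delta(DB time) / elapsed seconds
select (e.value - b.value) / 1e6 / 3600 as db_time_per_second
from   dba_hist_sys_time_model b,
       dba_hist_sys_time_model e
where  b.stat_name = 'DB time'
and    e.stat_name = 'DB time'
and    b.dbid = e.dbid
and    b.instance_number = e.instance_number
and    b.snap_id = 100   -- hypothetical begin snapshot
and    e.snap_id = 101;  -- hypothetical end snapshot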

It means sessions in the database are experiencing serious waits, so we need to look at the database's top wait events to continue the analysis.

2.2 Analyzing where transaction time is spent (Top 5 wait events)

1) The wait events on node 1 are as follows:

[Figure: node 1 Top 5 wait events]

Event analysis

Ø Among the Oracle top 5 waits, gc buffer busy acquire comes first, accounting for 51.2%, with an astonishing average wait of 277 milliseconds! gc buffer busy acquire means that process A has to wait because another process B has already requested the same data block from node 2 and that request has not yet completed.

Ø In second place is log file sync, accounting for 18.58%, with an average wait of 293 milliseconds. log file sync is the wait a process experiences after issuing a commit, while the lgwr background process persists the change vectors in the log buffer to the redo log.

Ø In third place is DB CPU. Over the one-hour sampling window the total is 24648 seconds, roughly 7 CPUs' worth of time; the server has 64 CPUs, so average CPU utilization is only about 10%. A side note from Xiao Y: usually we want the DB CPU proportion to be as high as possible, because it means SQL statements spend almost no time waiting during execution and respond faster. That does not mean there is no problem, though. For example, SQL with high logical reads can also drive DB CPU up when all the data it touches is in memory; in that case the high-logical-read SQL needs to be tuned to bring DB CPU down.

Ø In fourth place is direct path read, with an average wait as high as 153 milliseconds. direct path read means a process reads data blocks directly into its PGA private memory rather than into the buffer cache shared memory. The IO volume then becomes much larger, because each process reads its own copy even of the same data. If, instead, processes reading the same data read it into shared memory, only one process has to perform the read and the others can work on the blocks already in memory, so the required IO throughput is much smaller.

Ø In fifth place is buffer busy wait, with an astonishing average wait of 499 milliseconds. buffer busy wait occurs when two or more processes conflict on write/write or write/read access to the same data block, that is, hot-block contention. (The latency distribution behind these averages can be checked with the sketch after this list.)
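As an aside, when average waits look this extreme it is worth checking the full latency distribution rather than only the averages. A minimal sketch using the standard v$event_histogram view (event names written as Oracle spells them in 11g):

-- Latency distribution of the key waits since instance startup
select event, wait_time_milli, wait_count
from   v$event_histogram
where  event in ('log file sync', 'gc buffer busy acquire', 'buffer busy waits')
order  by event, wait_time_milli;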

Seeing this, Xiao Y already knew the answer!

But to be rigorous, the wait events of RAC node 2 still had to be given a quick pass as well.

2) The wait events on node 2 are as follows:

[Figure: node 2 Top 5 wait events]

Compared with node 1, there is no buffer busy wait, but there is an additional gc current block busy.

Overall, the wait events of the two nodes differ very little!

2.3 Xiao Y's brainstorming during the first 2 minutes

2.3.1 Is it a problem with the RAC private interconnect?

At this point, some might ask:

"With gc waits this high, wouldn't temporarily shutting down the other RAC node solve the problem?"

First of all, the answer is NO! Secondly, that is a stopgap that can be used in production when gc performance problems become an emergency, but for a case like this the customer obviously would not accept it.

Xiao Y will answer the question from a technical point of view.

First, as shown in the figure below, the private interconnect traffic between the two nodes is only about 3 MB per second, while the two RAC servers use Gigabit switches for the private network.

[Figure: private interconnect traffic statistics]

Secondly, CPU and memory utilization on both RAC nodes is low, so there is no resource shortage that would make one node slow and unable to respond promptly to the other node's gc requests.

If that were the problem, we would usually see gc*congested* type waits (congestion).
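For completeness, it is also worth confirming which network interfaces the cluster actually registers for the interconnect, in case private traffic is accidentally flowing over the public network. This is only a sketch using the standard view, not a step taken in the original case:

-- Which NIC/IP each instance uses for the interconnect, and where that setting came from
select inst_id, name, ip_address, is_public, source
from   gv$cluster_interconnects;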

2.3.2 Is poor SQL efficiency causing the gc/bbw/direct path read problems?

Here bbw stands for buffer busy wait, and gc stands for gc buffer busy acquire and the other gc waits.

Some might say:

With gc waits this high, plus buffer busy wait, if the SQL were efficient enough it would access far fewer data blocks, so processes would issue far fewer gc requests, and the hot-block contention caused by concurrent reads/writes would disappear as well.

The answer is NO!

See the figure below: the application is actually quite well written. Most of the SQL statements are kept under 100 logical reads, and only 3 SQL statements have logical reads in the thousands to tens of thousands. SQL efficiency and logical read volume at this level are nowhere near enough to cause such high gc/bbw waits! Moreover, blaming SQL efficiency cannot explain why log file sync and direct path read also show long average single waits! Heading in the wrong direction will not solve the root problem. In other words, even if you spent more time optimizing those few slightly higher-logical-read SQL statements, the stress-test problem would still be there, because they are not the root cause; for this CASE, optimizing SQL is icing on the cake, not the charcoal in the snow that was needed. (A sketch of how to list the top SQL by logical reads follows after the figure.)

[Figure: SQL ordered by logical reads per execution]
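The check behind that figure can be reproduced with a query along the following lines. This is only a sketch that orders cached SQL by logical reads per execution, not the exact statement used in this case:

-- Top 10 SQL in the shared pool by logical reads per execution (11g-compatible syntax)
select *
from  (select sql_id,
              executions,
              round(buffer_gets / nullif(executions, 0)) as gets_per_exec
       from   v$sql
       where  executions > 0
       order  by gets_per_exec desc)
where rownum <= 10;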

2.3.3 Is direct path read saturating the IO bandwidth?

Some might ask: could the chain of events be the following?

Ø First, direct path read saturates the IO bandwidth.

Explanation: multiple processes read data blocks into PGA private memory instead of the buffer cache shared memory. Assuming multi-block reads of 16 blocks and 8K per block, each process can read roughly 30 MB per second, so 15 or more processes reading concurrently can saturate the HBA card bandwidth. This behaviour can be disabled by setting event 10949 (a sketch follows after this list).

Ø Because the IO bandwidth is saturated, lgwr's log-write response time suffers, which in turn makes log file sync waits long.

Ø And since log file sync is in turn a component of gc and buffer busy wait, the gc and buffer busy wait times rise as well, producing the waits seen in the AWR report?
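For reference, the event mentioned in the explanation above is the one commonly documented for stopping serial direct path reads of full scans in 11g, so that blocks go through the buffer cache again. Treat it as a hedged sketch to be tested first, not something to apply blindly in production:

-- Assumption: event 10949 as commonly documented for 11g serial direct path reads
alter system set events '10949 trace name context forever, level 1';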

First of all, anyone who can come up with that hypothesis is welcome to send Xiao Y a resume: it shows a very deep understanding of the database, plenty of troubleshooting experience, and thinking that has moved beyond piecemeal analysis!

Welcome to the ZhongYi Technology DBA team! Come join us, fight side by side, take on all kinds of tough problems together, and share the rewards together!

Send resumes to 51994106@qq.com

So what does log file sync have to do with gc waits?

Quoting a diagram from the RAC SG, the principle is shown in the figure below:

[Figure: steps of a gc block request, including the lgwr log flush]

From the figure above you can see that for gc requests, the second step includes the lgwr process writing the log. In other words, the log write is effectively a sub-step of a gc request; strictly speaking, that step is called gc log flush sync.

But the answer is still NO!

From the load profile shown below, physical reads are only 498 blocks per second, and each block is 8K, so read IO is only about 4 MB per second. Both IOPS and IO bandwidth are very low, so this is obviously not the problem!

[Figure: load profile section of the AWR report]

2.3.4 Xiao Y quickly locks in the direction of the analysis!

In those two minutes, Xiao Y's mind was racing through exactly the kinds of hypotheses and eliminations described above, linking the symptoms together.

Xiao Y soon settled on the direction of the analysis: focus on the log file sync wait!

The reason is simple. Looking through the top 5 waits, it is not hard to see that they are connected:

log file sync is a component of both gc and buffer busy wait! (See the figure in 2.3.3.)

If the log file sync wait is resolved, the gc* waits and buffer busy wait will naturally come down as well!

And the problem will be solved!

2.4 Focusing on the "log file sync" wait

From the above we already know what the "log file sync" wait event means: it is the wait a process experiences after issuing a commit, while the lgwr background process persists the change vectors in the log buffer to the redo log. The most common causes are therefore either lgwr writing the log slowly, or commits being issued too frequently!

Next, Xiao Y checked these two aspects in turn.

In Oracle, if lgwr writes the log slowly, it shows up as a longer single response time for log file parallel write.
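Besides the AWR figures shown below, the same two events can be compared since instance startup with a quick query; a minimal sketch:

-- Average latency of LGWR's own writes vs. foreground commit waits
select event,
       total_waits,
       round(time_waited_micro / nullif(total_waits, 0) / 1000, 2) as avg_ms
from   v$system_event
where  event in ('log file parallel write', 'log file sync');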

node 1

[Figure: node 1 background wait events, log file parallel write latency]

node 2

[Figure: node 2 background wait events, log file parallel write latency]

You can see that on both nodes, both log file parallel write and gc log flush sync are under 5 milliseconds, with log file parallel write at only 1 millisecond and 3 milliseconds respectively. Slow log writes are ruled out!

Next, check the commit frequency!

As shown in the figure below, there are only 48 transactions (commits/rollbacks) per second!

In the high-concurrency core systems of some large banks that Xiao Y has served, including systems running more than 10,000 transactions per second, log file sync is kept within 10 milliseconds. So 48 transactions per second is a very small number, nowhere near enough to cause such severe waits!

[Figure: load profile showing transactions per second]
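Outside of AWR, the commit rate can also be sampled directly from the cumulative instance statistics: query the counters twice and divide the delta by the interval in seconds. A minimal sketch:

-- Cumulative counters since instance startup; sample twice and take the delta
select name, value
from   v$sysstat
where  name in ('user commits', 'user rollbacks');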

2.5 The cause is basically pinned down, and the first adjustment begins

At this point in the analysis, Xiao Y had found the root cause of the stress-test problem; all that remained was to adjust and verify.

I suggest you pause here for a moment and see whether you have spotted the cause yourself.

So, two minutes after the customer's AWR report arrived, Xiao Y told him:

"I know the reason. Send me the lgwr process trace for a final confirmation, and then we'll start the adjustment."

This is not surprising: Xiao Y ran into this kind of case N times a few years ago, when large numbers of systems were being upgraded to 11g! The customer, however, was amazed; he had not even had time to phone Xiao Y, so how could Xiao Y possibly have worked it out already...

This really should be a standard configuration item before go-live; the symptoms vary, but it is essentially the same problem.

It comes down to the important database parameter Xiao Y mentioned.

If the log file sync wait is long, yet lgwr writes the log quickly and the commit rate is low, then the problem lies in the communication between the committing processes and lgwr.

Key knowledge points:

Starting with 11g, Oracle introduced a polling mechanism for lgwr log writes; previously there was only the post/wait mechanism.

At the same time a hidden parameter, "_use_adaptive_log_file_sync", was introduced to let Oracle switch adaptively between the two mechanisms. Below 11.2.0.3 the parameter defaults to false, meaning only the post/wait mechanism is used.

From 11.2.0.3 onwards the default is true, meaning Oracle switches adaptively between post/wait and polling.

Ø Under post/wait, once lgwr has finished writing out the change vectors in the log buffer, it immediately notifies the committing processes, so log file sync waits are short, but the burden on lgwr is relatively heavy. After all, below 12c there is only one lgwr process, and when many processes commit at the same time, notifying each of them is itself a burden.

Ø Under polling, a committing process, after notifying lgwr to write, goes to sleep and checks on a timeout whether the log buffer contents have been written to disk; lgwr no longer notifies each committing process individually that the write has completed. Polling offloads part of lgwr's work, but it leaves committing processes with longer log file sync waits. For transactional systems this mechanism is highly unsuitable!

Switches between the post/wait and polling mechanisms are recorded by Oracle in the lgwr process trace, as shown below.

After a switch to polling mode, log file sync waits can easily grow long and hurt transaction response times!

Log file sync switching to polling

……

Log file sync switching to post/wait

On Oracle 11.2.0.3 and above, it is recommended to turn off adaptive log file sync and make sure the lgwr process runs under the post/wait mechanism, so that database performance does not suffer large jitters! The command to turn it off is as follows and can be applied online. So here Xiao Y reminds everyone:

alter system set "_use_adaptive_log_file_sync"=false sid='*';
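To verify the parameter before and after the change, the usual way to read a hidden parameter is to join the x$ views while connected as SYSDBA. A minimal sketch (x$ views are Oracle internals, so use them read-only and with care):

-- Current value of the hidden parameter and whether it is still at its default
select i.ksppinm  as parameter_name,
       v.ksppstvl as current_value,
       v.ksppstdf as is_default
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm = '_use_adaptive_log_file_sync';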

That's right: Xiao Y's first adjustment was to set this parameter to false.

2.6 The result of the first adjustment is disappointing!

After the parameter was changed online, the customer restarted the databases on both nodes, just to be safe.

The stress test was then run again, and the newly collected AWR report is shown below!

[Figure: node 1 Top 5 wait events after the first adjustment]

Looking at node 1's AWR report, the gc waits and log file sync waits are still there, and the single waits even look longer!

Was there something wrong with Xiao Y's analysis? Or had Xiao Y run into several problems mixed together this time? Calm down: with RAC problems, remember to look at each node individually. So Xiao Y asked the customer to send node 2's AWR report; node 2's post-adjustment AWR report is shown in the figure below:

[Figure: node 2 Top 5 wait events after the first adjustment]

You can see :

Although waits remain, node 2 no longer shows log file sync waits! This shows the adjustment worked!

Careful readers may also have noticed that node 1's top wait, gc buffer busy acquire, has completely disappeared (because node 2's log file sync is now fast) and has turned into gc buffer busy release. Doesn't that also show the adjustment worked?

At this point, don't worry: because node 1 still has log file sync waits, node 2's gc buffer busy acquire is still there! So next, Xiao Y only needed to focus on solving node 1's log file sync!

2.7 The truth surfaces (doubt everything)

To sum up: after adaptive log file sync was adjusted, the problem was still not fully solved, so we go back to the conventional line of thinking, where the most likely cause is that the lgwr process writes the log slowly! Although the log file parallel write figure in the AWR report is only a few milliseconds, an AWR report is after all just a tool and its numbers are only a reference, so we must stay skeptical of everything and check once more!

This time, let's watch lgwr's log writes in real time by querying the session views directly. A sketch of that kind of query is shown next, and the actual results are in the figure that follows:
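A minimal sketch (not necessarily the exact statement issued in the case):

-- What the LGWR background process on each instance is waiting on right now
select inst_id, sid, program, event, state, seq#, seconds_in_wait
from   gv$session
where  program like '%(LGWR)%';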

[Figure: real-time wait information for the LGWR processes on both nodes]

You can see :

Ø Of the two RAC nodes, only node 1 shows log file parallel write waits, which is consistent with all the previous analysis!

Ø With state = WAITING, the log file parallel write wait keeps the same seq# of 35693 while seconds_in_wait has reached 21 seconds. Put simply, lgwr is taking 21 seconds to complete a single IO!

So we can be sure that something is wrong in the IO subsystem, and the investigation needs to focus on the errors and performance of the fibre links along the IO path, the SAN switches, and the storage array.

2.8 Further proving that the IO path has a problem (cross-department cooperation)

Considering that the customer's storage team/department might not accept evidence of slow IO coming only from the database side, and to push them to investigate more thoroughly, Xiao Y asked the customer to run a command to check the IO status reported by the multipath software. The results are shown in the figure below:

[Figure: multipath software output showing IO errors on node 1]

Node 1 clearly shows IO ERRORs, and the count keeps increasing!

Checking node 2 next, we find no IO ERRORs at all!

Looking back at the earlier analysis: node 2, which has no IO errors, stopped showing log file sync waits once adaptive log file sync was set to false.

With that, the analysis is complete! Everything is now perfectly explained!

We found the cause and gathered plenty of convincing evidence. The customer finally breathed a sigh of relief, no longer worried that the storage team would refuse to accept the conclusion!

2.9 The problem has been satisfactorily solved

Faced with hard evidence, the customer's storage team stopped pushing back and began to investigate seriously, item by item. In the end, replacing the fibre optic cable solved the problem. Below are the wait events from another stress test run after the cable was replaced!

[Figure: Top 5 wait events after replacing the fibre optic cable]

The stress-test TPS curve changed from this wavy shape:

[Figure: TPS curve before the fix (wavy)]

into a nice smooth curve like this:

[Figure: TPS curve after the fix (smooth)]

Part 3

Cause of the problem, summary, and risk warning

3.1 Summary of problems

The reason the airline customer's stress test could not meet its concurrency and response-time targets before go-live was that two problems were mixed together:

1) On Oracle 11.2.0.3, adaptive log file sync is enabled by default. After a switch to polling mode, log file sync waits become longer, and since log file sync is a component of gc and buffer busy wait, this produced large numbers of waits.

After Xiao Y set "_use_adaptive_log_file_sync" to false, the log file sync wait problem was partly resolved.

2) A quality problem with node 1's fibre optic cable caused IO errors and therefore IO retries, which hurt lgwr's log-write performance.

After changing the database parameter away from its default and replacing the fibre optic cable, the problem was solved satisfactorily.

The stress-test TPS curve changed from this wavy shape:

[Figure: TPS curve before the fix (wavy)]

into a nice smooth curve like this:

[Figure: TPS curve after the fix (smooth)]

3.2 Reviewing the key points in solving the problem

1) Do not analyze Oracle wait events in isolation.

In this case, Xiao Y noticed that the wait events shared log file sync as a common thread, and that became the breakthrough.

2) Understand the characteristics and behaviors of different versions of databases

Xiao Y keeps studying 11g's new features and, through plenty of troubleshooting, has come to understand them deeply, so when the log file sync wait appeared, the new feature behind it could be located quickly.

3) Do not trust the AWR report completely; it is only a tool. Doubt everything and verify.

In this case, the figures in the AWR report did not truly reflect lgwr's write performance, so doubt everything.

4) If an Oracle service engineer only knows the database, he may suspect the right component while the other teams simply refuse to accept it. So we have to master more skills, covering the operating system, storage, network and middleware. Of course, finding a service provider with strong all-round capability is also a good choice.

In this case, Xiao Y found direct evidence through the multipath command, and getting the other teams to investigate thoroughly was ultimately the key to solving the problem.

Risk warning

Starting with 11g, Oracle introduced a polling mechanism for lgwr log writes; previously there was only the post/wait mechanism.

At the same time a hidden parameter, "_use_adaptive_log_file_sync", was introduced to let Oracle switch adaptively between the two mechanisms. Below 11.2.0.3 the parameter defaults to false, meaning only the post/wait mechanism is used.

From 11.2.0.3 onwards the default is true, meaning Oracle switches adaptively between post/wait and polling.

Ø Under post/wait, once lgwr has finished writing out the change vectors in the log buffer, it immediately notifies the committing processes, so log file sync waits are short, but the burden on lgwr is relatively heavy. After all, below 12c there is only one lgwr process, and when many processes commit at the same time, notifying each of them is itself a burden.

Ø Under polling, a committing process, after notifying lgwr to write, goes to sleep and checks on a timeout whether the log buffer contents have been written to disk; lgwr no longer notifies each committing process individually that the write has completed. Polling offloads part of lgwr's work, but it leaves committing processes with longer log file sync waits. For transactional systems this mechanism is highly unsuitable!

Switches between the post/wait and polling mechanisms are recorded in the lgwr process trace, as shown below.

After a switch to polling mode, log file sync waits can easily grow long and hurt transaction response times!

Log file sync switching to polling

……

Log file sync switching to post/wait

So Xiao Y reminds everyone here once again:

On Oracle 11.2.0.3 and above, it is recommended to turn off adaptive log file sync and make sure the lgwr process runs under the post/wait mechanism, so that the database does not suffer large performance jitters! The command to turn it off is as follows and can be applied online:
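-- The same command as given in section 2.5
alter system set "_use_adaptive_log_file_sync"=false sid='*';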

About Me