Tech Life Series · My Story with the Data Center (Issue 6): Oracle Memory Over-Consumption Risk Alert


Preface

Time flies. The Tech Life Series · My Story with the Data Center has reached its sixth issue, and Xiao y is back again!

What Xiao y wants to share with you today is the analysis and resolution of a comprehensive, cross-layer problem.

Solving this kind of problem takes more than database knowledge; you also need solid operating system skills.

Along the way it introduces a less common form of optimization: memory optimization.

Because today's issue is a universal one, I suggest you use the method in this article to check whether your own systems have a similar problem. The common risks are summarized at the end of the article.

If you find this case worthwhile, please forward it; I hope it can alert and help more customers.

For more hands-on Oracle database experience and risk alerts, follow the "Zhongyi Antu" official account. Your attention is welcome!

In addition, many friends have recently asked whether Xiao y's team could do some offline sharing sessions.

Indeed! If you can gather some people, and even just a conference room or a coffee shop, then besides face-to-face technical sharing we can talk about life, and get barbecue and beer together when the mood strikes. Making more friends is one of life's blessings!

Since you are interested, let's kick off the first round of offline sharing!

Friends in Beijing, Shanghai, Guangzhou, Shenzhen, and other cities who are interested in offline sharing can email Xiao y at 51994106@qq.com, or add Xiao y on WeChat (shadow-huang-bj), providing your city, name, telephone, company, and position. Once more than 20 people sign up, we will launch free offline sharing events in Beijing, Shanghai, Guangzhou, Shenzhen, Hangzhou, Nanjing, and other cities. You can also join QQ group 227189100 if you are interested; we will hold some online sharing there regularly.

Part 1

The problem arrives

December 28, 2015, Beijing.

8 p.m., just after dinner. The phone rings; it's a call from a telecom-operator customer.

"Xiao y, we had an incident. Between 17:00 and 18:00 this afternoon, the xx database's listeners and instances crashed, although the database is back up now. This business system is very important and leadership is watching closely, so we absolutely must find the cause. Other vendors are already on site but haven't found anything yet. Can you send an engineer over immediately to analyze it? Try to reach a conclusion tonight!"

After taking the call, Xiao y immediately arranged for a colleague to rush to the site. We then learned that a similar problem had occurred the week before.

Environment:

Operating system: AIX 6.1 TL8, 64-bit

Database: Oracle 11.2.0.3, 2-node RAC

Configuration: 16 CPUs, 50 GB of memory

Part 2

The analysis process

After a while the logs arrived, and my energy came right back. Let's look at the logs first to confirm the database crash and listener crash the customer described.

2.1 Confirming the listener and database instance outages

2.1.1

Database alert.log

[Figure: database alert.log excerpt]

From the log you can see:

>> At 17:42:39, the alert log records a key error: a call to LMS, the process responsible for GCS, had gone unanswered for 89 seconds. In other words, the database started running into trouble at 17:41:10.

>> The next entry jumps straight to the database startup at 18:02:48. There are no shutdown or abnormal-termination messages; for those roughly 20 minutes the alert log produced no output at all. The operating system did not reboot during this period.

Preliminary analysis of the cause:

The most common reason for an unanswered LMS call is a bottleneck in system resources (memory/CPU) on the LPAR hosting the database. When operating system resources are tight, it is usually accompanied by:

>> CRS, the cluster resource management process, may also time out while checking resources and start exception handling, such as automatically restarting the resource.

>> When heartbeat checks between RAC nodes go unanswered, then in order to restore the whole cluster's ability to serve, 11g and later first issue a Member Kill, i.e. try to recover by terminating the database instance. If the Member Kill does not resolve it, this escalates to a Node Kill, which reboots the OS to restore service for the whole cluster.
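As a rough illustration of how you might spot this escalation, the sketch below greps invented, ocssd-style log lines for member-kill activity. The timestamps and message text are hypothetical samples for demonstration, not the customer's actual log:

```shell
# Invented ocssd.log-style lines for demonstration only
cat > /tmp/ocssd_sample.log <<'EOF'
2015-12-28 17:47:17.123: [CSSD] Member kill request from remote node 2
2015-12-28 17:47:18.456: [CSSD] processing member kill request
2015-12-28 17:48:02.789: [CSSD] member kill escalation check
EOF

# Count member-kill lines, then show when the requests arrived
grep -ic 'member kill' /tmp/ocssd_sample.log
grep -i 'member kill request' /tmp/ocssd_sample.log | awk '{print $1, $2}'
```

Correlating these timestamps with the alert.log gap (17:47 kill request vs. 18:02 startup) is what exposes the 15-minute stall discussed below.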

Next, let's check the CRSD process log.

From it you can see that resources including the VIP and the listener timed out on their health checks because OS resources were exhausted, after which exception handling began.

2.1.2

Node 1 crsd.log

As shown in the figure below:

At 17:42:30, CRSD checked the health of the VIP; 10 seconds later the check timed out, so the state changed to UNKNOWN (after which CRSD tried to restart it).

[Figure: CRSD log showing the VIP check timing out]

As shown in the figure below:

At 17:51:18, because the VIP was down and the listener depends on the VIP, the listener was requested to stop.

As shown in the figure below:

At 17:54:34, the database resource ora.XXDB.db was detected as abnormally terminated.

2.1.3 ocssd.log

From the log you can see:

As early as 2015-12-28 17:47:17, the OCSSD process received a Member kill request from node 2, asking it to kill the database instance. Yet the database did not actually start until 18:02:48.

That is more than 15 minutes. For things to run that slowly, the database server's resources must already have been extremely tight, and usually only heavy memory paging causes slowness on that scale.

2.2 Operating system performance: the hidden danger was planted long ago

From the analysis above we know that between 17:41:10 and 17:42:39, system resources on database node 1 began to tighten, and everything started to slow down!

2.2.1

Checking CPU usage

[Figure: NMON CPU usage]

You can see:

CPU is not the problem; the CPU peak is not high.

2.2.2

Checking memory paging (pi/po)

[Figure: NMON memory paging statistics]

You can see:

NMON samples every 5 minutes.

The sample after the one at 17:39 should have come at 17:44, but that sample was never collected at all!

This shows that system resources were already extremely tight. This is the most important piece of evidence!

The problem occurred at 17:42; because of the sampling interval it was not captured, which is normal and does not affect the overall judgment of this analysis.

In addition, it is easy to see that before the incident, pageSpace utilization had already reached 12.44%! That means memory had been insufficient for some time. Xiao y's advice: on AIX, whenever you find pageSpace being used at all, be sure to analyze memory usage, or you are sitting on a landmine.
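On AIX you would check this with `lsps -s`. Since that command is AIX-only, the sketch below parses a sample of `lsps -s`-style output (the figures are invented) and applies the rule above that any paging-space usage deserves investigation:

```shell
# Sample `lsps -s`-style output; values invented for illustration
sample='Total Paging Space   Percent Used
      8192MB               12%'

# Extract the percent-used figure from the data line
used=$(printf '%s\n' "$sample" | awk 'NR==2 {gsub(/%/, "", $2); print $2}')
if [ "$used" -gt 0 ]; then
  echo "WARNING: paging space ${used}% used, analyze memory usage"
fi
```

The same one-liner works on real `lsps -s` output once the header line position is confirmed.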

2.2.3

Why did memory paging occur?

[Figure: NMON memory usage (%comp)]

When Xiao y saw this data, it was a real shock!

Before the incident, for example between 16:24 and 17:29, the database server's computational memory (%comp) sat above 90% for a long stretch! For AIX this is very dangerous. Computational memory should be kept below 80%; beyond that, a system is no longer healthy!

At 90%, computational memory is close to the tipping point for paging.

Any slightly larger memory demand will push the system into paging.
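On AIX, `vmstat -v` reports a "percentage of memory used for computational pages" line. The sketch below parses a sample of that line (the 90.3 value is invented) and applies the 80% health threshold mentioned above:

```shell
# Sample AIX `vmstat -v` output line; the value is invented for illustration
line='             90.3 percentage of memory used for computational pages'

# Truncate to an integer percentage and compare against the 80% rule of thumb
comp=$(printf '%s\n' "$line" | awk '{print int($1)}')
if [ "$comp" -ge 80 ]; then
  echo "UNHEALTHY: computational memory at ${comp}%, keep it below 80%"
fi
```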

So who triggered the paging?

In Xiao y's view, when a wall is already falling, it hardly matters who gave it the final push! So that is not the point.

Likewise, there should have been a record at 17:44, but none appeared!

This confirms that at 17:44 the system really was under memory pressure.

By 17:49, computational memory had climbed as high as 97%! (The stall at 17:44 inevitably caused processes to pile up, which in turn drove memory usage even higher.)

2.3 Memory planning: the customer's wishful thinking

The database server has 50 GB of memory. The customer's original memory plan was as follows:

>> SGA configured to 25G

>> PGA configured to 5G

>> Database parameter processes set to 2000, with about 1000 processes in daily operation

Database memory usage can be estimated with the following simple formula:

SGA + PGA + per-process OS-level consumption

Normally, on an 11g database, a single Oracle server process occupies about 3 MB of memory,

so typical memory usage is 25G + 5G + 1000*3M = 33G, or 66% of the LPAR's memory.

If abnormal waits drive all 2000 processes into existence, memory usage becomes 25G + 5G + 2000*3M = 36G. The plan was already cutting it too close.
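The planning arithmetic above can be sanity-checked in a few lines; to match the article's round figures, this sketch treats 1G as 1000M:

```shell
# Customer's original plan, in MB (1G taken as 1000M to match the article's figures)
sga=25000; pga=5000; per_proc=3; total=50000

normal=$((sga + pga + 1000 * per_proc))   # usage with the usual 1000 processes
worst=$((sga + pga + 2000 * per_proc))    # usage with all 2000 processes up
echo "normal=${normal}MB (~$((normal * 100 / total))% of 50G), worst=${worst}MB"
```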

2.4 Analyzing where the memory actually went: reality is cruel

Again, database memory usage can be estimated with the simple formula:

SGA + PGA + per-process OS-level consumption

Here, SGA is a hard limit, but PGA is not (neither the workarea nor the non-workarea portion is hard-limited).

Xiao y pulled the PGA peak from dba_hist_pgastat and found it was 5 GB; it had not broken the PGA parameter limit. So the most likely culprit was the per-process OS-level consumption.

So Xiao y turned to the procmap command to check the memory consumption of a single process,

and found that a single Oracle process occupies 10 MB (summing the second column):

[Figure: procmap output for a single Oracle server process]
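As a sketch, "summing the second column" of procmap output can be done with awk. The three sample lines below are invented for illustration, not the actual output in the figure:

```shell
# Invented procmap-style lines: address, size in KB, permissions, object
cat > /tmp/procmap_sample.txt <<'EOF'
10000000   4096K  read/exec     oracle
20000000   2048K  read/write    oracle
900000000  1024K  read/exec     libc.a
EOF

# Strip the K suffix, total the sizes, and report in KB and MB
awk '{gsub(/K/, "", $2); sum += $2} END {printf "total %d KB (~%d MB)\n", sum, sum/1024}' /tmp/procmap_sample.txt
```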

At this point, Xiao y already knew the answer!

Dear readers, you can pause here too: summarize the phenomena above and think for a few minutes. If this CASE had landed on you, how would you proceed?

Don't go away, there's more.....

So where did the memory go? Xiao y kept it simple: the svmon -P command shows the details of a process's memory footprint.

[Figure: svmon -P output for an Oracle server process]

You can see:

In each Oracle server process's exclusive work-type memory, the USLA heap portion accounts for 1642 memory pages, at 4 KB per page, i.e. an extra 6-7 MB.
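The arithmetic behind that 6-7 MB figure, using the page count from the svmon output:

```shell
pages=1642; page_kb=4   # USLA heap pages and AIX page size from the svmon output
extra_kb=$((pages * page_kb))
echo "USLA heap: ${extra_kb} KB (~$((extra_kb / 1024)) MB per process)"
```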

This is in fact a known bug in these operating system and database versions.

Zhongyi Antu has run into this problem several times in other data centers.

2.5 Don't underestimate 7 MB of memory

The database server has 50 GB of memory, and the customer's original memory plan was:

>> SGA configured to 25G

>> PGA configured to 5G

>> Database parameter processes set to 2000, with about 1000 processes in daily operation

Now let's redo the math.

Under normal circumstances:

on this 11g database, a single Oracle server process occupies 10 MB, not 3 MB!

So typical memory usage is 25G + 5G + 1000*10M = 40G; the database alone occupies 80% of memory! That is far more dangerous. On AIX, it is advisable to keep the database between 50% and 60% of memory, because the operating system kernel also uses memory, sometimes as much as 40%; heavy kernel inode cache usage is common.

If abnormal waits drive all 2000 processes into existence, memory usage becomes 25G + 5G + 2000*10M = 50G, i.e. all of physical memory.

2.6 Confirming the operating system and database bug

From here, the rest was simple.

Xiao y searched MOS with USLA as the keyword and found the corresponding bug.

The following note on Oracle's support site describes BUG 13443029, in which the USLA heap

causes a single process to occupy roughly 7 MB more memory, growing from 3 MB to 10 MB:

11gR2Aix - Dedicated Server Processes Have Large Usla Heap Segment Compared To Older Versions (Doc ID 1260095.1)

2.7 How to fix it?

The root cause is a flaw in the operating system, and patches are needed on both the operating system and the database:

>> For AIX 6.1 TL07 or AIX 7.1 TL01, you need to patch both the operating system and the database.

>> For AIX 6.1 TL07 SP02 / AIX 7.1 TL01 SP02 or later, the operating system is already fixed, so only database patch 13443029 is needed.

On 11.2.0.3 and below, database patch 13443029 is not included in any PSU; the fix is included in 11.2.0.4.

Part 3

Summary of causes and recommendations

3.1 Cause summary

The database server has 50 GB of memory, planned as follows:

>> SGA configured to 25G

>> PGA configured to 5G

>> Database parameter processes set to 2000

On the 28th, the database averaged about 1000 server processes.

Due to a known flaw in the operating system and database (11gR2Aix - Dedicated Server Processes Have Large Usla Heap Segment Compared To Older Versions, Doc ID 1260095.1), each idle database server process carries roughly 7 MB of extra private memory in its USLA heap.

So the database as a whole occupies 25G + 5G + 1000*10M = 40G, i.e. about 40 GB of computational memory; the database alone accounted for over 80% of memory (normally this should be kept around 60%). Adding kernel and other usage, the server routinely ran at close to 90% computational memory. Running that high, any increase in process count or demand for sort/hash memory triggers paging, dragging the whole system to a crawl.

The most direct evidence that memory exhaustion caused this failure is as follows:

NMON samples every 5 minutes.

The sample after the one at 17:39 should have come at 17:44, but it was never collected!

This shows that system resources were already extremely tight. This is the most important evidence!

When the cluster itself then detected that the VIP and database resources were unresponsive, it began exception handling, which led to the listener, VIP, and database instance failures.

3.2 Recommendations

By fixing the USLA defect in the operating system and database, and by re-planning the database memory parameters, memory usage can be reduced so the system runs at a healthier memory level, solving the problem at its root.

1) Install database patch 13443029, which relinks Oracle with the shrsymtab option, shrinking the USLA portion from about 7 MB to a few dozen KB per process and freeing roughly 7 GB of memory (14 GB if all 2000 processes run).

2) Reduce the database SGA memory parameter sga_target from 25G to 20G.

Notes on the adjustment:

After these two changes, without adding any memory to the LPAR,

the database memory plan becomes 20G + 5G + 1000*3M = 28G, or 20G + 5G + 2000*3M = 31G with all 2000 processes running.

This leaves system memory comfortably provisioned, so occasional private-memory demands will no longer trigger paging.
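Checking the post-tuning arithmetic the same way (1G taken as 1000M, and the per-process footprint back to 3 MB once the patch is applied):

```shell
# Tuned plan, in MB: sga_target reduced to 20G, patch 13443029 applied
sga=20000; pga=5000; per_proc=3

normal=$((sga + pga + 1000 * per_proc))   # usual 1000 processes
worst=$((sga + pga + 2000 * per_proc))    # all 2000 processes, still well within 50G
echo "after tuning: normal=${normal}MB, worst=${worst}MB"
```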

Part 4

Common risk reminders

Based on this case, Xiao y offers some general risk reminders:

Never underestimate the memory footprint of an idle Oracle server process!

When doing memory planning, this is often overlooked!

If your database version is 11gR2 running on AIX 6/7, then your system very likely has this excess memory consumption: each Oracle server process uses roughly 7 MB more than on 10g and earlier versions, growing from 3 MB to 10 MB. For Oracle databases with many server processes, that is deadly.

For example, for a database running 2000 Oracle server processes, the processes consume not 2000*3M = 6G but 2000*10M = 20G, an extra 14G. An unplanned 14 GB will blow past your memory plan and leave the database running in a dangerous state. To check whether you are affected, refer to the case shared in this article!
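The gap called out above is easy to compute from the article's figures (per-process footprint growing from 3 MB to 10 MB; 1G taken as 1000M):

```shell
procs=2000; old_mb=3; new_mb=10   # per-process footprint before/after the USLA defect
echo "expected: $((procs * old_mb / 1000))G, actual: $((procs * new_mb / 1000))G, extra: $((procs * (new_mb - old_mb) / 1000))G"
```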

To recap, below is the Oracle support note describing BUG 13443029, in which the USLA heap

causes a single process to occupy roughly 7 MB more memory, growing from 3 MB to 10 MB:

11gR2Aix - Dedicated Server Processes Have Large Usla Heap Segment Compared To Older Versions (Doc ID 1260095.1)

The root cause is a flaw in the operating system, and patches are needed on both the operating system and the database:

>> For AIX 6.1 TL07 or AIX 7.1 TL01, you need to patch both the operating system and the database.

>> For AIX 6.1 TL07 SP02 / AIX 7.1 TL01 SP02 or later, the operating system is already fixed, so only database patch 13443029 is needed.

On 11.2.0.3 and below, database patch 13443029 is not included in any PSU; the fix is included in 11.2.0.4.

For AIX 6.1 TL07 SP02/AIX 7.1 TL01 SP02 or later, apply patch 13443029

For AIX 6.1 TL07 or AIX 7.1 TL01, install AIX 6.1 TL-07 APAR IV09580, AIX 7.1 TL-01 APAR IV09541, and apply patch 13443029

For other AIX level, apply patch 10190759, this will disable Oracle's online patching mechanism

Note: as of 06/21/2012, fix for bug 13443029 or bug 10190759 are not included in any PSU and the interim patch is needed. Interim patch 10190759 exists on top of most PSU, and patch 13443029 on top of 11.2.0.3 does not conflict with 11.2.0.3.1 PSU and can be applied on top of both 11.2.0.3 base and 11.2.0.3.1 PSU.

About Me