Preface

Before we knew it, "Technical Life: The Story of Me and the Data Center" has reached its second installment. Some readers have started asking who Xiao Y actually is. That does not really matter; what matters more is the technical substance of what is shared and the real risks our customers face. In future installments we will continue with stories about operating systems and middleware.... The name "Xiao Y" carries no special meaning; for now, let it stand for all of us who devote ourselves to data center operations and maintenance!

Topic of this issue

Today Xiao Y wants to discuss a serious question with you:

Is your Oracle RAC truly highly available, or only pseudo highly available?

To put it another way:

When the partition or server hosting one node of an Oracle RAC cluster goes down,

can you pat your chest in front of your boss and say:

"Don't worry, this is Oracle RAC. There is another node! As long as that node can carry the load, it will keep serving the business normally!"

Reading that one more time, do you feel a hint of hesitation?

Xiao Y can also ask the question this way:

The system passed a RAC high-availability test before it went live. But once the cluster nodes have been running for a long time, the load keeps changing, including CPU, memory, and the number of processes, and no further high-availability tests are run along the way. In that situation, if the partition or server hosting one RAC node goes down, can you still pat your chest and say, "The other nodes of my Oracle RAC will definitely keep providing service"?

It is the same question, yet with more context laid out, does your answer become even more hesitant?

Today Xiao Y offers you a real case of "RAC losing its high availability," together with its complete, real analysis process.

From this case you can learn some of the specific factors that cause Oracle RAC high availability to fail.

Xiao Y estimates that many readers still have similar problems lurking in their systems. It is recommended that you use this case as a reference for a detailed inspection to eliminate the hidden danger.


Highlights of the case

This case was genuinely difficult. A great deal of manpower and time was spent trying to find the root cause, and for a while there was no result. When Xiao Y took over the case, with very little information to go on, the analysis reached a deadlock at one point. But by combing through all the clues over and over, Xiao Y finally found a breakthrough in a small detail unrelated to the database and successfully located the cause of the problem. The method itself is worth borrowing.

Part 1

Fault description

The phenomenon: Oracle RAC lost its high availability. Specifically:

1) At around 16:01 in the afternoon, the P595 hosting node 2 of the XX system's database RAC cluster suffered a hardware failure, making the partition where the node 2 database resided unavailable.

2) But starting from 16:01, applications could not connect to node 1, the surviving node of the RAC cluster, either.

The Oracle RAC cluster failed to play its role as a high-availability architecture!

The customer's instruction: the root cause of the problem must be found, so that the system's high-availability architecture can be improved.

Xiao Y understood. For a data center running hundreds of RAC clusters, an incident like this is an enormous risk. Do other systems have the same problem? When will it strike again? If the root cause is not found, how can we possibly extend the lesson from this single point into comprehensive combing, inspection, and prevention across the whole estate?

Environment description:

AIX 5.3

Oracle 10.2, two-node RAC

HACMP + raw devices

So Xiao Y was under considerable pressure when the case came in. Before starting the analysis, Xiao Y had obtained the following information:

1) During the failure, the operations DBA connected to the database on surviving node 1 with sqlplus "/as sysdba", and the session hung.

2) The operations DBA then tried sqlplus -prelim "/as sysdba" on surviving node 1; even with the -prelim option the connection hung, which is a very rare situation.

3) On surviving node 1, crsctl stop crs -f could not stop CRS; the command hung and never finished.

4) On surviving node 1, shutdown -Fr could not restart the operating system either; the command hung, and the partition finally had to be restarted through the HMC, after which the business returned to normal.

Part 2

The analysis process

2.1 Fault analysis ideas

One node in the cluster went down, and the other node could not provide service.

Usually, the reason is that the cluster software failed to complete its cluster-wide state analysis and data reorganization,

leaving the data in an inconsistent state, so the remaining nodes in the cluster could not serve requests.

This environment is deployed on IBM midrange servers, with three layers of cluster software in play:

Oracle RAC / Oracle CRS / IBM HACMP

So we need to confirm whether each of the three clusters completed its reorganization and reconfiguration.

2.2 Confirm whether Oracle RAC completed its reorganization

Looking at the database alert log on node 1, we can see that the RAC cluster completed its reorganization at 16:01:32.

This possibility can be ruled out.

[Figure: node 1 database alert log showing the reconfiguration]
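The way such a reorganization timestamp is pulled out of the alert log can be sketched as follows. The excerpt is synthetic but follows the wording Oracle 10.2 writes on instance membership changes; the real alert log path depends on the installation.

```shell
# Synthetic alert-log excerpt (hedged: wording follows common Oracle 10.2
# output; the timestamps mirror this case). grep -B1 pulls the completion
# line together with the timestamp line immediately above it.
log=$(mktemp)
cat >"$log" <<'EOF'
Tue May 20 16:01:05
Reconfiguration started (old inc 4, new inc 6)
Tue May 20 16:01:32
Reconfiguration complete
EOF
grep -B1 "Reconfiguration complete" "$log"
```

The same one-liner against the real alert log is how the 16:01:32 completion time in the figure was confirmed.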

2.3 Confirm whether Oracle CRS completed its reorganization

From the CRS log we can see that the network heartbeat began timing out at 16:01, the eviction of node 2 started, and at 16:01:27 node 2 left the cluster. This possibility can also be ruled out.

[Figure: CRS log excerpt showing the eviction of node 2]

2.4 Confirm whether IBM HACMP completed its reorganization

The AIX experts analyzed HACMP and found nothing abnormal.

A full check of the whole stack likewise turned up no abnormality.

Since the LPAR hosting node 2's database was down, node 2's VIP should have drifted over to node 1. Checking with the netstat -in command, however, showed that node 2's VIP had NOT floated over to node 1!
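A check like the one just described can be scripted. This is a hedged sketch, not the team's actual tooling: the VIP address is a hypothetical stand-in, and ip/ifconfig are included as fallbacks for systems where netstat -in does not print addresses.

```shell
# Report whether a given VIP address is plumbed on any local interface.
# 203.0.113.99 is a documentation (TEST-NET-3) address used as a stand-in.
vip_present() {
    { netstat -in; ip addr; ifconfig -a; } 2>/dev/null \
        | grep -qw "$1" && echo "VIP present" || echo "VIP absent"
}
vip_present 203.0.113.99
```

In this incident the same check on node 1 reported node 2's VIP absent, which is what pointed the investigation at the CRSD log.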

Taking over a VIP is ultimately the job of the CRSD process, so the next step was to check crsd.log for any exception during the takeover.

2.5 Check node 1's CRS log to confirm the takeover of node 2's VIP

[Figure: node 1 crsd.log excerpt]

2.6 Summary of node 1's CRS log

We can see that:

1) When node 2 went down, node 1's CRS did try to take over node 2's VIP, database, and other resources and start them on node 1, but the calls to the action script racgwrap for check/start/stop all timed out, so the subprocesses were terminated.

2) The check of node 1's own VIP also timed out, as shown below:

/oracle/app/oracle/product/10.2.0/crs/bin/racgwrap(check) timed out for ora.node1.vip! (timeout=60)

3) So CRS resource management was abnormal as well, the main symptom being timeouts when invoking the racgwrap script.

A racgwrap timeout usually means one of two things:

Ø The operating system is performing badly, for example heavy memory paging

Ø Some command invoked during script execution hung
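The timeout mechanism itself is easy to picture. The sketch below is illustrative only (CRS implements this internally, not in shell): run the action in the background and kill it when a fixed budget, 60 seconds for the racgwrap checks above, runs out.

```shell
# Minimal watchdog: run a command, kill it if it exceeds the time budget.
# The exit status of the command (or of the kill) is passed through.
run_with_timeout() {
    limit=$1; shift
    "$@" & pid=$!
    ( sleep "$limit"; kill -9 "$pid" 2>/dev/null ) & watchdog=$!
    wait "$pid"; status=$?
    kill "$watchdog" 2>/dev/null
    return "$status"
}
```

With a 60-second budget, an action script whose internal command never returns overruns every check/start/stop call, which is exactly the pattern of repeated "timed out" messages seen in the CRS log.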

2.7 Check the nmon data for operating system performance

We can see that the May 20 nmon data actually stops at 16:01, the moment of the failure. This suggests that nmon itself may have hit an exception, such as a hung command, while collecting data.

[Figure: nmon data ending at 16:01]

In addition, the monitoring software raised no memory or CPU alarms for the operating system.

2.8 Determine the direction of the analysis

So, should the analysis now focus on the database or on the operating system?

Combing through the clues above, we had good reason to believe that something was wrong with the operating system at the time. The subsequent analysis therefore focused on the operating system level!

2.9 Gather and comb all the clues

1) On the surviving node, sqlplus -prelim could not attach to shared memory

2) kill -9 could not kill some of the processes

A process can only receive a termination signal between atomic calls, for example between two I/O system calls. A process stuck inside an atomic call that never completes therefore cannot be ended even by kill -9.

3) CRS could not take over the failed node's VIP through the script; the calls timed out

4) The CRS script could not even check the surviving node's own VIP, listener, and other resources; those calls timed out too

5) nmon on the surviving node produced no output after the failure point

2.10 The analysis reaches a deadlock

Although the evidence pointed at the operating system, the AIX experts detected no abnormality.

Their conclusion was that the operating system had not been abnormal at the time,

because some crontab scripts they checked had produced output, showing the OS was still working.

As shown in the figure below:

[Figure: crontab script output]

2.11 How to find a breakthrough

Xiao Y's analysis pointed at the operating system, but the operating system experts checked and denied any abnormality.

The two sides held different opinions on why nmon on the surviving node had stopped writing.

At this point the analysis was at a standstill, and Xiao Y began to think: how to continue the analysis? How to prove that the operating system had an exception?

1) Without a concrete pointer, it is very hard for the operating system team to find whatever abnormality exists

2) With the analysis stalled, finding a breakthrough becomes the key

3) Stay firm in the belief that the operating-system direction is right

Go back to the origin: re-comb and re-verify every clue, and ask whether anything important has been missed.

[Figure: re-combing the clues]

Re-examining the clues turned up a major discovery!

[Figure: inspection script log excerpt]

We can see that:

In the samples from 16:02 onwards, the inspection stopped right after printing "The file system result is as follows:", and the SYSTEM.SH_RUN_COMPLETE keyword that marks the end of a script run was never written. This shows that the operating system was hitting an exception even when executing non-database commands!
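The pattern that exposed the hang can be sketched as follows. The banner and keyword are taken from the case's log; the real inspection script surely did more, so treat this as a minimal illustration.

```shell
# Each step prints a banner first, and the script prints a completion
# keyword last. A log that ends at a banner with no completion keyword
# pinpoints the command that hung - exactly how df was caught here.
inspect() {
    echo "The file system result is as follows:"
    df -P /                      # the call that hung on the dead NFS mount
    echo "SYSTEM.SH_RUN_COMPLETE"
}
inspect
```

At 16:02 the log ended at the banner with no completion keyword, which is what finally proved that the operating system, not the database, was stuck.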

2.12 Identify the exact command the shell script hung on

Examining the shell script showed that the place where it hung, right after "The file system result is as follows:", is simply a call to the df command to list the file systems.

So under what circumstances does this operation hang?

The answer: when an NFS file system is in use.

The XX system's database cluster used an NFS file system: node 2's /arch2 file system was NFS-mounted onto node 1 as /arch2. After node 2's hardware failure, node 1 could no longer communicate with the NFS server on node 2, and as a result the df command on node 1 hung while examining the file systems.

This also explains why node 1's nmon data stopped: nmon was stuck in the same hanging df command.

But what does this have to do with node 1 being unable to accept database connections?

When Xiao Y saw the hanging df command, tears welled up!

Every phenomenon could now be explained! And when every phenomenon is explained, the heart is at ease: it means the root cause has been found, and the preventive measures built on it will be sound!

2.13 The missing NFS mount point and the failure to connect to the database

When a connection to the database is made, the current working directory (pwd) must be obtained.

But because of a defect in the internal implementation of pwd in some versions of the AIX operating system, the call has to recurse all the way to the root directory /, checking the permissions and types of the directories and files along the way.

When the walk reaches an NFS mount, and the NFS file system is mounted hard/background, an unavailable NFS server inevitably makes the check of the NFS directory hang. The pwd output can then never be obtained, and the connection to the database fails!
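The recursion described above can be made concrete with a sketch of the classic get_cwd algorithm (illustrative only; the real AIX routine is a C library call, and GNU/busybox stat syntax is assumed here): identify the current directory by device and inode, scan .. for the entry that matches, prepend that name, and repeat up to /. Because the scan stats every entry it examines, a dead hard-mounted NFS mount point anywhere along the way hangs the whole walk.

```shell
# Classic pwd reconstruction by walking up and matching device:inode.
# Every candidate entry in ".." gets stat()ed - that stat is where a
# hard-mounted NFS directory with a dead server blocks forever.
my_pwd() (
    path=""
    while :; do
        here=$(stat -c '%d:%i' .)
        [ "$here" = "$(stat -c '%d:%i' ..)" ] && break  # "."==".." only at /
        name=""
        for entry in ../* ../.[!.]* ../..?*; do
            [ -e "$entry" ] || continue
            if [ "$(stat -c '%d:%i' "$entry" 2>/dev/null)" = "$here" ]; then
                name=${entry#../}; break
            fi
        done
        [ -n "$name" ] || break
        path="/$name$path"
        cd .. || break
    done
    printf '%s\n' "${path:-/}"
)
```

On a healthy system this returns instantly; in the incident, the equivalent walk inside AIX's get_cwd reached the dead /arch2 mount while scanning / and never came back, so every caller of pwd, from sqlplus to racgwrap, hung.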

2.14 Solving all the mysteries

1) On the surviving node, sqlplus -prelim could not attach to shared memory

While obtaining the current working directory, get_cwd (pwd) hung because the NFS mount point was gone

2) kill -9 could not kill some of the processes

Those processes were performing I/O against the NFS mount point and hung there, so they never had the chance to receive the termination signal

3) CRS could not take over the failed node's VIP through the script; the calls timed out

The racgwrap script invokes the pwd command

4) The CRS script could not check the surviving node's own VIP, listener, and other resources; those calls timed out too

The racgwrap script invokes the pwd command

5) nmon on the surviving node produced no output after the failure point

The lost NFS mount point made the df command invoked by nmon hang

2.15 Further analysis

1. Under this mechanism, if the root directory / contains a large number of small files or directories, pwd (get_cwd) performance will be very poor.

2. Reproducing the problem in a test environment

Mounting node 2's /testfs onto node 1 as /testfs and then stopping node 2's NFS service failed to reproduce the hang. The reason: pwd's output was /oracle, and since the letter o sorts before t, the scan returned the pwd result and exited before it ever touched /testfs.

Mounting node 2's /aa onto node 1 as /aa and stopping node 2's NFS service also failed to reproduce it. Comparing the two systems with the truss command showed that production and test behaved differently when scanning the root directory /. A check of the OS level then revealed that the test environment ran a newer version of the operating system.

3. The newer OS version cannot reproduce the problem while the older one can, which means the operating system implementation had been fixed and enhanced. Searching ibm.com for "nfs hang" turns up the APAR that IBM released to fix the issue.

Part 3

Cause summary and recommendations

3.1 Cause summary

1. The P595 hosting RAC cluster node 2 suffered a hardware failure, making node 2's LPAR unavailable.

This in turn made the NFS mount point disappear.

2. Logging in to the database requires obtaining the current working directory (pwd).

3. But because of a defect in the internal pwd implementation of some AIX versions, the call has to recurse up to the root directory /, checking the permissions and types of the directories and files along the way.

4. When the walk reached the NFS mount point directory /arch2, which was mounted hard/background, the unavailable NFS server inevitably made the check of that directory hang. The pwd output could never be obtained, so connecting to the database failed!

The loss of a hard/background-mounted NFS server is the root cause of this RAC cluster becoming a pseudo cluster!

All the fault phenomena can be explained, as follows:

1. On the surviving node, sqlplus -prelim could not attach to shared memory

While obtaining the current working directory, get_cwd (pwd) hung because the NFS mount point was gone

2. kill -9 could not kill some of the processes

Those processes were performing I/O against the NFS mount point and hung there, so they never had the chance to receive the termination signal

3. CRS could not take over the failed node's VIP through the script; the calls timed out

The racgwrap script invokes the pwd command

4. The CRS script could not check the surviving node's own VIP, listener, and other resources; those calls timed out too

The racgwrap script invokes the pwd command

5. nmon on the surviving node produced no output after the failure point

The lost NFS mount point made the df command invoked by nmon hang

3.2 Solutions and recommendations

1) If NFS really must be used, mount it on a second-level directory, for example /home/arch2 rather than /arch2

2) Use GPFS instead of NFS

3) When providing fault clues, do not filter out information just because personal experience says it is unimportant

4) Install the AIX APAR that changes the operating system's internal implementation of the get_cwd (pwd) call

5) If the hanging df command encountered during the incident had been fed back to Xiao Y at the very beginning, the case would have unraveled immediately, with no further investigation needed!

6) A high-availability test before go-live alone is not enough. A system goes through a series of changes afterwards, and some of those changes may quietly cost the RAC its redundancy. Regular high-availability tests are recommended; for example, restart the instances and servers one by one during a change window to verify.
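Recommendation 1) can be illustrated as a configuration fragment. Hedged: node2:/arch2 follows the case's naming, and the option list shows only the hard/bg flags discussed in this case; check the exact mount options against your own standards before use.

```shell
# Risky: a hard/bg NFS mount directly under "/" - the root-directory walk
# done by the defective get_cwd will stat it and hang if the server dies:
#   mount -o hard,bg node2:/arch2 /arch2
# Safer: mount one level down, so a scan of "/" only touches the local
# directory /home and never stats the NFS mount itself:
mkdir -p /home/arch2
mount -o hard,bg node2:/arch2 /home/arch2
```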

About Me