Tuesday, May 11, 2010

Monitoring Cluster Resources

Who has never wanted an automatic script for monitoring the RAC resources?

I did!

Okay, OEM does monitor the systems and reports whenever there is a problem, but during the installation period of our RAC cluster we used to have cluster resources dying on us for no reason at all.
OEM never saw them die. Cluster resources like nodeapps, the TAF service or even ASM sometimes died.
In the end it turned out to be an incompatibility between Oracle versions, or the fact that the installation had somehow not been completely successful (we later found out that the Inventory on node2 did not know of all the installed Oracle installations).

In that period I wrote a script that checks for cluster resources being unintentionally OFFLINE. It checks the intended state, and whenever a resource is unintentionally down, it sends an Email to a given Email list.
Outside office hours it also sends Emails to "mobile phone addresses".
This of course requires a mail server that is able to send SMS text messages, but that is another step.
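The core of the check can be tried without a cluster. The sketch below is only an illustration: the sample text stands in for the filtered crs_stat output, assuming three lines per resource in the order name, target, state, and the resource and node names as well as the function name find_unintended_offline are made up.

```shell
#!/bin/sh
# Sketch only: sample text stands in for filtered `crs_stat -f` output.
# Resource names and node names below are invented for illustration.
sample='NAME=ora.node1.vip
TARGET=ONLINE
STATE=ONLINE on node1
NAME=ora.node2.ASM2.asm
TARGET=ONLINE
STATE=OFFLINE
NAME=ora.testdb.db
TARGET=OFFLINE
STATE=OFFLINE'

# Hypothetical helper: reads name/target/state triples from stdin and prints
# the name line of every resource that should be ONLINE but is OFFLINE.
find_unintended_offline() {
    while read name && read target && read state
    do
        case "$target" in
            *OFFLINE*) continue ;;   # intentionally down, not a problem
        esac
        case "$state" in
            *OFFLINE*) echo "$name" ;;
        esac
    done
}

down=`printf '%s\n' "$sample" | find_unintended_offline`
echo "$down"
```

Only the second resource is reported: the first is up, and the third is down on purpose.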

The script is shown below. Feel free to copy it and/or adapt it to your needs. I am also open to your hints and tips.

Note that on our system the CRS owner=crsprd and we have a separate location for tnsnames.ora.

The script:

##################################################################################
#
# Check if all Resources are online. If not report by Email or SMS
#
#
##################################################################################
export ORACLE_BASE=/opt/$LOGNAME/ora
export ORACLE_HOME=$ORACLE_BASE/11.1.0
export CRS_HOME=/opt/crsprd/ora/11.1.0
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$CRS_HOME/lib:$ORACLE_HOME/lib32:$CRS_HOME/lib32
export LIBPATH=$LD_LIBRARY_PATH
export PATH=$ORACLE_HOME/bin:$CRS_HOME/bin:$PATH
export TNS_ADMIN=/opt/oraadmin/network/admin

mailto=
smsto=
export workfile=/tmp/CheckCRSResource.wrk
export logfile=/tmp/CheckCRSResource.log


CRSSTAT=`$CRS_HOME/bin/crs_stat -t | grep -c OFFLINE`
HOUR=`date +%H`
NODELIST=`$CRS_HOME/bin/olsnodes`
OFFLINEENTRYFOUND=FALSE
COLLENGTH=47 # For lining up the output

# Checks in the root crontab are executed every 5 minutes.
# However, an SMS every 5 minutes can be very annoying.
# Therefore this script keeps track of time and the number of SMSs sent.
# It sends an SMS the first time, then after 5 minutes, then after 10, then after 15, etc., increasing the delay every time.
if [ "$CRSSTAT" -gt "0" ]
then
    printf 'Below ClusterResources have problems:\n\n' > /tmp/CheckResource.log
    # Put detailed info into a temporary file
    $CRS_HOME/bin/crs_stat -f | grep -E "NAME|TARGET|STATE" | grep -v USR_ORA > /tmp/crs_stat.$$

    # Now start reading the file, three lines per resource: name, target, state.
    # Reading via redirection (not a pipe) keeps OFFLINEENTRYFOUND visible after
    # the loop in shells that run piped loops in a subshell.
    while read RESOURCE
    do
        read INTENDED
        read CURSTATE
        ISOFFLINE=`echo "$CURSTATE" | grep OFFLINE`
        INTENDEDONLINE=`echo "$INTENDED" | grep ONLINE`
        if [ -n "$ISOFFLINE" ] && [ -n "$INTENDEDONLINE" ]
        then
            # This resource seems to be offline, however it should be online
            OFFLINEENTRYFOUND=TRUE
            # Pad the resource name to COLLENGTH characters to line up the output
            printf "%-${COLLENGTH}s intended state ONLINE currently OFFLINE\n\n" "$RESOURCE" >> /tmp/CheckResource.log
        fi
    done < /tmp/crs_stat.$$
    if [ "$OFFLINEENTRYFOUND" = "TRUE" ]
    then
        if [ "$HOUR" -gt "7" -a "$HOUR" -lt "17" ]
        then
            # Office hours: Email only
            printf 'cursec greater than countsec: Mail sent...\n\n' >> $logfile
            mail -s "CRS Resource down at RAC" $mailto < /tmp/CheckResource.log
        else
            # Outside office hours: Email and SMS
            cat /tmp/CheckResource.log
            mail -s "CRS Resource down at RAC" $mailto < /tmp/CheckResource.log
            mail -s "CRS Resource down at RAC: Check Email" $smsto < /tmp/CheckResource.log
        fi
    fi
fi
rm -f /tmp/crs_stat.$$ /tmp/CheckResource.log
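
The comments in the script mention an SMS delay that grows with every message (first time, then after 5 minutes, then 10, then 15, and so on), but the listing above does not contain that part. One way it could be done is sketched below. This is an assumption on my side, not the original code: the function names should_send and reset_throttle, the state file and its "last-send-time count" layout are all hypothetical.

```shell
#!/bin/sh
# Hypothetical throttle helper, not part of the original script.
STEP=300   # seconds between cron runs (5 minutes)

# should_send NOW STATEFILE
# Returns 0 and records the send when an SMS is due, returns 1 otherwise.
# The state file holds "<time of last SMS> <number of SMSs sent so far>".
should_send() {
    now=$1
    state=$2
    if [ ! -f "$state" ]; then
        echo "$now 1" > "$state"          # first alert: send immediately
        return 0
    fi
    read last count < "$state"
    delay=$((count * STEP))               # 5, 10, 15, ... minutes
    elapsed=$((now - last))
    if [ "$elapsed" -ge "$delay" ]; then
        echo "$now $((count + 1))" > "$state"
        return 0
    fi
    return 1
}

# reset_throttle STATEFILE
# Call this when no unintended OFFLINE resource is left.
reset_throttle() {
    rm -f "$1"
}

# Demo: simulate a cron run every 5 minutes while the problem persists.
state=/tmp/sms_throttle_demo.$$
reset_throttle "$state"
sent=""
t=0
while [ "$t" -le 3000 ]
do
    if should_send "$t" "$state"; then
        sent="${sent:+$sent }$t"
    fi
    t=$((t + STEP))
done
echo "SMS sent at seconds: $sent"
reset_throttle "$state"
```

In the demo an SMS goes out at 0, 300, 900, 1800 and 3000 seconds, so the gaps are 5, 10, 15 and 20 minutes. In the real script the first argument would be the current time in seconds, for example from date +%s where the local date command supports it, and the SMS mail line would only run when should_send succeeds.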