Search this Blog

Friday, November 12, 2010

RMAN 11g suddenly fails on AIX, nothing changed

Okay, after 1.5 hours digging into the system (and my brains) we have solved yet another mysterious RMAN problem.


What was the case?

After a successfull backup last night, rman suddenly fails to allocate a tape channel while trying to backup archivelogs. The tape channel in this case points to a TSM management layer.
The error message shown was:


RMAN> run {
2> allocate channel ch1 type 'sbt_tape' parms
3> 'ENV=(TDPO_OPTFILE=/usr/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
4> backup
5> incremental level 0
6> tag arch_bk
7> format 'arch_%d_t%t_s%s_p%p'
8> (archivelog all delete input );
9> sql 'ALTER SYSTEM ARCHIVE LOG CURRENT';
10> }
11>
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of allocate command on ch1 channel at 11/12/2010 08:34:42
ORA-19554: error allocating device, device type: SBT_TAPE, device name:
ORA-27000: skgfqsbi: failed to initialize storage subsystem (SBT) layer
IBM AIX RISC System/6000 Error: 106: Reserved errno was encountered
Additional information: 7011
ORA-19511: Error received from media manager layer, error text:
SBT error = 7011, errno = 106, sbtopen: system error

When searching on Google or Oracle Support, the combination of ORA-19554, ORA-27000,ORA-27000 and IBM AIX RISC System/6000 Error: 106 usually point to problems with writing the TSM logfiles ( tdpoerror.log, dsmerror.log). Most of the problem then are solved by setting correct permissions for these files, making the Oracle owner able to write these files.

In our case the permissions were OK, and as said above the backup of last night had run successfully.
Trying manually to establish a channel connection also failed.

At this point I had discussedthe problem with the tape-storage guys about what could be wrong here. They told me TSM was working OK, no problem whatsoever.
They agreed it had to be something on the Oracle server, simply because of the fact RMAN suddenly stopped working, even when nothing had changed.

In the tdpo.opt file I suddenly found the TDPO_TRACE_FILE option be turned on.
This tracefile pointed to a file called 'tdpo.out'

In this file I saw the following entry:

11/12/10 08:27:27.086 [1531908] [1] : init2.cpp ( 359): sbtinit(): Rman Trace File: '/opt/wcsprd/ora/diag/rdbms/wcsprd/WCSPRD1/trace/sbtio.log'

This meant that the one attempt this morning that succeeded to allocate a channel manually, logged this entry in the trace file.
All other attempts allocating a channel hadn't logged this message.

This combined with the :

IBM AIX RISC System/6000 Error: 106: Reserved errno was encountered
Additional information: 7011
ORA-19511: Error received from media manager layer, error text:
SBT error = 7011, errno = 106, sbtopen: system error

I went looking into the trace directory where sbtio.log was located.
In this directory a list of 6886 tracefiles was found. In my opinion this should not be a problem while trying to open a file there.

However it was the cause of the problem !!!

After cleaning this directory, using the adrci tool, RMAN now works again.


Conclusion:

RMAN is a very powerfull tool for making online backups.
It also sometimes is a very misterious tool.
It sometimes stops working when nothing has changed in the configuration, displaying an error message that is not always very clear.
The Automatic Diagnostic Repository (ADR) is a very usefull feature of Oracle 11g. It centralizes all database related logfiles into one location, making problem searching easier.
That this tool however can turn into problems for RMAN, I never would have expected.
Luckily Oracle has included the ADRCI ( ADR commandline interface ) that enables the DBA to easily clean out the ADR.
We now have scheduled a daily cleanout of the ADR, simply to avoid RMAN problems


Finally one remark:
In my honest opinion the fact that there were only 6886 files in the trace directory, should not have caused such a problem for RMAN. The Oracle owner's ulimit permissions are all set to unlimited where it comes to 'number of open files'.