Monday, November 28, 2011

Tablespace reorganization

In our data warehouse we often run into the "problem" that tablespaces contain a lot of free space within the tablespace itself. Most of the time this is caused by the fact that the warehouse programs first build up new objects and only afterwards destroy the old ones.
In the future we might need a more definitive solution for this, like converting the tablespaces to uniform extent sizes.

Until then, however, I usually do a tablespace reorganization. This means I pick a block_id in the tablespace's datafile, somewhere around the high-water mark of actual usage.

For this procedure to work, you need a tablespace that can temporarily store the objects that lie beyond the free space in the tablespace you want to reclaim. This temporary store tablespace should NOT be an Oracle temporary tablespace, but just a permanent tablespace that can temporarily hold these objects.
It is preferred to create a separate tablespace for this, as it can afterwards be removed again.

How the reorganizing works:
E.g. if we have a tablespace of 10 GB in size and its datafile (let's assume we have a single file) has 4 GB of free space, I use the following query to determine the upper-bound block_id:

SQL> select /*+ rule */
            max(block_id)*0.7
       from dba_extents a
      where a.tablespace_name='USERS';

If a tablespace has more datafiles, then you need to specify a certain datafile by adding the file_id column to the query. In that case, first look up the datafile number by joining v$tablespace and v$datafile.
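That lookup is straightforward; a minimal sketch for the USERS tablespace of our example:

SQL> select d.file#, d.name
       from v$tablespace t, v$datafile d
      where t.ts# = d.ts#
        and t.name = 'USERS';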


After the block_id has been calculated, I use the following complex query to generate table move and index rebuild statements:



set pages 9999 lines 200
accept "Reorg tablespace : " &tabsource;
accept "Temporary store tablespace (not TEMP) : " &tabtemp;


select distinct cmd from 
(SELECT  /*+ rule */  
         'alter '
        || DECODE (segment_type
,                 'TABLE SUBPARTITION', 'TABLE'
,                 'TABLE PARTITION', 'TABLE'
,                 'INDEX', 'INDEX'
,                 'INDEX PARTITION', 'INDEX'
,                 'TABLE','TABLE'
,                 'TABLE'                   -- TABLE is the default                  
                    )
       ||' '
       ||owner
       ||'.'
       || segment_name
       || ' '
       || DECODE (segment_type, 'INDEX', 'REBUILD','INDEX PARTITION','REBUILD','MOVE')
       || ' '
       || DECODE (segment_type
,                 'TABLE SUBPARTITION', 'SUBPARTITION'
,                 'TABLE PARTITION', 'PARTITION'
,                 'INDEX PARTITION','PARTITION')
       || ' '
       || partition_name
       || ' tablespace '
       || &tabtemp
       || DECODE (segment_type, 'INDEX', ' ONLINE','INDEX PARTITION',' ONLINE')
       || ';' cmd
  FROM dba_extents
 WHERE tablespace_name = &tabsource
 --AND FILE_ID=10
 AND block_id > [calculated block_id]) ;


In the above query the use of FILE_ID is commented out, because we assumed that we have only one datafile. Otherwise, uncomment it and enter the file_id that you found in v$datafile.


The result of this query is a lot of object-rebuild statements that move tables and table partitions to the tablespace defined by &tabtemp. The indexes and index partitions are rebuilt online into that same tablespace.
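The generated statements look something like this (hypothetical object names):

alter table SCOTT.BIG_TABLE MOVE tablespace REORGTMP;
alter index SCOTT.BIG_TABLE_IX REBUILD tablespace REORGTMP ONLINE;
alter table SCOTT.SALES MOVE PARTITION P201110 tablespace REORGTMP;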

Remember to SAVE the script, as you need it to move everything back to the original tablespace. 

Run the script outside business hours, as moving table objects will lock the tables and invalidate indexes and all dependent packages/procedures.


As soon as the script is finished, purge the tablespace of its 'recyclebin' contents, by issuing as SYS:

SQL> purge tablespace USERS;

Afterwards the tablespace datafile can be shrunk to the desired size, preferably the total datafile size minus the free space we started with.
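A sketch of such a resize (the datafile path is hypothetical; in our 10 GB example with 4 GB free, resizing to around 7 GB leaves some headroom):

SQL> alter database datafile '/u01/oradata/users01.dbf' resize 7G;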
As soon as that is finished, reopen the script you saved, edit it, and change all tablespace names back to the original source tablespace (in this example USERS).
Then rerun the script.

Finally use UTLRP to recompile all invalidated packages/procedures. Also, if you created a separate temporary store tablespace for this procedure, drop it again.
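Running UTLRP from SQL*Plus, connected as SYSDBA, looks like:

SQL> @?/rdbms/admin/utlrp.sql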

In our databases this procedure usually frees a lot of space. Last week, for example, I was able to return 29 GB of free space to the OS.

Copyright: This script can be freely used. If you use it as an explanation on your own site/blog, a link to the source (my blog) is appreciated.


Tuesday, November 8, 2011

Which rights are granted to PUBLIC

During a schema copy action from an old database to a new one, we ran into the problem that certain packages would not compile successfully.
Some dependencies failed, because the schema user in the new database lacked execute privileges on certain SYS packages, like DBMS_LOCK.

The settings of both schemas were identical, so the rights must have been granted in the past via the PUBLIC role.
For those who don't know: PUBLIC is a role used for setting database-global rights; every user automatically has the rights granted to it.


What I needed was a list of GRANT statements for the PUBLIC role that existed in the "old" database but not (yet) in the "new" database.
I started browsing the usual views like DBA_ROLES, DBA_ROLE_PRIVS and DBA_SYS_PRIVS, but nowhere could I find the rights granted to the PUBLIC role.


After spending almost an hour on Google, I came across a set of views called KU$_ROGRANT_VIEW, KU$_SYSGRANT_VIEW and KU$_OBJGRANT_VIEW.

Especially the last one contained the information I was looking for. This view contains columns like GRANTEE, OBJ_NUM and the privilege name PRIVNAME.
Joining this view with DBA_OBJECTS, using the query below, resulted in (at least) the list of all EXECUTE grants for the PUBLIC role.
I was looking for these EXECUTE grants, because the failing packages complained about dependencies, not about missing tables.

The statement:


SQL> SELECT    'grant '
       || a.privname
       || ' on '
       || b.owner
       || '.'
       || b.object_name
       || ' to '
       || a.grantee
       || ';'
  FROM KU$_OBJGRANT_VIEW a, dba_objects b
 WHERE     grantee = 'PUBLIC'
       AND a.obj_num = b.object_id
       AND a.privname='EXECUTE'
       AND b.object_name not like '%/%'
       order by b.object_name;

Using this statement on both the "old" and "new" database I was able to create a list of missing GRANT statements (there were 10 of them) that had to be executed on the "new" database.
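A minimal sketch of capturing that output for comparison (the script file name is hypothetical; run the same on both databases, spooling to different files, and compare the results):

SQL> set pages 0 lines 200 trimspool on
SQL> spool public_grants_old.lst
SQL> @public_execute_grants.sql
SQL> spool off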

Afterwards all packages were compiled successfully.

Friday, October 7, 2011

Unregister RMAN database directly from catalog DB

What DBA has never moved databases to new hardware??
And what DBA has never forgotten to first unregister the database from the RMAN catalog??

If you didn't unregister the database after a database move, then at those crucial moments, months or years later, you might end up facing a problem when trying to e.g. restore that database from backup.

Imagine you're in the middle of a disaster: the production database server crashed, management is breathing down your neck, and you need to restore this database as soon as possible.
You prepare a new server, install Oracle on it and immediately start the restore, but then disaster strikes again:


RMAN> run {
2> allocate channel ch0 device type 'sbt';
3> restore controlfile;
4> }


allocated channel: ch0
channel ch0: sid=147 instance=hkdrt1 devtype=SBT_TAPE
channel ch0: Veritas NetBackup for Oracle - Release 6.5 (2009050106)

Starting restore at 2009-DEC-03 11:57:20

released channel: ch0
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 12/03/2009 11:57:20
RMAN-06004: ORACLE error from recovery catalog database: RMAN-20005: target database name is ambiguous


The restore fails, because RMAN does not know which of the databases registered in the recovery catalog under the given SID needs to be restored.
You should have unregistered the old copy of the database years back, when you moved it to that "new" server which is now totally dead.
You should perhaps have tried a restore back then to make sure it worked!!

But hey, even with management breathing down your neck, everything can still go smoothly.
You have this blog guide to follow:


First of all:
Following this blog, you are directly administrating the RMAN catalog. Make sure you only follow the procedure below; do not manually delete records from any of the tables or change any of the data, as doing so corrupts the catalog, rendering it totally useless.


How to continue?

We are going to use a database package that is installed in this catalog database. The package is called dbms_rcvcat and it has a procedure unregisterdatabase that cleans out all information about a database by providing a database key (db_key) and a database id (db_id).
The statement to do this looks like:

SQL> exec dbms_rcvcat.unregisterdatabase(db_key,db_id);

Now we only need to find these db_key and db_id values.
First connect to the database which contains the RMAN catalog (as the owner of the catalog schema), using whatever SQL*Plus client you like.
Then find the list of databases known to the catalog with this SID:

SQL> select db_key, db_name, 
               reset_time, dbinc_status
               from dbinc where db_name = 'YOURSID';

The result is a list of registered databases with this SID name.


 DB_KEY  DB_NAME  RESET_TIME           DBINC_STATUS
 411792  YOURSID  7-11-2002 14:29:22   PARENT
 411792  YOURSID  24-10-2005 13:40:18  CURRENT
7455112  YOURSID  10-10-2008 21:41:20  CURRENT


In this example a couple of records are shown for the same SID. If correctly configured, a database has only one CURRENT record in the catalog. A CURRENT record represents a database that is - according to the RMAN catalog - currently an active representation of a running database.
Since a database can have only one CURRENT incarnation, the one with the most recent RESET_TIME is the actual CURRENT database registration.
A PARENT incarnation is a database incarnation from which the current incarnation branched following an OPEN RESETLOGS operation. A PARENT incarnation therefore does not always exist.

In the above example the database registration with DB_KEY=7455112 appears to be the actual current database, as it has the most recent reset_time.
This means we are going to unregister the stale registration with db_key 411792.

Now we only need to find the DB_ID that belongs to this database:

SQL> select db_key, db_id 
               from db where db_key= 411792;

The result will look like:


DB_KEY  DB_ID
411792  412102591
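
Both values can also be fetched in one statement; a sketch that joins the DB and DBINC catalog tables shown above:

SQL> select i.db_key, d.db_id, i.db_name, i.reset_time, i.dbinc_status
       from db d, dbinc i
      where d.db_key = i.db_key
        and i.db_name = 'YOURSID'
      order by i.reset_time;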


Now that we know the DB_KEY and the DB_ID, we can use the above mentioned procedure to unregister this database from the catalog:

SQL> exec dbms_rcvcat.unregisterdatabase(411792,412102591);

Don't worry, this statement can run for minutes ( up to 10-15 minutes on our system )!!

Afterwards, run the select from the DBINC table again, and you'll see that the database registration has been removed from the catalog.
If a PARENT record existed for this DB_KEY, that record has also been deleted.


Conclusion:

The above procedure enables you to successfully clean out (unregister) databases from the RMAN catalog that no longer exist.
It is also a very risky procedure, especially if one does not know what he is looking at, and especially if the management team is breathing down your neck.
If you're in a disaster situation, where an immediate restore of a production environment is wanted, it is not a good time for this kind of exercise.
If you enter by mistake the DB_KEY and DB_ID of the latest CURRENT incarnation (the database you need to restore) and accidentally press [Enter], then immediately start thinking about another job!!
Any restore of that particular database by means of RMAN is then no longer possible!!




Thursday, September 29, 2011

ORA-27123: Unable to attach a shared memory segment

This morning we had another brainteaser when starting one of our databases.

The situation was as follows:

We have a TEST/DEV server with a single Oracle 9.2.0.8 installation on it, installed as user "oracle".
Next to that there are several application users, called appt01, appt02, appt03 and appt04, all using this same ORACLE_HOME for their databases.

All databases were running fine this morning. 
We shut down the appt01 database to install a patch.

When trying to restart the database, suddenly this error appeared:

SQL> startup 
ORA-27123: Unable to attach a shared memory segment 
IBM AIX RISC System/6000 Error: 13: Permission denied


Using Google and MOS docs (115753.1 & 167250.1), all information pointed to incorrect file permissions on the "oracle" binary, where the sticky bit has to be set. Although this could have been possible, because the patch installation rebuilds this binary, it did not feel like the correct cause. And indeed, the file permissions looked correct:

oracle@devrsv1% cd $ORACLE_HOME
oracle@devrsv1% ls -l ./bin/oracle
-rwsr-s--x    1 oracle   dba     70518002 Sep 29 12:09 ./bin/oracle

Investigating further, it was soon discovered that all three other databases were able to stop and restart as expected. They showed no problems at all, even though they use the same ORACLE_HOME and settings.

After hours of searching and looking at the system, I finally turned to the OS.
Because the error message showed something about shared memory, I started investigating in that direction.
As user root, we executed:

# ipcs -mob

This command usually shows all active shared memory segments, their sizes and the owners of the segments. In our case, only the output header appeared:


IPC status from /dev/mem as of Thu Sep 29 14:17:01 DFT 2011
T        ID     KEY        MODE       OWNER    GROUP NATTCH     SEGSZ


This showed that something was really wrong with the OS shared memory administration.
The other three databases were running, yet no shared memory was shown for them either; not even the standard OS shared memory segments were listed.

We then decided to restart the complete server. And that solved the problem.
After the restart all databases started again as expected. Also the shared memory segments were correctly shown.
So even though the displayed error message usually points in the direction of the 'sticky bit' on the oracle binary, the problem can also be caused by a system-wide shared memory problem, whatever its cause.

Wednesday, September 28, 2011

Database Index administration

I would like to share some PL/SQL code that enables us DBAs to do index maintenance. In our data warehouse there were a couple of procedures that saved very basic index information.
The indexes were then dropped, and after the warehouse load they were recreated.
The pitfall of these simple procedures was that they could not cope with partitioned indexes and other complex structures. These more complex indexes were afterwards rebuilt as normal non-partitioned indexes.

To be able to also rebuild partitioned and other index structures I rewrote the complete procedure, making effective use of the DBMS_METADATA package available in the SYS account.

This DBMS_METADATA package has a function called GET_DDL that is able to return the DDL of any object in the database, like tablespaces, tables, indexes and others.
If you, for example, would want to know the DDL of the EMP table, simply use the following query:

SQL> select dbms_metadata.get_ddl('TABLE','EMP') from dual;

Of course the above statement needs to be executed as the owner of the table. If you want to execute it as SYS, then you must supply the function with the table owner, like:

SQL> select sys.dbms_metadata.get_ddl('TABLE','EMP','SCOTT') from dual;

The result is a CLOB variable containing the DDL to (re)create the EMP table of user SCOTT:


 CREATE TABLE "SCOTT"."EMP" 
   ( "EMPNO" NUMBER(4,0) NOT NULL ENABLE, 
"ENAME" VARCHAR2(10 CHAR), 
"JOB" VARCHAR2(9 CHAR), 
"MGR" NUMBER(4,0), 
"HIREDATE" DATE, 
"SAL" NUMBER(7,2), 
"COMM" NUMBER(7,2), 
"DEPTNO" NUMBER(2,0), 
CONSTRAINT "EMP_VALID_JOB" CHECK (job in ('CLERK','SALESMAN','MANAGER','ANALYST','PRESIDENT')) ENABLE, 
PRIMARY KEY ("EMPNO")
  USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS 
  STORAGE(INITIAL 65536 NEXT 16384 MINEXTENTS 1 MAXEXTENTS 1017
  PCTINCREASE 50 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
  TABLESPACE "SYSTEM"  ENABLE
   ) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 NOCOMPRESS LOGGING
  STORAGE(INITIAL 65536 NEXT 16384 MINEXTENTS 1 MAXEXTENTS 1017
  PCTINCREASE 50 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
  TABLESPACE "SYSTEM"



The procedures I wrote use a simple table for administration purposes. This table consists of four columns:
INDEX_NAME of type VARCHAR(30)
TABLE_NAME of type VARCHAR(30)
QUERY of type CLOB
STMT_ORDER of type NUMBER(3)

The first three columns are quite self-explanatory. They contain the table_name and index_name of the index being dropped, and of course the query that is needed to recreate the index afterwards.
The last column, STMT_ORDER, determines the order of the queries. This is important, e.g., when creating primary-key indexes with extra constraints: the primary key index needs to exist before a constraint can be added to it.

The DDL structure of this table looks like:

CREATE TABLE WBX_DDL_INDEXES
(
   INDEX_NAME     VARCHAR2 (30 CHAR)
,  TABLE_NAME     VARCHAR2 (30 CHAR)
,  QUERY          CLOB
,  STMT_ORDER     NUMBER (3)
)
LOB (
   QUERY)
   STORE AS
   (
      TABLESPACE USERS
      ENABLE STORAGE IN ROW
      CHUNK 16384
      RETENTION
      NOCACHE LOGGING)
TABLESPACE USERS
LOGGING
NOCOMPRESS
NOCACHE
NOPARALLEL
MONITORING;


CREATE UNIQUE INDEX WBX_DDL_INDEXES_PK
   ON WBX_DDL_INDEXES (INDEX_NAME, TABLE_NAME, STMT_ORDER)
   NOLOGGING
   TABLESPACE TOOLS
   NOPARALLEL;


ALTER TABLE WBX_DDL_INDEXES ADD 
    ( CONSTRAINT WBX_DDL_INDEXES_PK PRIMARY KEY 
      (INDEX_NAME, TABLE_NAME, STMT_ORDER) 
      USING INDEX WBX_DDL_INDEXES_PK ENABLE VALIDATE);

There are two procedures: the first is responsible for dropping a given index, the second for recreating it. Note that the procedures take a table_name as argument, where one would maybe expect a table_name/index_name combination.
However, since we are using this in a large data warehouse, all indexes of a given table are dropped. We therefore don't need to specify the index_name.
The "DROP_INDEX" procedure looks like:

  procedure drop_index
    (p_table      in     varchar2
    )
   is
    q clob;
    l_count number(3);


    -- Use a cursor with a complex statement to determine all indexes and their types
    cursor c1(p_table varchar2) 
    is
        select ui.index_name, ui.table_name, ui.index_type, ui.uniqueness, uc.constraint_type
        from user_indexes ui
        left outer join user_constraints uc
        on ui.index_name=uc.index_name 
        and ui.table_name=uc.table_name
        where ui.table_name=p_table
        and ui.index_type <> 'LOB'
        order by uc.constraint_type;
begin
    l_count := 1;
    for r1 in c1(p_table) 
    loop
        if r1.constraint_type = 'P'
        then
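            -- note: GET_DDL('CONSTRAINT',...) is called with the index name here;
            -- this assumes the PK constraint has the same name as its index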
            q := sys.dbms_metadata.get_ddl('CONSTRAINT', r1.index_name);
            begin
                insert into wbx_ddl_indexes (index_name, table_name, query, stmt_order)
                values ( r1.index_name, p_table, q, l_count);
                exception
                    when dup_val_on_index then
                         null;
            end;
            -- if the index is a primary key, drop it using an "ALTER TABLE.."
            execute immediate 'alter table '||p_table||' drop primary key';   
        else 
            q := sys.dbms_metadata.get_ddl('INDEX', r1.index_name);
            begin
                insert into wbx_ddl_indexes (index_name, table_name, query, stmt_order)
                values ( r1.index_name, p_table, q, l_count);
                exception
                when dup_val_on_index then
                     null;
            end;
            -- if the index is a normal index, use the "DROP INDEX" statement
            execute immediate 'drop index '||r1.index_name;
        end if;


        l_count := l_count+1;        
        commit;               
     end loop;
end; 
     
The "CREATE_INDEX" procedure looks like:

  procedure create_index
    (p_table      in     varchar2
    )
   is
    exec_q varchar2(32767);
    l_index varchar2(30);
    -- Use a cursor to go through all saved index statements
    cursor c1(l_table varchar2) 
    is 
        select index_name, query,stmt_order
        from wbx_ddl_indexes 
        where table_name=l_table
        order by stmt_order asc;
     
begin
    for r1 in c1(p_table) 
    loop
         begin
             select index_name
             into l_index
             from user_indexes where table_name=p_table and index_name=r1.index_name;
         
             exception 
                when no_data_found then
                -- Index does not exist, create it
                select query into exec_q 
                    from wbx_ddl_indexes
                    where index_name=r1.index_name
                    and table_name=p_table
                    and stmt_order=r1.stmt_order; 
                
                execute immediate exec_q;         
         end;
         
         -- As soon as the index is created, update the administration
         delete
           from   wbx_ddl_indexes  
           where  index_name = r1.index_name 
           and table_name=p_table
           and stmt_order=r1.stmt_order;


        commit;
   end loop;
end;
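
A minimal usage sketch for a warehouse load (SALES_FACT is a hypothetical table name, and the procedures are assumed to be accessible in the current schema):

SQL> exec drop_index('SALES_FACT')
-- ... run the warehouse load for this table ...
SQL> exec create_index('SALES_FACT')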


It is too much to explain the procedures themselves in detail here. Anyone who can read PL/SQL code should be able to understand them.

Feel free, however, to post questions or send me an e-mail.

Thursday, July 7, 2011

Database SAN interactions

I would like to address a problem we are seeing on our systems, that is totally impossible from the viewpoint of SAN engineers.


The situation:

We have two Oracle databases.
The first one is an Oracle 10.2.0.4 database, running on a Logical Partition (LPAR) of an IBM pSeries-7 server. The storage of this database is located on an IBM DS6800 SAN box, connected to this server by fiber cards.

The second database is an Oracle 11.1.0.7.4 two-node RAC database, running on two physically separated IBM pSeries-6 servers. The storage of this database is in ASM, but located on the same IBM DS6800 SAN storage server.


So we have two databases, totally physically separated from each other. The only component they share is the SAN box.


Now the problem:

If we put load on database 1 (this is a data warehouse database), database 2 starts suffering from it. Let's have a look at the graphs to see what I mean:

The load on database 1:
The effect on database 2:


(Please Note that the graphs are not perfectly aligned in this blog)

The effect is that when database 1 is especially busy with LGWR (system I/O), database 2 starts to show I/O waits, seen as more and more active sessions with the dark-blue I/O color in Grid Control. The effect stops as soon as database 1 reduces (finishes) its LGWR activity.


Another strange effect we've seen comes from a third database, "database 3". This database is on yet another server (IBM pSeries-5), but also has its storage on the same DS6800 storage server.
This database, an "old" 9.2.0.8 database, used to do a forced log switch every 15 minutes, by means of a crontab script. This forced log switch showed up as spikes in Grid Control for this database (the orange "commit" color).
Again these spikes also reflected themselves as I/O waits on the RAC (database 2).
Even when we changed the execution time in the crontab of database 3, the spikes on database 2 moved accordingly. Because the application running on the RAC, our webshop, suffers badly from these I/O waits, we turned these forced log switches off.


Now, have any of you DBAs or storage guys out there seen this kind of behaviour? The storage guys here at the company can't believe that databases affect each other like this by means of the SAN. The SAN has a huge cache, it is tuned, etc. etc., so theoretically this kind of behaviour should not be possible.
In my opinion the LGWR "log sync" actions cause the storage server to wait for a synchronous action from the database, freezing or delaying all other actions with it.

Any other ideas are welcome. Solutions of course are even more welcome.

Thanks



Add-on: As of this week (Oct 5th, 2011), the storage of database 2 has been moved to a newly purchased SAN storage, totally isolating it from the other storage (and thus from the other databases).
The interference effect is now gone!!


Tuesday, May 3, 2011

Grid Control 11g screens freeze

Last year we installed Oracle Enterprise Manager Grid Control 11g.
The main reason was the fact that we were (still) on version 10.2.0.3 and we experienced some trouble with that version. Also, Oracle announced the problem with this version regarding the end-of-life of the SSL certificate on Dec 31st, 2010.

The installation went very smoothly. See also my other posts about this subject.

The only problem we encountered with this new version was that the 'Performance' graphs froze after a restart of one of the monitored databases.
It is expected that there is no data anymore for the graph to build on after a database shutdown, but you would expect the graph to simply show nothing while still advancing in time.

This was not the case.

Even after the target database was restarted, the graph still kept showing the status from before the shutdown. Nothing could be done to get past this, except a complete restart of Grid Control.
After such a complete restart, the graph refreshed and started showing real-time information again.


We started a Service Request at Oracle Support for this, and thanks to DITO (Demo It To Oracle) we were able to clearly show Oracle Support what happened. With CamStudio I recorded a complete movie of the problem and sent it to Oracle Support.
They immediately recognized the problem as bug 10307099 and supplied a patch for it.

This solved the problem.
I was very happy to see the power of CamStudio in helping me clearly show/explain the problem to Oracle Support. This encouraged me to use it in other cases as well. And it is free!!!


Monday, March 21, 2011

After node restart RAC cluster does not start

Today we have been very busy trying to figure out why, after a system restart, the RAC nodes didn't want to start the cluster stack anymore.

The biggest problem we had was that there was absolutely no logging at all!!
The cluster stack simply would not start.

All checks to get the cluster online were - in our opinion - successful.
We were able to create an ocrdump and query the voting disks.

Finally we started to dig into the processes that were running.
We found a process called:

/etc/init.cssd startcheck

This process seemed to hang, waiting for something.
Looking into the script, we read that this function checks whether all needed resources are available and, as long as they are not, sleeps for 60 seconds.
We detected that it used the AIX logging system; we did not, however, "catch" these logging messages by means of the syslog.conf file.

After a while we decided to start a "startcheck" of our own, using the debug option of the Korn shell:

# ksh -x /etc/init.cssd startcheck

The result showed that there was some logging in files in /tmp called cssXXXX, where XXXX is a numeric value.

Looking into the most recent of these files, it showed that the voting disks were missing.
This is in contradiction to the

# $CRS_HOME/bin/crsctl query votedisk

which still showed everything was fine with the voting disks.

The real problem was that the 'crsctl' command just checks whether the device files are present and readable.
Whether there actually were disks attached to these device files was not checked.

It turned out that a major network problem we had two days earlier had disturbed something in the SAN network, making it impossible to connect the logical drives to this server.


What I totally do not like here is the complete lack of any logging in the usual CRS logging location, $CRS_HOME/log/.
Maybe in future releases (we are using 11.1.0.7) this situation will be handled better.


Thursday, January 20, 2011

Copying Archived Logfile from ASM to a file system

Following OTN thread http://forums.oracle.com/forums/thread.jspa?threadID=2154801&tstart=0 about copying archived logfiles from ASM to a UNIX filesystem, I am posting a script we used in the past.
Nowadays we use RMAN to get the files from ASM directly into a TSM diskpool, but back then we used another method for putting the archived logfiles on backup, which resulted in the need for this script:


# Script to copy archived logfiles from ASM to a file system on UNIX
# In this case all archived logfiles found in v$archived_log are copied to a staging mount point /opt/asmprd/ar01
# The username/password used to access the database need to be set before use of this script.
# For sake of security I removed these lines from this script
#
# The script needs to be run as owner of the ASM instance
#
#
export ORACLE_SID=WCSPRD
export FSFULL=FALSE
export ORACLE_BASE=/opt/$LOGNAME/ora
export CRS_HOME=/opt/crsprd/ora/11.1.0
export ORA_CRS_HOME=$CRS_HOME
export ASM_HOME=/opt/asmprd/ora/11.1.0
export ORA_ASM_HOME=$ASM_HOME
export ORACLE_HOME=$ORACLE_BASE/11.1.0
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$CRS_HOME/lib:$ORACLE_HOME/lib32:$CRS_HOME/lib32
export LIBPATH=$LD_LIBRARY_PATH
export PATH=$ORACLE_HOME/bin:$CRS_HOME/bin:$PATH
export TNS_ADMIN=/opt/oraadmin/network/admin
export ORACLE_SID=+ASM
export HOST=`hostname`
export LOG_FILE=/var/adm/log/StageArchives_${HOST}.log
export A_FILE_COPIED=FALSE
#
# Optionally turn Debug on to see what the script does
export Debug=false
#
# Functions
#
Log() {
echo "`date +%Y/%m/%d` (`date +%H:%M:%S`) : $1" | tee -a $LOG_FILE
} #End Log
#
#
Continue()
{
echo "Continue....? [n]\c"; read ans
case $ans in
y|Y) echo "\n\nContinuing....";;
*) exit 0;;
esac
unset ans
}
#
#
GetArchiveList()
{
echo "set termout off heading off pages 9999 lines 70 feedback off" > /tmp/getArchives.$$.sql
echo "column name format A70;" >> /tmp/getArchives.$$.sql
echo "spool /tmp/archivelist.lst;" >> /tmp/getArchives.$$.sql
echo "select trim(name) from v\$archived_log where deleted='NO' and archived='YES';" >> /tmp/getArchives.$$.sql
echo "quit;" >> /tmp/getArchives.$$.sql
$ORACLE_HOME/bin/sqlplus -S "${WCSPRD_SYS}@WCSPRD as sysdba" @/tmp/getArchives.$$.sql

if [ "$Debug" = "true" ]
then
cat /tmp/getArchives.$$.sql
Continue
fi
rm -f /tmp/getArchives.$$.sql
}


StoreArchivesInFS()
{
ArchName=$1
ArchShortName=`basename $1`
ArchDest=/opt/asmprd/ar01/${ArchShortName}
AvailFS=`df -gv /opt/asmprd/ar01 | grep -v Free| awk '{print $5}' | cut -d'%' -f1`
Log "Archiving $ArchDest"

if [ "$Debug" = "true" ]
then
echo "AvailFS = $AvailFS"
echo "Commando = $ASM_HOME/bin/asmcmd cp $ArchName $ArchDest"
Continue
fi
if [ $AvailFS -lt 90 ]
then
echo "$ASM_HOME/bin/asmcmd cp $ArchName $ArchDest"
$ASM_HOME/bin/asmcmd cp $ArchName $ArchDest >/dev/null 2>&1

if [ -f "$ArchDest" ]
then
export A_FILE_COPIED=TRUE
$ASM_HOME/bin/asmcmd rm $ArchName >/dev/null 2>&1
Log "$ArchName has been deleted from ASM."
else
echo "ArchDest=$ArchDest"
ls -l $ArchDest
fi
else
export FSFULL=TRUE
fi
}


RmanCrosscheck()
{
# Crosscheck and clean repository
#Log "Crosscheck and clean repository"
$ORACLE_HOME/bin/rman target ${WCSPRD_SYS}@WCSPRD rcvcat ${TSMORA_TSMCAT}@tsmora cmdfile StageArchives.cmd
}


# Main Program
PROG_PID=$$
RUNNING=`ps -ef| grep StageArchives.sh | grep -v ${PROG_PID} | grep -v grep | awk '{print $2}'`
if [ -n "$RUNNING" ]
then
Log "Old Running session(s) found with PID(s): $RUNNING "
ps -ef| grep ${RUNNING} | grep -v grep >> $LOG_FILE
for tpid in $RUNNING
do
kill $tpid
done
fi
GetArchiveList
cat /tmp/archivelist.lst | while read ARCHFILE
do
if [ ! "$ARCHFILE" = "" ]
then
StoreArchivesInFS $ARCHFILE
if [ "$FSFULL" = "TRUE" ]
then
#
# Insert some action here to clean up the destination file system
: # no-op placeholder; the shell requires at least one command in this branch
fi
fi
done
#
#
#

#
# Use Rman to update the recovery catalog
RmanCrosscheck

unset FSFULL ARCHFILE
#rm -f /tmp/archivelist.lst


The script contains a call to 'rman' using the command file 'StageArchives.cmd'.
This file looks like:

allocate channel for maintenance type disk;
crosscheck archivelog all;
delete noprompt expired archivelog all;
quit;


Please note that I am not able to give any support to anyone who wants to use this script.
You are allowed to copy and use it, but for the remainder you're on your own.


Success!

Tuesday, January 11, 2011

Service XDB is down alert in Grid Control

Problem:
Grid Control is reporting the XDB service as being down for a RAC database.
In this post we investigate why this alert is thrown. We are using a RAC database called RACTST to show all settings and SQL queries.
It must be said that we haven't seen this alert being thrown for single-instance databases. The metric isn't even available in Grid Control there.


Background:
When creating a database with 'dbca', one can choose whether to use a shared server setup. Even when shared server is disabled during database creation, the 'dispatchers' parameter is still set afterwards.
For anyone who is unfamiliar with the shared server concept, consult the Oracle Database Administrator's Guide at: http://download.oracle.com/docs/cd/B28359_01/server.111/b28310/manproc003.htm#i1010000
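
Whether the parameter is set on your own system is easily checked with:

SQL> show parameter dispatchers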


What do we see:
The default value for the dispatchers parameter looks like:
dispatchers='(PROTOCOL=TCP) (SERVICE=RACTSTXDB)'

In this parameter the service 'RACTSTXDB' is configured. This service is then automatically made available. The service availability can be checked by querying the gv$services view:

SQL> select inst_id,name from gv$services where name='RACTSTXDB';

   INST_ID NAME
---------- ----------------------------------------------------------------
         1 RACTSTXDB
         2 RACTSTXDB

Because this is a 2-node RAC it is clearly shown that the service is available on both nodes.


When does the alert occur ?
Most DBAs disable the 'dispatchers' parameter whenever they choose not to use shared servers. Disabling the dispatchers is easily done by resetting the parameter in the spfile:

SQL> alter system reset dispatchers scope=spfile sid='*';

System altered.

It is also possible to turn the dispatchers off dynamically afterwards, by setting the parameter to an empty value in memory:

SQL> alter system set dispatchers='' scope=memory sid='*';

One can also restart the instance(s) to deactivate the parameter after the spfile change.


Now wait for approx. 5 minutes, and voila, there is the alert:

[screenshot: the 'Service XDB is down' alert in Grid Control]

Clicking on the alert shows a bit more detail about when the alert occurred:

[screenshot: the alert details, showing the time the alert occurred]


A check in gv$services confirms the service being 'down' ( in fact it is gone):

SQL> select inst_id,name from gv$services where name='RACTSTXDB';

no rows selected.


The service is, however, still available in the database. A select from dba_services still shows it as being known:

SQL> select name from dba_services where name='RACTSTXDB';
NAME
----------------------------------------------------------------
RACTSTXDB
1 row selected.


Solution:
Disabling the alert in Grid Control would be a permanent "solution".
But that's more of a non-DBA workaround: you wouldn't be warned about any other real service being down.

There are three solutions to get rid of the alert:
- Either re-enable the dispatcher, or
- Remove the service from the database, or
- Learn to live with the alert being there.

The first is easy: put back the 'dispatchers' parameter in the spfile with 'scope=both', and you're done. But stubborn as we are, DBAs never give up.
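For our RACTST example that looks like:

SQL> alter system set dispatchers='(PROTOCOL=TCP) (SERVICE=RACTSTXDB)' scope=both sid='*';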

As for the second option: as said in the 'Problem' section, the alert is only thrown for RAC databases. For single instances, the metric isn't even available in Grid Control.
The RAC FAQ on Oracle Support says:

When I look at ALL_SERVICES view in my database I see services I did not create, what are they for?
You will always see a default database service that is the same name as your database. This service is available on all instances in the cluster. You will also see two services used by the database SYS$BACKGROUND (for background processes) and SYS$USERS (users who connect via BEQ or without using a service_name). You may also see services that end with XDB which are created for the XML DB feature and you will not be able to manage these services.


The meaning of the last phrase is that this service is not manageable with srvctl; it is unknown as a cluster resource. It also says that it belongs to the XML DB feature, so before proceeding with this post, consult MOS Note 416312.1 and the Oracle Forums thread: http://forums.oracle.com/forums/thread.jspa?threadID=410714&tstart=0


If you're sure you're not using XML-Db features, then you are OK to drop the XDB service.
Use the following command to do so:

SQL> exec dbms_service.delete_service('RACTSTXDB');

PL/SQL procedure successfully completed.

This will drop the service, and the alert in Grid Control will subsequently be cleared.
And of course, as real DBAs we first test this on our TEST systems, don't we??


The last option, 'learn to live with it', is valid for those users that do use the XML DB feature and are not able to drop the service. For them I would almost recommend following solution 1: it is easier to cope with a running but unused dispatcher than with a Grid Control alert that sits in your face every day.



Disclaimer:
Nowhere did I find a reason why we should not be able to drop this service if we are not using XML DB features. Therefore the real impact is not completely known. If anyone can elaborate on this subject, you're most welcome to do so.
And eh... if you mess up your production system by following the actions in this post, start blaming yourself, not me. All of the above was done on a TEST server.