Search this Blog

Friday, December 3, 2010

AIX Agent11g patch => Bug Solved!!

09-12-2010:
Oracle has fixed the "below" bug and release a new Agent11g installation package for AIX on OTN. It is preferred to use this new package above the workaround below.


03-12-2010:
Regarding the Agent install error:

Checking Temp space: must be greater than 150 MB. Actual 541 MB Passed
Checking swap space: must be greater than 150 MB. Actual 4096 MB Passed
Preparing to launch Oracle Universal Installer from /tmp/OraInstall2010-12-02_02-01-58PM. Please wait ...xxxxx agent> Exception in thread "main" java.lang.NoClassDefFoundError: oracle.sysman.oii.oiix.OiixNetOps
at oracle.sysman.install.oneclick.EMGCInstallStaticVariables.(EMGCInstallStaticVariables.java:119)
at java.lang.J9VMInternals.initializeImpl(Native Method)
at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
at oracle.sysman.install.oneclick.EMGCInstaller.parseCommandLineArgs(EMGCInstaller.java:506)
at oracle.sysman.install.oneclick.EMGCInstaller.init(EMGCInstaller.java:204)
at oracle.sysman.install.oneclick.EMGCInstaller.main(EMGCInstaller.java:1042)
Caused by: java.lang.ClassNotFoundException: oracle.sysman.oii.oiix.OiixNetOps
at java.net.URLClassLoader.findClass(URLClassLoader.java:421)
at java.lang.ClassLoader.loadClass(ClassLoader.java:643)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:346)
at java.lang.ClassLoader.loadClass(ClassLoader.java:609)
... 6 more


Oracle support has supplied us with a workaround TAR file.
You can find it here: http://www.mediafire.com/file/17apim8h8lci88q/ouicore.tar

The workaround steps are:

=== ODM Action Plan ===
1. Download the ouicore.tar file attached to the SR.
2. cp ouicore.tar
3. cd
4. tar -xvf ouicore.tar
5. ls -al aix/agent/stage/Components/oracle.swd.oui.core/11.1.0.8.0/1/DataFiles
-> this should show 5 filegroup jar files and directory Expanded
6. Start the installation

Success !!!

Thursday, December 2, 2010

OMSCA Fails during installation of Grid Control 11g on AIX

During the installation of Grid Control 11g on AIX 5.3TL10 we saw that during the configuration process the OMS Configuration Assistant (omsca) failed for an unknown reason.

I would like to bring THE solution to the attention of all you installers.
The installation can be found, thanks to Arjun Raja on his blog: http://arjunraja.wordpress.com/

Please note that this solution is only valid for AIX systems

Friday, November 12, 2010

RMAN 11g suddenly fails on AIX, nothing changed

Okay, after 1.5 hours digging into the system (and my brains) we have solved yet another mysterious RMAN problem.


What was the case?

After a successfull backup last night, rman suddenly fails to allocate a tape channel while trying to backup archivelogs. The tape channel in this case points to a TSM management layer.
The error message shown was:


RMAN> run {
2> allocate channel ch1 type 'sbt_tape' parms
3> 'ENV=(TDPO_OPTFILE=/usr/tivoli/tsm/client/oracle/bin64/tdpo.opt)';
4> backup
5> incremental level 0
6> tag arch_bk
7> format 'arch_%d_t%t_s%s_p%p'
8> (archivelog all delete input );
9> sql 'ALTER SYSTEM ARCHIVE LOG CURRENT';
10> }
11>
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of allocate command on ch1 channel at 11/12/2010 08:34:42
ORA-19554: error allocating device, device type: SBT_TAPE, device name:
ORA-27000: skgfqsbi: failed to initialize storage subsystem (SBT) layer
IBM AIX RISC System/6000 Error: 106: Reserved errno was encountered
Additional information: 7011
ORA-19511: Error received from media manager layer, error text:
SBT error = 7011, errno = 106, sbtopen: system error

When searching on Google or Oracle Support, the combination of ORA-19554, ORA-27000,ORA-27000 and IBM AIX RISC System/6000 Error: 106 usually point to problems with writing the TSM logfiles ( tdpoerror.log, dsmerror.log). Most of the problem then are solved by setting correct permissions for these files, making the Oracle owner able to write these files.

In our case the permissions were OK, and as said above the backup of last night had run successfully.
Trying manually to establish a channel connection also failed.

At this point I had discussedthe problem with the tape-storage guys about what could be wrong here. They told me TSM was working OK, no problem whatsoever.
They agreed it had to be something on the Oracle server, simply because of the fact RMAN suddenly stopped working, even when nothing had changed.

In the tdpo.opt file I suddenly found the TDPO_TRACE_FILE option be turned on.
This tracefile pointed to a file called 'tdpo.out'

In this file I saw the following entry:

11/12/10 08:27:27.086 [1531908] [1] : init2.cpp ( 359): sbtinit(): Rman Trace File: '/opt/wcsprd/ora/diag/rdbms/wcsprd/WCSPRD1/trace/sbtio.log'

This meant that the one attempt this morning that succeeded to allocate a channel manually, logged this entry in the trace file.
All other attempts allocating a channel hadn't logged this message.

This combined with the :

IBM AIX RISC System/6000 Error: 106: Reserved errno was encountered
Additional information: 7011
ORA-19511: Error received from media manager layer, error text:
SBT error = 7011, errno = 106, sbtopen: system error

I went looking into the trace directory where sbtio.log was located.
In this directory a list of 6886 tracefiles was found. In my opinion this should not be a problem while trying to open a file there.

However it was the cause of the problem !!!

After cleaning this directory, using the adrci tool, RMAN now works again.


Conclusion:

RMAN is a very powerfull tool for making online backups.
It also sometimes is a very misterious tool.
It sometimes stops working when nothing has changed in the configuration, displaying an error message that is not always very clear.
The Automatic Diagnostic Repository (ADR) is a very usefull feature of Oracle 11g. It centralizes all database related logfiles into one location, making problem searching easier.
That this tool however can turn into problems for RMAN, I never would have expected.
Luckily Oracle has included the ADRCI ( ADR commandline interface ) that enables the DBA to easily clean out the ADR.
We now have scheduled a daily cleanout of the ADR, simply to avoid RMAN problems


Finally one remark:
In my honest opinion the fact that there were only 6886 files in the trace directory, should not have caused such a problem for RMAN. The Oracle owner's ulimit permissions are all set to unlimited where it comes to 'number of open files'.







Friday, October 15, 2010

Under the hood of ASM

With Oracle ASM the Oracle DBA has a new very powerfull storage medium where he/she can put the database files.
However ASM already was introduced in Oracle 10g, for a lot of DBA's it is still a black-box with only very small holes in it where you can peek in.

In the beginning, I must confess I was very suspicious about ASM, especially because of being a black-box.
However it occured to me very soon that it looks like a black-box because there is nothing much to see.
ASM is very limited in what it shows to the DBA, and even now I think the word 'limited' does not fit here. The word 'basic' is more fitting. ASM shows only basic information of what it is housing and what it is doing.
For showing this information ASM has a couple of views available. These are the only views available. In contradiction to a "normal" database instance, all dictionary views are not available in ASM instances.
The following views are available:


View NameDescription
V$ASM_DISKGROUPLists all available diskgroups with their characteristics
V$ASM_DISKGROUP_STATSynonym for v$asm_diskgroup
V$ASM_DISKLists all disks available to ASM that match the ASM disk_string parameter
V$ASM_DISK_STATSynonym for v$asm_disk
V$ASM_FILEContains an entry for every file stored in ASM
V$ASM_ALIASLists the known file names that link to ASM files
V$ASM_CLIENTLists the databases that connect to this ASM instance
V$OPERATIONLists long running ASM operations, like rebalance actions


Last week we had an issue, where ASM showed very high I/O activity on 2 of the 9 disks in a diskgroup. The behaviour could not be explained by the activity in the application. It looked like ASM was either doing a large rebalance on its own, or we were slowly running into a problem with the spreading of the data.

Unfortunately the above views can't give you an overview about where the data really is put, i.e. on which disk.
As everyone knows ASM uses striping technology, to spread the data across all disks.
It however looked like we had hit some hot block in a table(space) that was only stored on these two disks.

Using my most famous library "Google", after a while I found a kind of hidden view. That is, all above views have a system view which has the similar information.
There is however 1 view that hasn't a 'user readable' alias.

This view contains information about the striping of the database objects across the disks of a diskgroup. This view is called x$kffxp and it can be joined with the other views to get the needed information.
However our "problem" stopped after a couple of hours I did check the seperation of the serveral tablespace blocks across the disks. There was nothing abnormal found, however if there had been something abnormal, it might be that the "problem" was indeed a rebalance, solving this inbalance.

I used the following script to check the balance of the data across the disks:

set pages 9999 lines 200
column name format a40
select a.name, b.disk_kffxp disk, count(disk_kffxp) blocks
from
v$asm_alias a
, x$kffxp b
, v$asm_file c
where
a.group_number=b.group_kffxp
and a.group_number=c.group_number
and a.file_number=c.file_number
and a.file_number=b.number_kffxp
and c.type in ('DATAFILE','TEMPFILE','ONLINELOG')
group by a.name, b.disk_kffxp
order by a.name, count(disk_kffxp) desc;

Output then could look like ( for the SYSAUX tablespace ):

NAME DISKNR BLOCKS
SYSAUX.338.715000059 6 242
SYSAUX.338.715000059 7 229
SYSAUX.338.715000059 2 218
SYSAUX.338.715000059 5 216
SYSAUX.338.715000059 0 213
SYSAUX.338.715000059 1 211
SYSAUX.338.715000059 4 211
SYSAUX.338.715000059 3 207
SYSAUX.338.715000059 8 181

Conclusion:
The SYSAUX tablespace is reasonably balanced, although the difference between disk 8 and disk 6 is more than 40 blocks stored.

Thursday, October 14, 2010

NOTE: Unident of disk

In our tracefile of the ASM rebalancer process, every 5 minutes the following line is shown:

NOTE:Unident of disk:/dev/asmspfile

There is not really much information to be found on what this can mean, even on My Oracle Support there is no information found.

I am guessing this message is explained by the fact that the diskname matches our asm_diskstring:

SQL> show parameter asm_diskstring

NAME TYPE VALUE
------------------------------
asm_diskstring string /dev/asm*


but in fact is not part of any (active) diskgroup
In our case this disk, or raw device, contains the shared spfile of the ASM instances across our RAC nodes.

If this assumption is true, then this message can be safely ignored.


What is the risk:

If you don't keep a good administration of what disks have which function, or you mess up with the disk naming convention anyway, the risk exists that you'll add disks to a diskgroup that might be in use for other functions.
Like in our case, if I would not know this disk contains the ASM spfile, it would be very easy to add that disk to a diskgroup. And I doubt if ASM would complain doing so.

Thursday, September 23, 2010

Using Oracle Configuration Manager

In this section the installation/configuration of Oracle's Configuration Manager is explained.

What is this Configuration Manager?

Oracle's Configuration Manager is a tool that enables users to link their Oracle database configuration to an account on the Oracle Support site.

The advantages are huge !!

By linking your system to the Oracle Support Site the system is analyzed, checked and monitored constantly. This means that Oracle Support is able to warn you up front of any problem, risk or abnomality in your configuration.

Next to that the Oracle Support Site offers you - the customer - the possibility to browse your own system in a graphical manner.
If you browse to the Support Site a Window is shown, called "Systems".

This window will be empty in the beginning, because you didn't link any systems to Oracle Support before as shown here:






Also other windows like 'Patch and Health Recommendations' show no information.


Setup Configuration Manager:

In order for configuration Manager to start doing his job, make sure that:

  • There is an Internet connection from the database server configured.
  • You are logged in as the owner of the Oracle database.
Browse to the directory $ORACLE_HOME/ccr.
This is the base directory for the Configuration Manager.

$ cd $ORACLE_HOME/ccr

Now start the Setup of the Configuration Manager :

$ ./bin/setupCCR

The system will ask you for a valid Oracle Support Account. This account should be able to manage systems on the MOS site.

Provide your email address to be informed of security issues, install and
initiate Oracle Configuration Manager. Easier for you if you use your MetaLink
Email address/User Name.
Visit http://www.oracle.com/support/policies.html for details.
Email address/User Name: myaccount@mycompany.com
Provide your MetaLink password to receive security updates via your MetaLink account.
Password (optional):


The system starts configuring "Configuration Manager"

** Installing base package **
Deploying core - Version 10.3.0.1.0

** Registering installation with Oracle Configuration Manager server(s) **
Deploying engines - Version 10.2.7.1.0
Deploying metricdata - Version 10.2.4.0.3
Deploying scripts - Version 10.3.0.1.0
Deploying urda - Version 4.14.8.12.2

** Getting package updates from ContentServer **

Oracle Configuration Manager has been configured in connected mode. If the
target ORACLE_HOME is running a database, please refer to the
"Post-installation Database Configuration" section of the OCM Installation
and Administration Guide
(http://www.oracle.com/technology/documentation/ocm.html) to complete the
installation.

View configuration data reports and access valuable configuration best
practices by going to MetaLink.

After the configuration is finished, the database needs to be configured too. Before running the configuration script make sure that the parameter UTL_FILE_DIR contains access to the $ORACLE_HOME/ccr/state directory. It might be needed to bounce the database since this parameter is a static parameter.

Now run the database configuration script. You might be prompted for the database instance name and/or the password. If ORACLE_SID is set and the database can be reached through OS-authentication, the script will just configure that database, without asking anything:

$ ./admin/scripts/installCCRSQL.sh collectconfig
Successfully installed collectconfig in the database with SID=ORCL.

Now run the first collection to be uploaded to Oracle Support.

./bin/emCCR collect

This forces the Configuration Manager to upload the current configuration. At this time there is also a daemon running that will upload configurations once a day. The output of the above statement should look like:

$ ./bin/emCCR collect
Oracle Configuration Manager - Release: 10.3.3.1.0 - Production
Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
------------------------------------------------------------------
Collection and upload done.


This completes the installation.

Now login to the Oracle Support Site using the same account as used during the setup. As soon as the Dashboard appears, we are looking for a Window called "Task: Associate Collections". If this window is now shown, click on the link "Customize Page..." and drag this window to a slot on the Dashboard.

Note: This window will only stay visible if there any Collections to be Associated, so probably the window will not be shown initially, unless there is already one configuration uploaded and waiting to be associated.
This window is used to link a System to a Support Identifier. Please take care, because this action can only be done once and can not be reverted !!!

The window looks like below:



Select the correct support Identifier from the pull-down menu saying "Select a Support Identifier" and Click on the "Associate" button in the bottom corner of the Window.

As soon as all the Associations are successfull, this Task window will dissappear again.

This completes the setup of the Configuration Manager. Now you're able to browse the System and receive Patch warnings.

Friday, September 17, 2010

Install Oracle Enterprise Manager Grid Control 11g on IBM AIX 5.3

In this thread the installation of Oracle Enterprise Manager Grid Control 11g is explained. The installation takes place on an IBM Pseries server running AIX 5.3 TL-10.


Server requirements:

For setting up a Grid Control server using Oracle Enterprise Manager 11gR1 on AIX 5.3 a server is needed with approx. the following specs:

Operating system: AIX 5.3 TL10
Available RAM: 6Gb ( depending on the number of databases monitored, here less than 50 databases)
Available diskspace: Approx 50Gb in total.


Software requirements

Following the official installation guide, two software prerequisites are to be present and installed:

The first is the bos.perf.proctools package
The second is a Java 6 SDK. This SDK is needed for installing the Weblogic server
The minimum Service Release is 6. On the server this would show up like:

# /usr/java6_64/jre/bin/java -version
java version "1.6.0"
Java(TM) SE Runtime Environment (build pap6460sr8-20100409_01(SR8))
IBM J9 VM (build 2.4, JRE 1.6.0 IBM J9 2.4 AIX ppc64-64 jvmap6460sr8-20100401_55940 (JIT enabled, AOT enabled)
J9VM - 20100401_055940
JIT - r9_20100401_15339
GC - 20100308_AA)
JCL - 20100408_01

Please note that the above installation has a SR8 update installed.

Oracle recommends to install the Java SDK into a separate directory within the directory that will become "Middelware Home". This is to prevent accidental upgrades from the Java version, which is not supported.
Because the software packages on IBM AIX are installed with the 'smit' tool, they end up in /usr.
To follow the Oracle recommendation, the Java installation is copied into the middleware home directory ( in our case /opt/oemprd/ora/middleware).
$ cp –r /usr/java6_64 /opt/oemprd/ora/middleware/


User requirements

Create an Oracle user who will be the owner of all installations.
Add the following entries to the .profile of this user.
export JAVA_HOME=/opt/oemprd/ora/middleware/java6_64
export PATH=$PATH:$JAVA_HOME/bin
export MW_HOME=/opt/oemprd/ora/middleware/
Note that the Weblogic server will be installed in $MW_HOME/wlserver_10.3
For all below installations, this user needs to have its DISPLAY set and a valid X-Display emulator should be running on your PC
Use e.g. the following command to set the DISPLAY to the correct value:

export DISPLAY=$(who -m | cut -d\( -f2 | tr -d " \)"):0

Weblogic Server
The installation manual for Oracle Enterprise Manager 11g dictates a minimum version of Weblogic server 10.3.2. This is THE ONLY SUPPORTED version !!!!
It is important to install the Weblogic server on a server which has no Oracle software installed yet.

Download the Weblogic server from:
http://www.oracle.com/technology/software/products/ias/htdocs/wls_main.html

Unzip the downloaded file into the homedirectory of the "oracle" user.
Start the installer using:
$ java -d64 -jar wls1032_generic.jar

After extraction the welcome screen appears


Click on Next.

Enter the base directory used as "Middleware Home".



Click on Next


Because the Oracle Enterprise Manager 11g is tightly linked to Oracle Support site, we select to enter credentials here.



Click in Next


Confirm the selected Java installation to use for weblogic.
It should state the installation directory copied above in section 2.1.



Click on Next


The installer now shows the installation homes of the two weblogic parts:
Weblogic server and Oracle Coherence.
Leave the given directories as is and click on Next.
After confirming the installation summary, the installation progress screen is displayed.

After the intallation completes, unselect "Quick Start" and press Done.

Patching WebLogic Server

Before the patching of the Weblogic server can be started, make sure you have:

- Access to the Internet from the server you are installing
- Added the following hosts to /etc/hosts

141.146.54.16 support.oracle.com
66.248.199.47 bidslb.beasys.com


Start the patch installer, using:

oemprd@vsv1h151:/opt/oemprd/ora/middleware/utils/bsu (OEMPRD)
$ bsu.sh -log=/tmp/bsu.log

The file /tmp/bsu.log can be optionally used for detailed logging of the actions the program takes.
It is possible that the installer first is going to update itself. You will be prompted for that.

In the below table the update walkthrough is displayed:


Afterwards it downloads the available Weblogic updates will be downloaded.
First you need to enter the credentials to access Oracle Support like shown in the figure.
After supplying the login credentials and having verified them the update installer appears.
Again the login credentials for Oracle Support are asked, as shown here.
We choose here to not enter the credentials, as:

• We don't need any Emails about updates
• There seems to be some kind of problem with the connection. It keeps returning asking for the credentials.

In the "Get Patches" Tab the Weblogic patch "WDJ7" is shown.
Select that patch and click download.
After a couple of seconds the download finishes and the patch dissappears from the screen.


Select the tab "Manage Patches".
In the lower part of the screen the downloaded patch "WDJ7" is shown.
Select the patch and click the "Apply" button.


The installer checks for any conflicts. If correct, no conflicts should be found.
Click OK to start the update


After the patch is installed it will appear in the upper part of the screen. This concludes the patch installations.
Select File, Exit to leave the patch installer.


Installation of Grid database

Install an Oracle 11.1.0.7 release database and patch it with the latest patches.
At the time of writing this was 11.1.0.7.4

For ORACLE_HOME during this installation /opt/oemprd/ora/db/11.1.0 is used
Create a database called OEMPRD which will be used as Catalog for the Enterprise Manager server.
Ensure that all parameters have been set correctly according to the Installation Manual. These parameters are:

*.compatible='11.1.0.0.0'
*.db_block_size=8192
*.db_name='OEMPRD'
*.job_queue_processes=1000
*.log_buffer=104865760
*.open_cursors=450
*.processes=1500
*.remote_login_passwordfile='EXCLUSIVE'
*.session_cached_cursors=400
*.sga_max_size=1G
*.sga_target=1G


Because the installation and configuration of a database should be easy for DBA's working on a Grid Control installation, the procedure is not explained here.

Before continuing, make sure the database and the listener are up and running.



Installing Oracle Enterprise Manager Grid Control

Download and extract the Grid Control installation files.
Enter the directory where the files have been extracted.

Now Download the available updates for the Grid Control installation.

The commands to do so are:
oemprd@vsv1h151:/opt/oemprd/ora/home/staging (OEMPRD)
$ cd install
$ mkdir updates
$ ./utility/downloadSWUpdates -u oracleadmins@kramp.com -p -s ./updates/

Warning: It might be that the downloadSWUpdates program fails. This is because the program "thinks" you're on a 'linux' system. It tries to search for a library file in a non-existing directory called linux.
Solve this by:
$ cd /opt/oemprd/ora/home/staging/install/updates/java/oui/lib
$ ln –s aix linux
$ cd /opt/oemprd/ora/home/staging/install/updates/java/oui/bin/platform/
$ ln –s aix linux

Start the installer using the following syntax:

oemprd@vsv1h151:/opt/oemprd/ora/home/staging (OEMPRD)
$ ./runInstaller -noconfig WLS_DOMAIN_NAME=GridControl

Below the walkthrough of the Grid Control Installation:

In the first screen you're able to enter you 'My Oracle Support' Credentials. It is recommended to do so, enabling the installer to check and download available updates.

Possible screen if Check for "Download and install updates" is selected



Select Installation Type:

Select "Install a new Enterprise Manager System" as the server in this installation does not have an OEM installed.

Click 'Next'




Check Prerequisites

The installer will list all failed prerequisite checks. If needed solve the given problem and confirm if the check runs OK afterwards

Click 'Next'




Enter the installation directory of the Oracle Enterprise manager installation and the 'Middleware Home' location ( the location where Weblogic is installed)

The installer determines the given entries and they are correct most of the time.

Click 'Next'




Weblogic Server Domain.

Since Oracle Enterprise Manager 11g runs on a Weblogic environment, a weblogic domain will be created.

Specify passwords for the given Weblogic user and Node manager.

Click 'Next'




Database credentials.

Enter the database servername, Listener Port, SID and SYS password for access to the database created for this purpose.




Setup the Management repository.

Enter a password for the to be created SYSMAN user.

Confirm the file locations of the to be created tablespaces.

Click 'Next'




Enter passwords for the Management Agent registration.

Keep this password somewhere safe, as it is needed later when installing/configuring the clients

Click 'Next'




The port numbers to be used.

It is smart to keep a print of the screen, or write down the port numbers.

Click 'Next'




Installation Summary.

Click 'Next'



Although the manual states that 400Mb of temp space in /tmp is needed, the installer actually needs 936Mb (or more), as stated on the screen from which the installer started:

Space Requirements

/tmp/ Required 936MB (only as temporary space) : Available 781MB

Increase space in /tmp and afterwards continue the installation by clicking 'Yes'.



Execution of allroot.sh script



After running a root-script for which a dialog box appears, the configuration of the product starts.


If the configuration of OMS fails ( step 5) and the logfile shows:

FINE: Cleaning the OMS from the repository

Sep 22, 2010 11:17:34 AM oracle.sysman.omsca.adapter.wls.OMSWLSAdapter adapterCleanup

FINE: Got the Host URL as: null

Sep 22, 2010 11:17:34 AM oracle.sysman.omsca.framework.OMSGenericAdapter postDeployAndReposSetup

SEVERE: OMSCA-ERR:Securing of OMS failed. Check the trace file:/opt/oemprd/ora/middleware/weblogic/oms11g/cfgtoollogs/omsca/omsca_20

100922105531.log

Sep 22, 2010 11:17:34 AM oracle.sysman.omsca.framework.OMSCAFreshInstall execute

SEVERE: Securing of OMS failed.

Sep 22, 2010 11:17:34 AM oracle.sysman.omsca.framework.OMSConfigAssistantDriver main

FINE: Got resultfalse

then the an unsupported version of WebLogic is installed .

Only version 10.3.2 is supported for Oracle Enterprise Manager 11g ( see Note: 1096766.1)


If the configuration fails at Add-on OMS side configuration, a manual fix needs to be applied to addonca.pl, as described here: http://arjunraja.wordpress.com/

This page says:

While installing Oracle 11g on AIX you could hit this failure

Do not CANCEL or EXIT the installation from the GUI mode.

A check of the logfile reveals this error -

11g Grid Control installation fails at ‘ Add-on OMS Side Configuration ‘ with following error message in

oracle.sysman.plugin.virtualization.oms_2010-06-01_01-24-07-PM.log:

at java.lang.Terminator.setup(Terminator.java:41)

at java.lang.System.completeInitialization(System.java:114)

at java.lang.Thread.(Thread.java:127)

Caused by: java.lang.RuntimeException: Can’t recognise platform – Unix

The following file needs to be backed up and edited as follows

Go to the newly installed Oracle Management server HOME/bin folder -

cp addonca.pl addonca.pl.org

Edit the file -

vi addonca.pl

Look for Unix under osname and change it to AIX -

else

{

$osname = “Unix” ;

}

to

else

{

$osname = “AIX” ;

}

Click on RETRY and the installation completes.





This completes the installation of Grid Control 11g on AIX 5.3

Instructions on how to install the Oracle 11g Management Agent can be found here:
http://managingoracle.blogspot.com/2011/01/installing-oracle-agent11g.html

Tuesday, August 24, 2010

Oracle table reorganization: Online or Offline

When using tables with large amounts of rows and especially after a lot of rows have been deleted from such a table, reorganizing the table may improve the performance of the table.
Oracle knows two types of table reorganizations.

  • Rebuilding the table by means of recreating it

  • Rebuilding the table by shrinking its free space ( Oracle 10g and up )

Below the two methods are explained.


Rebuilding the table by means of recreating it
There are two ways of rebuilding a table by recreating it. The first option is to export all the data into a file using the export utility.
After that truncate ( of recreate ) the table and reload the data back into it. The disadvantage here however is a long downtime.

Another method is moving the table either to another tablespace or within the same tablespace.
The advantage here is that the downtime will me much less, but it also has some disadvantages:


  • The tablespace needs to be able to store this second copy of the table
  • The tables indexes, any depending objects like packages, procedures can become INVALID and might need to be rebuild
  • There will be a table-level exclusive lock on the table involved. Any application using that table will be affected by this lock.

By moving the table to another tablespace temporarily, the DBA is also able to reorganize the tablespace.


Rebuilding the table by shrinking its free space
Starting with Oracle 10g all the above is replaced by a very nice feature called 'SHRINK SPACE';
As of this release, tables can be reorganized online, without affecting the users functionality.
That is:

* The table itself must be in an ASMM tablespace
* The table must have row movement enabled
* At the end of the action a shortt table level lock takes place to adjust the HWM (Highwater Mark) of the table.


And it's easy !!
Just enable 'row movement' and start shrinking tables:

SQL> ALTER TABLE MYTABLE ENABLE ROW MOVEMENT;
This enables row movement of the table. THis is necessary so that oracle is able to actually move the rows to other free extents in order to shrink the space needed.

SQL> ALTER TABLE MYTABLE SHRINK SPACE COMPACT;
This shrinks the used space of the table, but does not update the HWM. This is usefull if you don't want to have a table-level lock during business hours.

SQL> ALTER TABLE MYTABLE SHRINK SPACE;
This command shrinks the contents of the table and subsequently updates the HWM. This statement can also be used after a shrink with 'compact' option has been done, just to update the HWM.

SQL> ALTER TABLE MYTABLE SHRINK SPACE CASCADE;
This command shrinks the contents of the table and all dependent objects like indexes.


High Water Mark:
Now we have mentioned the High Water Mark (HWM) a couple of times. A short explanation:
The HWM is the pointer to that location of a table where on any point in history the table has grown to. If a lot of rows are deleted from a table, a lot of free space exists before the HWM. Doing a full table scan Oracle will not only read the actual rows, but also the empty space up to the HWM.
This is wasted time and is worsening the performance of getting data from that table.
So if from a large table a lot of rows are deleted, e.g. a count(*) will still take as long as before the delete.


Triggers and indexes:
In contradiction to moving a table, or exporting/importing its data, triggers and indexes are NOT affected by a shrink space command. They will remain valid and functional during and after the operation.


Wasted space:
How to determine if a table is a candidate for a reorganization ?

Use the below script to determine if a table has a lot of wasted space.
If it substantial in relation to the total size of the table, then a reorganization ( shrink ) is recommended.

set lines 200;
column owner format a15;
column segment_name format a30;

select
a.owner,
a.segment_name,
a.segment_type,
round(a.bytes/1024/1024,0) MBS,
round((a.bytes-(b.num_rows*b.avg_row_len) )/1024/1024,0) WASTED
from dba_segments a, dba_tables b
where a.owner=b.owner
and a.owner not like 'SYS%'
and a.segment_name = b.table_name
and a.segment_type='TABLE'
group by a.owner, a.segment_name, a.segment_type, round(a.bytes/1024/1024,0) ,round((a.bytes-(b.num_rows*b.avg_row_len) )/1024/1024,0)
having round(bytes/1024/1024,0) >100
order by round(bytes/1024/1024,0) desc ;


The above script is only working if the tables involved have statistics gathered on them


PCTFREE
An extra remark about PCTFREE.
Tables with a high value of PCTFREE show also a higher value of wasted space.
The space reserved for PCTFREE in a block can not be shrinked. So even after a shrink of a table a lot of wasted space can remain.

The effect of PCTFREE also seen before the shrink is executed when looking at the wasted space using the above script.

In an example we have three identical tables with the following structure:

SQL> CREATE TABLE T1 ( ID NUMBER(10), VAL VARCHAR2(1000 CHAR) ) PCTFREE 40;
SQL> CREATE TABLE T2 ( ID NUMBER(10), VAL VARCHAR2(1000 CHAR) ) PCTFREE 10;
SQL> CREATE TABLE T3 ( ID NUMBER(10), VAL VARCHAR2(1000 CHAR) ) PCTFREE 50;


All tables are loaded with the same data, approx. 5.5 million rows.
When running the above script it shows:

SEGMENT_NAMESEGMENT_TYPEMBSWASTED
-------------------------------------------------------------
T1TABLE50642348
T2TABLE3333617
T3TABLE61443428


Just because of the variation in the value of PCTFREE, the total size and wasted space are different per table.

The effect of PCTFREE after the shrink is that not all "wasted space" is released.
The term "wasted space" therefore is not really valid with a high PCTFREE. It is wasted space + the reserved space directly available for updates. Of course above, just after table creation there is no wasted space. It is only reserved (pctfree) space. Therefore - for this example - we are going to delete approx 30% of the rows of all tables.

SQL> delete from t1 where length(val)< 300;

SQL> delete from t2 where length(val)< 300;

SQL> delete from t3 where length(val)< 300;

As soon as the delete is finished we compute stats on these tables in order for the wasted-space script to have a valid calculation base.
Running the script again now shows:

SEGMENT_NAMESEGMENT_TYPEMBSWASTED
-------------------------------------------------------------
T1TABLE50642789
T2TABLE33331058
T3TABLE61443869

After this we are going to 'shrink' the available wasted space, using:

SQL> ALTER TABLE T1 SHRINK SPACE CASCADE;
SQL> ALTER TABLE T2 SHRINK SPACE CASCADE;
SQL> ALTER TABLE T3 SHRINK SPACE CASCADE;

The result is:


SEGMENT_NAMESEGMENT_TYPEMBSWASTED
-------------------------------------------------------------
T1TABLE42451970
T2TABLE2779505
T3TABLE5292

3018

This test clearly shows that even if Oracle or its tools like ADDM inform you about wasted space in a table, it might be that this is because you have a high value for PCTFREE. Shrinking these table might afterwards look like if nothing happened.
Check in that case if PCTFREE is high and if the value of 'wasted space' divided by the 'total size' is equal to that value:

Example:

Table T1: PCTFREE = 40% ==> 1970 / 4245= 46%
Table T3: PCTFREE = 50% ==> 3018/5292 = 57%

So, actually the 'wasted space' of these tables is respectively 6% and 7 %

Wednesday, July 14, 2010

AIX based RAC reboots nodes by itself

Problem statement:

After a couple of weeks running with no problem at all, our 2 node RAC running on Oracle release 11.1.0.7 on AIX 5.3 TL10 platforms started rebooting nodes by itself.


Analysis:

Analyzing the CRS, ASM and Rdbms logfiles, it simply showed the server had terminated. There were no logs to be found of a gracefull shutdown of the cluster resources.

In contradiction to this, the AIX errpt showed a "USER SHUTDOWN" as the reason for the reboot. From the viewpoint of AIX someone or something had issued a system shutdown.
Because the servers didn't crash, but restarted instead, the UNIX admins confirmed it must have been a shutdown command with restart options that brought the system down.
However in such a case, one would expect the Cluster resources to go down gracefully which would have been seen in the CRS logs.

Oracle support confirmed the sudden restart of the system, by seeing a time-gap in the logfiles. The logging of actions suddenly stopped on all resources, and then started again when the cluster resources where coming up again.


Resolving the issue:

Beause we were unable to find a cause of our own, and Oracle support was searching into the direction of a system problem ( believing that not a RAC process had restarted the server ) the created S/R got somehow stuck. There were no new things found to discover.

Only after escalating the S/R to a prio 1, after just another unscheduled reboot, Oracle support started asking about the process status of the 'oprocd' daemon.
It triggered us in starting to browse the internet for information about the 'oprocd' daemon.
Using Note 265769.1 we pinpointed the reason of the reboot must be with oprocd.

Finally we found the logfile that made sense. In /etc/oracle/oprocd we found the logfile from just before the reboot. It clearly stated:

AlarmHandler: timeout(1892 msec) exceeds interval(1000 msec)+margin(500 msec). Rebooting NOW.

What does this mean, and what is 'oprocd' ??

The 'oprocd' process is a processor monitor daemon. It monitors CPU load by setting a timer to 1000ms. After that is gets into a sleep for 1000ms. As soon as it wakes up it checks if the sleep period has been as long as the set timer. If it doesn't the maximum margin to deviate from the set timer is by default 500ms ( half a second ). If it wakes up beyond that margin, the oprocd decides to reboot the system.

Now that we really got a message to work on, we dumped that into Google and it came up with a really helpfull page: http://www.aixmind.com/?p=845
This internet page clearly described all the problems we've had, including all the symptoms and logfile entries.
It explained a couple of steps on how to solve this problem.

We send the page to Oracle support, asking if we could be suffering from that problem. After an hour we received an "Action plan" to solve the problem. The action plan Oracle support send us, matched for 90% the instructions of the page we sent.
Oracle support also informed us that this was a known problem, starting to develop as early as release 10.2.0.4.
They confirmed the default values 'oprocd' is running with ( interval 1000ms and margin 500ms ) are too low. In the newest release - I recon 11.2 - these defaults will be increased.

I think it is unfortunate that it takes three weeks to come with a 'known' problem and then only after we sent a hint to Oracle support.


The solution:
Oracle Support send us the following action plan to be implemented:

1. Execute as root
#crsctl stop crs

2. Execute as root
#/bin/oprocd stop

3. Ensure that Clusterware stack is down by running ps -ef egrep "crsd.binocssd.binevmd.binoprocd". This should return no processes.

4. From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command as root:
#crsctl set css diagwait 13 -force

5. From one node of the cluster execute the following command as root:
#crsctl unset css misscount -force

6. Restart the Oracle Clusterware on all the nodes by executing:
#crsctl start crs

7. Validate that the node is running by executing:
#crsctl check crs

This solution increases the value for the margin from 500ms to 10000ms, so 20x higher.

This will :
a) Give less chance that oprocd get outside of the total time period ( 11000ms ).
b) If there is somehow a problem, which will bring oprocd outside of this time, the system has an addition 9500ms to write more informational logs to disk.


Unfortunately our RAC nodes did not come up correctly after the change.

The 'crs_stat -t' showed:

# ./crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....D1.inst application ONLINE OFFLINE
ora....D2.inst application ONLINE OFFLINE
ora.WCSPRD.db application ONLINE OFFLINE
ora....RD1.srv application ONLINE OFFLINE
ora....RD2.srv application ONLINE OFFLINE
ora....sprd.cs application ONLINE OFFLINE
ora....SM1.asm application ONLINE ONLINE vsv1h171ps
ora....PS.lsnr application ONLINE ONLINE vsv1h171ps
ora....1ps.gsd application ONLINE ONLINE vsv1h171ps
ora....1ps.ons application ONLINE ONLINE vsv1h171ps
ora....1ps.vip application ONLINE ONLINE vsv1h171ps
ora....SM2.asm application ONLINE ONLINE vsv1h181ps
ora....PS.lsnr application ONLINE ONLINE vsv1h181ps
ora....1ps.gsd application ONLINE ONLINE vsv1h181ps
ora....1ps.ons application ONLINE ONLINE vsv1h181ps
ora....1ps.vip application ONLINE ONLINE vsv1h181ps

and the alert_log of the databases kept showing:

Tue Jul 13 22:11:18 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:12:08 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:12:56 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:13:24 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:13:51 2010
Starting ORACLE instance (normal)
Tue Jul 13 22:28:17 2010
Starting ORACLE instance (normal)

The nodes had to be completely rebooted to get up and running.
This might be explained by the fact that oprocd is build into the AIX kernel, which would mean he kernel has to be reloaded after the above changes.