Friday, July 13, 2012

Hash Join Anti NA

Another simple day at the office.....

What was the case.
A colleague approached me, telling me he had two similar queries. One of them returned data, the other did not.
The "simplified" version of the two queries looked like:

SELECT col1
  FROM tab1
 WHERE col1 NOT IN (SELECT col1 FROM tab2);

This query returned no data. However, he (and later on I as well) was sure that there was a mismatch in the data, which should have returned rows.
This was also proven/shown by the second query:

SELECT col1
  FROM tab1
 WHERE NOT EXISTS
          (SELECT col1
             FROM tab2
            WHERE tab1.col1 = tab2.col1);

This query returned the expected difference. And this query does in fact the same thing as the first query!!
Even when we hardcoded an extra WHERE clause, the result was the same. No rows for:

SELECT *
  FROM tab1
 WHERE  tab1.col1 NOT IN (SELECT col1 FROM tab2)
       AND tab1.col1 = 'car';

and the correct rows for:

SELECT *
  FROM tab1
 WHERE     NOT EXISTS
              (SELECT 1
                 FROM tab2
                WHERE tab1.col1 = tab2.col1)
       AND tab1.col1 = 'car';

After an hour of searching and trying to reproduce the issue, I was almost about to give up and send it to Oracle Support, qualifying it as a bug.
However, there was one difference I saw that could be the cause of the problem.
Although the statements are almost the same, the execution plans showed a slight difference. The execution plan for the NOT IN query looked like:

Plan
SELECT STATEMENT ALL_ROWS Cost: 5 Bytes: 808 Cardinality: 2
  3 HASH JOIN ANTI NA Cost: 5 Bytes: 808 Cardinality: 2
    1 TABLE ACCESS FULL TABLE PIM_KRG.TAB1 Cost: 2 Bytes: 606 Cardinality: 3
    2 TABLE ACCESS FULL TABLE PIM_KRG.TAB2 Cost: 2 Bytes: 404 Cardinality: 2

Whereas the execution plan of the query with the NOT EXISTS looked like:

Plan
SELECT STATEMENT ALL_ROWS Cost: 5 Bytes: 808 Cardinality: 2
  3 HASH JOIN ANTI Cost: 5 Bytes: 808 Cardinality: 2
    1 TABLE ACCESS FULL TABLE PIM_KRG.TAB1 Cost: 2 Bytes: 606 Cardinality: 3
    2 TABLE ACCESS FULL TABLE PIM_KRG.TAB2 Cost: 2 Bytes: 404 Cardinality: 2


See the difference?

Not knowing what a "HASH JOIN ANTI NA" exactly was, I entered it as a search term in the My Oracle Support knowledge base. Besides a couple of patch-set lists, I also found Document 1082123.1, which explains all about the HASH JOIN ANTI NULL_AWARE.

In this document the behaviour we saw is explained, with the most important remark being:
'If t2.n2 contains NULLs, do not return any t1 rows and terminate'

And then it suddenly hit me why I had been unable to reproduce the case using my own test tables: they contained no NULL values.

In our case, it meant that if tab2.col1 contained any rows with a NULL value, the join between those two tables could not be made based on a "NOT IN" clause.
The query would terminate without giving any results !!!
And that is exactly what we saw.

The query with the NOT EXISTS does not use a NULL_AWARE ANTI JOIN and therefore does return the results.
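The underlying cause is not optimizer magic but the standard three-valued logic of SQL. A minimal sketch (the literals are made up for illustration) of how a NOT IN predicate expands when the list contains a NULL:

```sql
-- 'car' NOT IN ('bike', NULL) expands to:
--    'car' <> 'bike' AND 'car' <> NULL
-- The comparison 'car' <> NULL evaluates to UNKNOWN, so the whole
-- predicate is UNKNOWN and the row is filtered out.
SELECT CASE
          WHEN 'car' NOT IN ('bike', NULL) THEN 'returned'
          ELSE 'filtered'
       END AS verdict
  FROM DUAL;
```

This is also why NOT EXISTS behaves differently: its correlated equality simply never matches the NULL row, instead of making the whole predicate UNKNOWN.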

Also the mentioned workaround:

alter session set "_optimizer_null_aware_antijoin" = false;

seems not to work. Although the execution plan changes to:

Plan
SELECT STATEMENT ALL_ROWS Cost: 4 Bytes: 202 Cardinality: 1
  3 FILTER
    1 TABLE ACCESS FULL TABLE PIM_KRG.TAB1 Cost: 2 Bytes: 606 Cardinality: 3
    2 TABLE ACCESS FULL TABLE PIM_KRG.TAB2 Cost: 2 Bytes: 404 Cardinality: 2

it still returns no rows !!


And Now??

Since there is a document explaining the behaviour, I doubt we can classify this as a bug. But in my opinion, if developers do not know about this strange behaviour, they will easily call it one.

The "problem" is easily solved (or worked around) using the NOT EXISTS solution, or by using NVL on the joined columns. However, I would expect the optimizer to sort these things out itself.
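As a sketch of those two workarounds against the test tables listed below (the '#NULL#' marker is just an illustrative dummy value):

```sql
-- Workaround 1: exclude the NULLs from the subquery explicitly
SELECT col1
  FROM tab1
 WHERE col1 NOT IN (SELECT col1
                      FROM tab2
                     WHERE col1 IS NOT NULL);

-- Workaround 2: map NULL to a dummy value on both sides with NVL
SELECT col1
  FROM tab1
 WHERE NVL (col1, '#NULL#') NOT IN (SELECT NVL (col1, '#NULL#') FROM tab2);
```

For the test data below, both variants return the expected 'car' row.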

For anyone who wants to reproduce/investigate this case, I have listed my test-code. The database version we used was 11.1.0.7 on Windows 2008 R2. I'm sure the OS doesn't matter here.


-- Create two tables, make sure they allow NULL values
CREATE TABLE tab1 (col1 VARCHAR2 (100) NULL);
CREATE TABLE tab2 (col1 VARCHAR2 (100) NULL);

INSERT INTO tab1
VALUES ('bike');

INSERT INTO tab1
VALUES ('car');

INSERT INTO tab1
VALUES (NULL);

INSERT INTO tab2
VALUES ('bike');

INSERT INTO tab2
VALUES (NULL);

COMMIT;

-- This query returns No results
SELECT col1
  FROM tab1
 WHERE col1 NOT IN (SELECT col1 FROM tab2);

-- This query returns results
SELECT col1
  FROM tab1
 WHERE NOT EXISTS
          (SELECT col1
             FROM tab2
            WHERE tab1.col1 = tab2.col1);



Success!!


Tuesday, May 15, 2012

Interval partitioning


Interval partitioning is a partitioning method introduced in Oracle 11g. It is a helpful addition to range partitioning, where Oracle automatically creates a new partition when an inserted value exceeds all existing partition ranges.
The following restrictions apply:
·         You can only specify one partitioning key column, and it must be of NUMBER or DATE type.
·         Interval partitioning is NOT supported for index-organized tables.
·         You can NOT create a domain index on an interval-partitioned table.
·         Partition names are generated automatically with the syntax “SYS_xxxx”

Interval partitioning based on a numeric value needs a correct interval definition. For a numeric key such as the well-known “monthcode” value, a create table statement would look like:

CREATE TABLE autopartition
(
   monthcode        NUMBER(6)
,  employee          VARCHAR2 (40)
,  salary             NUMBER (10)
)
PARTITION BY RANGE
   (monthcode)
   INTERVAL ( 1)
   (
      PARTITION p_first
         VALUES LESS THAN (201202));

In the above table, only the first partition is named. If we insert data beyond the partition definition, a new partition is created.
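A sketch of that behaviour (the employee values are made up; the generated SYS_Pnn partition name will differ per database):

```sql
-- 201203 lies beyond P_FIRST's upper bound, so Oracle silently
-- creates a new interval partition for it on insert
INSERT INTO autopartition (monthcode, employee, salary)
     VALUES (201203, 'John Doe', 1000);
COMMIT;

-- Verify: a generated SYS_Pnn partition now appears next to P_FIRST
SELECT partition_name, high_value
  FROM user_tab_partitions
 WHERE table_name = 'AUTOPARTITION';
```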
Because the partitions are named automatically, Oracle has added new syntax to reference specific partitions effectively, using either the generated name:

select
   *
from
   autopartition partition (SYS_P62);

or a value that falls within the range of a partition:

select
   *
from
   autopartition partition for (201201);

Note that, because the partitions are named automatically with incremental numbering, partition SYS_P62 can belong to table A, whereas partition SYS_P63 can belong to table B, and so on.



Interval partitioning for DATE columns can be based on years or months, using the NUMTOYMINTERVAL function in the INTERVAL clause of the table definition.
For interval partitioning on a more fine-grained level, like days, a function called NUMTODSINTERVAL is available.
Creating an interval-partitioned table, separated by month, then looks like:

CREATE TABLE autopartitionmonth
(
   saldate        DATE
,  employee       VARCHAR2 (40)
,  salary         NUMBER (10)
)
PARTITION BY RANGE
   (saldate)
   INTERVAL ( NUMTOYMINTERVAL (1, 'MONTH') )
   (
      PARTITION p_first
         VALUES LESS THAN (TO_DATE ('01-01-2012', 'DD-MM-YYYY')));

Or, with a separation of 1 day:


CREATE TABLE autopartitionday
(
   saldate        DATE
,  employee       VARCHAR2 (40)
,  salary         NUMBER (10)
)
PARTITION BY RANGE
   (saldate)
   INTERVAL ( NUMTODSINTERVAL (1, 'DAY') )
   (
      PARTITION p_first
         VALUES LESS THAN (TO_DATE ('01-01-2012', 'DD-MM-YYYY')));



Conclusions:
Interval partitioning can be used to automatically add new partitions when data is inserted into a partitioned table that is out of range of the available partitions.
We can easily switch to normal manual partitioning by omitting the "INTERVAL ( NUMTODSINTERVAL (1, 'DAY') )" line in the create statements.
Advantages:
·         No more worries about partitions that need to be created; Oracle takes care of it automatically.
·         A "DEFAULT" partition for all values outside the existing partitions is no longer necessary. In old releases this partition often worked like a trashcan.
Disadvantage:
·         Partition naming is automatic, which means that subsequently created partition names can belong to different tables.
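Related to switching back to manual partitioning: for an existing table there is also a way back without recreating it. A sketch (SET INTERVAL syntax as I understand it; verify it on your own version):

```sql
-- Turn an interval-partitioned table back into a plain
-- range-partitioned one; existing partitions are kept
ALTER TABLE autopartition SET INTERVAL ();

-- Re-enable interval partitioning with a (possibly new) interval
ALTER TABLE autopartition SET INTERVAL (1);
```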




Index partitioning


When using indexes on partitioned tables, make sure to use LOCAL indexes as much as possible. This means that the index is also set up as a partitioned index, with a manageable part on each table partition.
Where in MISPRD all indexes were dropped before a certain table was loaded, in Oracle 11g it is now possible to set a certain index, or part of an index, to "UNUSABLE". This means the index will not be updated during table changes, and will also not be used during selects from that table. As soon as the table changes are completed, the disabled index or index parts can simply be rebuilt.

An example for the “autopartition” table above.
Let’s assume we want to insert/update thousands of records for the current month (201205) in this table.
Since we know the current month, we can use a statement like the one below to disable the index for that particular partition:

ALTER TABLE autopartition MODIFY PARTITION FOR (201205) UNUSABLE LOCAL INDEXES; 

This sets the index for that partition to UNUSABLE. This is confirmed by selecting the partition name with the high_value from the dba_ind_partitions view, like:

SELECT index_name, partition_name, high_value
  FROM dba_ind_partitions
 WHERE status = 'UNUSABLE';

The result looks like:

INDEX_NAME        PARTITION_NAME  HIGH_VALUE
----------------  --------------  ----------
AUTOPARTITION_X1  SYS_P69         201206

As soon as the load action is completed, all index partitions that are set to unusable can be found (and rebuild statements generated) using the following query:

SELECT 'alter index '||index_name||' rebuild partition '||partition_name||';'
  FROM dba_ind_partitions
 WHERE status = 'UNUSABLE';
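Alternatively, and analogous to the MODIFY PARTITION statement that disabled them, the unusable local index partitions of one table partition can be rebuilt in a single statement (a sketch; verify the syntax on your own version):

```sql
-- Rebuild all unusable local index partitions for the partition
-- holding monthcode 201205
ALTER TABLE autopartition
   MODIFY PARTITION FOR (201205) REBUILD UNUSABLE LOCAL INDEXES;
```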



Global Temporary Tables


Triggered by the huge amount of archive files generated by several databases each night, we have been looking for a solution to reduce this dramatically.
This solution exists !!
And it is called “Global Temporary Tables”.

The definition:

Applications often use some form of temporary data store for processes that are too complicated to complete in a single pass. Often, these temporary stores are defined as database tables or PL/SQL tables. As of Oracle 8i, the maintenance and management of temporary tables can be delegated to the server by using Global Temporary Tables.
The data in a global temporary table is private, such that data inserted by a session can only be accessed by that session. The session-specific rows in a global temporary table can be preserved for the whole session, or just for the current transaction.
Finally, the data stored in global temporary tables generates NO REDO (for the table data itself) and thus reduces the amount of ARCHIVE FILES.

The usage

Above, three words are marked in bold. These words indicate the behavior of a temporary table. Although a global temporary table can exist in a global schema, like EHDA_KRG, all data inserted into that table is session-specific. This means that if more than one user session is inserting/updating data in a global temporary table, only the data of that user’s session is affected. So, although you’re working in the same temporary table, you’re working on your own data. This makes the table private.
Furthermore, there are two major differences in behavior. By default, the table only keeps the data inserted/updated until the transaction is finished/committed. This is known as “ON COMMIT DELETE ROWS”.
The other behavior keeps all rows/changes in the table until the end of your session. This behavior is known as “ON COMMIT PRESERVE ROWS”. If you want the temporary table to behave this way, you have to include this clause in the table create statement.

The creation of a global temporary table, that preserves the data until the session is ended, looks like:

CREATE GLOBAL TEMPORARY TABLE .......
ON COMMIT PRESERVE ROWS
AS
   SELECT * FROM .....

or

CREATE GLOBAL TEMPORARY TABLE
( )
ON COMMIT PRESERVE ROWS ;

And the creation of a global temporary table, that preserves the data only within a transaction, looks like:

CREATE GLOBAL TEMPORARY TABLE
ON COMMIT DELETE ROWS
AS
   SELECT * FROM .....
or

CREATE GLOBAL TEMPORARY TABLE
( )
ON COMMIT DELETE ROWS ;
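As a concrete sketch of both variants (table and column names are made up for illustration):

```sql
-- Session-scoped scratch table: rows survive a COMMIT,
-- but disappear when the session ends
CREATE GLOBAL TEMPORARY TABLE tmp_load_results
(
   id       NUMBER
,  status   VARCHAR2 (20)
)
ON COMMIT PRESERVE ROWS;

-- Transaction-scoped variant: rows disappear at COMMIT/ROLLBACK
CREATE GLOBAL TEMPORARY TABLE tmp_txn_work
(
   id       NUMBER
,  status   VARCHAR2 (20)
)
ON COMMIT DELETE ROWS;
```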

Some other features:
  • If the TRUNCATE statement is issued against a temporary table, only the session-specific data is truncated. There is no effect on the data of other sessions.
  • Data in temporary tables is stored in temp segments in the temp tablespace.
  • Data in temporary tables is automatically deleted at the end of the database session, even if it ends abnormally.
  • Indexes can be created on temporary tables. The content and the scope of the index are the same as those of the session's data. Indexes can, however, only be created on empty tables !!
  • Views can be created against temporary tables and combinations of temporary and permanent tables.
  • Temporary tables can have triggers associated with them.
  • Export and Import utilities can be used to transfer the table definitions, but no data rows are processed.
  • There are a number of restrictions related to temporary tables but these are version specific.
  • A temporary table can only be dropped if all users that have data in it have issued the TRUNCATE TABLE command against it, or have disconnected their session. Otherwise an ORA-14452 is generated.



Friday, April 6, 2012

Oracle Manual Standby (no Dataguard )

This entry is outdated. A revised solution can be found here:
http://managingoracle.blogspot.nl/2013/06/oracle-manual-standby-revised-no_13.html

A new personal challenge. How to create a manual standby database using Oracle11gR2 Standard Edition.

As everybody might know, the official Standby database solution is Oracle Dataguard, only available however as an option with Oracle's Enterprise Edition of the database software. Since we have a lot of Standard Edition databases and we didn't want to pay the license fee for Enterprise Edition and the Dataguard option, I went on searching myself for a solution.

With Dataguard, a primary and a secondary (standby) database exist. These database servers know each other as FAL_SERVER and FAL_CLIENT. The FAL_SERVER is the primary, serving archivelog information to the standby server, which in turn 'replays' this information onto its mounted database, thus syncing the database with the primary.
Because this is an automated Dataguard process, completely covered by the so-called Dataguard broker, there is no need for any crontab entry whatsoever to get this working.

In our case, where I don't want to use Dataguard, there is no broker process responsible for transferring the archivelog information and replaying it on the standby database. I had to find a way to get this done manually.
Now, the actual method is much the same. Without Dataguard, the only way to achieve a similar result is by using a continuous process of database recovery.
So, simply said: put the standby database in mount mode, get the archivelog files from the primary, and recover the database up to the last archivelog available. This process can be repeated as often as necessary.

One remark has to be made up front:
A manual standby database always lags behind the primary database. Whereas Dataguard standby databases are usually in sync up to the last transaction made, this manual standby is in sync only up to the beginning of the last archivelog received.

So, there is always a gap to overcome, in case the standby database needs to become active. Depending on the scenario of the disaster, we need to cope with having lost database information of the size of the last archivelog and the active redo-log, or we need to replay the last archivelogs and possibly the redo-logs from primary to get the database completely in sync. As said, this is totally depending on the situation of the disaster.


What did I use for this project:
  • Two Redhat 6 Linux virtual machines, both set up with Oracle's VirtualBox. Both VMs have 1Gb of memory and 30Gb of diskspace.
  • I named the VM's "dgmaster" for the primary and "dgslave" for the standby host. In this blog I will refer to them as such.
  • Both VM's are identically installed, with Oracle 11.2.0.3 SE, with the exact file system naming.
  • On dgmaster a database was configured, called ORCL.
  • On dgslave, an offline copy of dgmaster's ORCL database was initially restored/copied and brought up into mounted state. This of course brings the need of transferring the init.ora / spfile, and the creation of a password file.

The goal was to set up a fully automated process that could run from the crontab of the standby server "dgslave". To get this working we need the following setup:
  • SSH equivalency between the two Oracle users on both nodes. The oracle user must be able to transfer archive files from dgmaster to dgslave, without the need of entering a password. Information about how to setup SSH equivalence can be found here:  http://managingoracle.blogspot.com/2010/06/ssh-equivalence-on-rac.html
  • Determine a way of automatic transfer of archivelogs and also determine the time to recover to, related to the last retrieved archivelog. I selected to use the 'first_time#' of the last archivelog received as the point to recover to. This makes sure that every time the recovery will succeed, as all needed information is available. 
  • The master ( primary ) database is up and running in 'open' state. It is e.g. in use as production database.
  • The standby ( secondary ) database is in mounted state, so it is closed.
  • The primary database has archive logging enabled. The log_archive_format of the primary is copied to the spfile of the standby database. This makes sure that dgslave can find the files it has retrieved, when starting the recovery.
  • Make sure the primary database has forced logging enabled. This is necessary to really log all changes in the redo files, as all changes need to be sent to the standby server. This can be configured using the statement:
    • ALTER DATABASE FORCE LOGGING;
I'm sure every DBA can set up two virtual machines running an Oracle installation. I'm also sure that they will succeed in setting up SSH equivalence, so when everything is set up, it is time to discuss the steps that need to be done over and over again.
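Whether archiving and forced logging are actually active on the primary can be verified with a simple dictionary query (run on dgmaster):

```sql
-- Should show ARCHIVELOG and YES when configured as described
SELECT log_mode, force_logging
  FROM v$database;
```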

1) Dgmaster executes a forced log-switch
2) Dgmaster determines the 'first_time#' entry of the archivelog file written with this logswitch and sends it to dgslave.
3) Dgslave fetches all missing archivelogs from the master 
4) With the retrieved 'first_time#' entry, the recovery of this mounted database can continue another step. 

To make this run automatically, dgslave takes the initiative. Dgslave runs the necessary scripts on dgmaster by means of an SSH connection. This also makes it possible to retrieve information sent back by dgmaster.
Below, the script used on dgmaster is shown. In my test system it resides in the home directory of the oracle user. You can put it anywhere you like on dgmaster, but then make sure to adjust that location in the script on dgslave:


#
# Force a logswitch to get the last archivelog to the standby host
#
ORACLE_SID=ORCL
ORAENV_ASK=NO
. oraenv >/dev/null 2>&1


SwitchLogfile()
{
  #
  # Do logswitch
  #
  RESULT=`echo "Alter system switch logfile;" | sqlplus -S / as sysdba | grep 'System altered'`
  if [ "$RESULT" = "System altered." ]
  then
    RETURN=1
  else
    RETURN=0
  fi
  # Do we need to do something with this return value?
  export RETURN
}


GetArchiveTime()
{
  CURYEAR=`date +%Y`
  echo "set heading off;" > temp.sql
  echo "set termout off;" >> temp.sql
  echo "select to_char(first_time,'YYYY-MM-DD HH24:MI:SS') from v\$archived_log where sequence#=(select sequence# - 1 from v\$log where status='CURRENT');" >> temp.sql
  sqlplus -S / as sysdba <<EOF
spool tempres.txt
@temp.sql
quit
EOF

  cat tempres.txt | grep ${CURYEAR} | grep -v grep | awk '{print $1" "$2}'

  #rm -f temp.sql tempres.txt
}


SwitchLogfile
GetArchiveTime

In this script, two functions can be recognized. The first is responsible for executing a log switch, making the database write the current redo-log file into an archived logfile.
The second function determines the 'first_time#' timestamp of this archived logfile. This is the timestamp of the first database change recorded in the file. One can see that here the gap is created which causes the standby database to lag behind.
The above script is never executed by itself on dgmaster. It is always executed by a remote call from dgslave.
The script used on dgslave looks like below:


# Recover a manual standby database to the current moment
#
#
export ORACLE_SID=ORCL
export ORACLE_USER=oracle
export ARCHPATH=/opt/mandg/ar01
#
# The hosts
#
export MASTER=dgmaster
export SLAVE=dgslave
#
# The current date and time
#
CURDATE=`date +%Y-%m-%d`
CURTIME=`date +%H:%M:%S`

ForceRemoteLogSwitch()
{
  # This function calls a remote script that forces a log-switch and returns
  # the 'first_time' of the subsequently written archive-file
  #
  ARCHTIME=`ssh ${MASTER} /opt/mandg/ora/home/RecoverDbSwitchLogfile.sh`
  export ARCHTIME
}

GetArchives()
{
 #
 # Copy archives from the remote host archive directory to the local directory
 #
 for remfile in `ssh ${MASTER} "ls ${ARCHPATH}/*arc"`
 do
   locfile=${ARCHPATH}/`echo ${remfile} | cut -d'/' -f5`
   if [ ! -f ${locfile} ]
   then
      scp ${MASTER}:${remfile} ${ARCHPATH}
   fi
 done
}

CreateSqlRecoverScript()
{
  RECOVERTIME=$1
  echo "connect / as sysdba" > Recover${ORACLE_SID}.sql
  echo "shutdown abort;" >> Recover${ORACLE_SID}.sql
  echo "startup mount;" >> Recover${ORACLE_SID}.sql
  echo "recover database until time '${RECOVERTIME}' using backup controlfile;"  >> Recover${ORACLE_SID}.sql
  echo "auto" >> Recover${ORACLE_SID}.sql
  echo "quit;"  >> Recover${ORACLE_SID}.sql 
}


#
# Main
#
ARCHTIME=FALSE

ForceRemoteLogSwitch

echo ${ARCHTIME}
if [ ! "$ARCHTIME" = "FALSE" ]
then
#
# Logswitch Succeeded, get the Archives now and recover
#
  GetArchives
  CreateSqlRecoverScript "${ARCHTIME}"
  sqlplus /nolog <<EOF
@Recover${ORACLE_SID}.sql
quit
EOF
fi

The script on dgslave has three main functions. The function ForceRemoteLogSwitch takes care of the remote logswitch. As we assign the output of the remote call to the variable ARCHTIME, we can use it later on for the recovery process.


The second function, GetArchives, is responsible for transferring the needed archives from dgmaster to dgslave. As both VMs are configured identically, the ARCHPATH on both servers is the same. This function only copies archived logfiles from dgmaster to dgslave that have not already been copied before.

The third function, CreateSqlRecoverScript, creates a local script containing the recovery statements. As it is in fact unknown how the last recovery ended, we start the recovery with a 'shutdown abort', making sure we start the recovery in a correctly mounted database. You can recognize here the usage of the retrieved ARCHTIME, called RECOVERTIME here, as it is passed to this function as argument $1.
Also seen here is the usage of the 'auto' recover option. This makes the recovery non-interactive, but it needs a correctly configured log_archive_dest pointing to the location where the archive files have been put.

Finally the Main part of this script, calls all above functions and finally starts an SQLplus session, with a call to the created Recoveryscript.
That's it. Nothing really special or complex. It works like a charm!!

Some additional remarks:
As we are using a database recovery method, the process cannot handle the creation of new tablespaces and/or new datafiles (added to an existing tablespace).
When we, for example, execute the following statement on the primary dgmaster:

alter tablespace example add datafile '/opt/mandg/db01/example02.dbf' size 200m;

it will result on dgslave, during the next recovery step, in an

ORA-01244: unnamed datafile(s) added to control file by media recovery
ORA-01110: datafile file 7: '/opt/mandg/db01/example02.dbf'

This is also true when e.g. creating a new tablespace. This can be solved using the following steps:

Look up the name of the datafile that was generated by the recovery process. It can be identified by the fact that it resides within $ORACLE_HOME/dbs and is called UNNAMEDnnnnn, where nnnnn is a number.
Once the name is found, use the statement below to recreate the correct datafile:

alter database create datafile
   '/opt/mandg/ora/11.2.0.3/dbs/UNNAMED00007'
   as '/opt/mandg/db01/example02.dbf';

In this example /opt/mandg/ora/11.2.0.3 is the ORACLE_HOME on the VM's we created.
After that the automated recovery can continue.