Monday, October 22, 2018

Exadata X7 -- Diagnostics iso, M.2 SSD Flash devices and the Secure Erase utility

After a while, I'm back to writing about Exadata.

As we are doing lots of migration projects, we are dealing with Exadata most of the time, and actually maybe this is the reason why I'm writing less than before :)

Anyways, this post will be about 3 things actually.

1) The Secure Erase utility, which is used to securely erase all the information on the Exadata servers.

2) The diagnostics iso, which is a tool used to boot the Exadata nodes to diagnose serious problems, when no other way exists to analyze the system due to damage on it.

3) The M.2 SSD devices in Exadata X7, which may be used for system boot and rescue functions.

As I mostly do, I will explain these 3 things by going through a real-life story.

Recently, my team needed to erase an Exadata X7-2 1/4 machine after a POC.

We needed to use the Secure Erase utility and we needed to delete the data using the 3pass method. (note that there are also crypto and 7pass methods..)

According to the documentation, we needed to download the Secure Erase ISO and boot the nodes using PXE or USB. (note that booting an Exadata X7 server using PXE boot is not that easy -- because of UEFI..)

While trying to boot the Exadata X7 cells, we first encountered an error. (Invalid signature detected, check secure boot policy in setup)

This was actually expected behaviour, as the Exadata documentation was lacking the information for PXE booting a UEFI system.

At this point, we actually knew what we needed to do..

The solution was actually documented in another Oracle document.


However, we didn't have the time to implement that. So we just disabled Secure Boot in the BIOS and rebooted the nodes.

Well.. After this move, the cell nodes couldn't boot normally and we found ourselves in the diagnostics iso shell :)

This diagnostics shell was a result of the automatic boot that is done using the diagnostics iso residing on the M.2 flash SSD devices..


Note that, in Exadata X7, we don't have internal USB devices anymore. USB devices were replaced by M.2 flash SSD devices.. So we have 2 M.2 flash devices for recovery purposes in Exadata X7 cells.


Well.. We logged in to the diag shell using root/sos1exadata and we found that there is a Secure Erase utility inside /usr/sbin :)

So we got ourselves our erasing utility without actually doing anything :)

We booted the cells one by one and started deleting the data on them, using secureeraser with the 3pass method ->

/usr/sbin/secureeraser --erase --all --hdd_erasure_method 3pass --flash_erasure_method 3pass

Note that 3pass takes a long time.. (and it directly depends on the sizes of the disks)
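For reference, here is a minimal sketch of how we ran it from the diag shell on a cell. The nohup/log redirection part is just my own convenience (not something the documentation mandates), added because a 3pass run on large disks spans days and you don't want it tied to your ssh session;

# from the diagnostics shell on the cell (logged in as root/sos1exadata)
# first verify that the utility is really there
ls -l /usr/sbin/secureeraser

# start the 3pass erase in the background and keep a log (nohup + log file are my own addition)
nohup /usr/sbin/secureeraser --erase --all \
      --hdd_erasure_method 3pass \
      --flash_erasure_method 3pass > /tmp/secureeraser_`hostname`.log 2>&1 &

# follow the progress from the log file
tail -f /tmp/secureeraser_`hostname`.log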

So far so good.

We were deleting the data on cells, but what about the Compute nodes?

Compute nodes don't have such a diag shell present, so we needed to boot them with an external USB and execute the Secure Eraser through that external USB, as explained in the "Exadata Database Machine Security Guide".

At the end of the day, we have seen/learned 4 things ->

1) Secure eraser is present in the diag iso that comes with the M.2 devices in Cells.

2) Secure eraser's 3pass erasure method takes a really long time. (2-3 days maybe)

3) Oracle documentation in MOS is lacking the information on how to boot a UEFI system (Exadata X7) with PXE. That's why people keep saying that X7 cannot boot with PXE.. Actually, that's wrong.

4) Each Exadata X7 cell comes with 2 x M.2 SSD Flash devices (each 150 GB) for rescue operations. (No USBs anymore)

RDBMS -- XTTS (+incremental rman backups) -- how to do it when the source tablespaces are read only?

In my previous post, I mentioned that the method explained in MOS document " 11G - Reduce Transportable Tablespace Downtime using Cross Platform Incremental Backup (Doc ID 1389592.1)" requires the source tablespaces to be in READ-WRITE mode.

The xttdriver.pl script, which is at the core of the XTTS method powered by incremental backups, just checks if there are any tablespaces in READ ONLY or OFFLINE mode..

What if our source tablespaces are in READ ONLY mode? --> According to that note, the rule is simple -> if the tablespace is read only, use traditional TTS.

Why? Because it is a requirement of the process described in Note 1389592.1.. This requirement is checked inside xttdriver.pl, and that's why we cannot execute this XTTS (+incremental backup) method for read-only tablespaces.

I still don't understand this requirement, because there seems to be no technical impossibility here.
Anyways; in my opinion, Oracle wants us to use this XTTS (+incremental backup) method when it is really required..

But there are scenarios where a read-only environment needs to be migrated using the XTTS (+incremental backup) method.

One of these scenarios is an Active Dataguard environment, and the other one is described in the following sentence ->

During a migration process, a source environment can be read-only at t0 (time 0), then can be taken into read-write mode at t1 (time 1), and then can be taken back to read-only at t3 (time 3).

So far so good.

What if we want to use the XTTS (+incremental backup) method for read-only tablespaces?
Then my answer is: use/try the manual method.

As I already wrote in my previous blog post, XTTS-based conversion is done using the sys.dbms_backup_restore package.

xttdriver.pl uses it, and we can manually execute it too!

Moreover, technically, if we use that manual method, we do *not* need to make the source tablespaces read-write.

In the following url, you can see how this is done manually -> https://connor-mcdonald.com/2015/06/06/cross-platform-database-migration/

Ref: https://connor-mcdonald.com

Although I haven't tested it yet, I believe we can migrate our read-only source tablespaces using a manual XTTS (+incremental backup) approach as described in the blog post above. (**this should be tested)
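Just to illustrate the idea (again untested -- a sketch under assumptions, not the official procedure): the level-0 datafile copies taken from the read-only source can be converted on the destination with a plain RMAN CONVERT, and the incrementals can then be converted with the dbms_backup_restore call shown in my previous post. The platform name and the paths below are hypothetical;

# destination side: convert a copied level-0 datafile to the local endian format
rman target / <<'EOF'
CONVERT DATAFILE '/staging/ts_data_01.dbf'
  FROM PLATFORM 'AIX-Based Systems (64-bit)'
  FORMAT '/oradata/MIGDB/ts_data_01.dbf';
EOF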

There is one more question.. What about using an Active Dataguard environment as the source for an XTTS (+incremental backup) based migration?

Well.. My answer for this question is the same as above. I believe it can be done.. However, it should be tested well, because Oracle clearly states that -> "It is not supported to execute this procedure against a standby or snapshot standby databases".. (**this should be tested)

Thursday, September 27, 2018

RDBMS -- XTTS (+incremental backups) from Active Dataguard / not supported! / not working! & how do the XTTS scripts do the endian conversions of incremental backups?

Hi all,

I just want to highlight something important.

That is;
If we want to use the XTTS (Cross Platform Transportable Tablespace) method and reduce its downtime with rman incremental backups, our source database can't be a standby.. It also can't be an Active Dataguard environment.

This is because the script which is at the core of the XTTS method powered by incremental backups just checks if there are any tablespaces in READ ONLY or OFFLINE mode. Actually, it wants all the tablespaces that are to be migrated to be in READ WRITE mode.

If the script finds a READ ONLY tablespace, it just raises an error ->

RAISE_APPLICATION_ERROR(-20001, 'TABLESPACE(S) IS READONLY OR,
OFFLINE JUST CONVERT, COPY');

I just want to highlight this, as you may be planning to use an Active Dataguard environment as the source database in an XTTS (+incremental backups) based migration project.. As you know, in Active Dataguard we have read-only tablespaces, so this might be an issue for you.

Anyways, I was actually also curious about this READ-WRITE requirement of XTTS (+incremental backups), and yesterday I jumped into the XTTS scripts.

Unfortunately, I couldn't find anything about it there.. I still couldn't answer the question: why? Why does the XTTS (+incremental) method require the source tablespaces to be in READ-WRITE mode?

However, the perl script named xttdriver.pl just checks it. I couldn't find any clue (no comments in the scripts, no documentation, nothing on the web) about this requirement, but look what I have found :)

->

In 12C, RMAN has the capability to convert backups.. In 12C, rman can convert backups even cross-platform, and XTTS actually uses this rman capability to convert the incremental backups from the source to the target platform..

So if your database is 12C, XTTS (those scripts, I mean) uses the "backup for transport" and "restore from platform" syntax of rman to convert your backups.
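Just to give an idea of what that 12C syntax looks like, here is a hedged sketch -- the tablespace name, paths and platform are made up, and the exact clause order may differ slightly between versions, so cross-check with the RMAN reference / the MOS note for your version;

# source side (12C): cross-platform incremental backup of an open tablespace
rman target / <<'EOF'
BACKUP FOR TRANSPORT ALLOW INCONSISTENT
  INCREMENTAL LEVEL 1
  TABLESPACE users
  FORMAT '/backup/xtts_incr_%U.bkp';
EOF

# destination side (12C): the backup piece is then applied to the foreign datafile copy
# with the "from platform" form, roughly like this:
rman target / <<'EOF'
RECOVER FROM PLATFORM 'Solaris[tm] OE (64-bit)'
  FOREIGN DATAFILECOPY '/oradata/MIGDB/users01.dbf'
  FROM BACKUPSET '/backup/xtts_incr_piece.bkp';
EOF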

Of course, if your database is 11G, then those rman commands are not available..
So what XTTS does for converting your backups in 11G environments is using the "sys.dbms_backup_restore" package..

XTTS uses it in a form similar to the following to convert the incremental backups to the target platform:

DECLARE
  -- local variables for the OUT parameters (these declarations are my addition, to make the sketch complete)
  handle   VARCHAR2(512);
  media    VARCHAR2(80);
  comment  VARCHAR2(80);
  concur   BOOLEAN;
  recid    NUMBER;
  stamp    NUMBER;
BEGIN
  sys.dbms_backup_restore.backupBackupPiece(
    bpname => '&&1',
    fname => '&&2/xtts_incr_backup',
    handle => handle, media => media, comment => comment,
    concur => concur, recid => recid, stamp => stamp, check_logical => FALSE,
    copyno => 1, deffmt => 0, copy_recid => 0, copy_stamp => 0,
    npieces => 1, dest => 0,
    pltfrmfr => &&3); -- attention here, it gets the source platform id
EXCEPTION
  WHEN OTHERS
  THEN
    DBMS_OUTPUT.put_line ('ERROR IN CONVERSION ' || SQLERRM);
END;
/

Also, XTTS applies those converted backups to the datafiles using "sys.dbms_backup_restore.applyDatafileTo", "sys.dbms_backup_restore.restoreSetPiece" and "sys.dbms_backup_restore.restoreBackupPiece".

So, it is still not answered why XTTS (+incremental backups) needs the tablespaces on the source to be in READ-WRITE mode, but it is certain that what the XTTS method does is not magic :)

I mean, the XTTS scripts just do a very good orchestration.. The perl scripts used in XTTS don't do the conversion using perl's own capabilities. (note that endian conversion can also be done with perl functions inside perl..)
It is actually a good thing though. I mean, if the XTTS scripts used perl itself to convert the files, it would be more complicated, right?

Anyways, this made me think that this XTTS-related conversion can even be done manually, by executing the necessary rman commands, making the necessary dbms_backup_restore calls and using exp/imp.. However, it would be a little bit complex, and there would be a support issue with that :)

Well.. That's it.. I just wanted to share this little piece of information, as I found it interesting.

One more thing, before finishing this :) -> 

I must admit that rman's convert capability in 12c seems very handy.. So being on 12C is not only good because it is an up-to-date release (fixed bugs etc..), but it is also good for easing the migration approaches.

One last thing; the XTTS method doesn't support compressed backupsets. So the backups used in XTTS must not be compressed backups.. (else you get -> ORA-19994, "cross-platform backup of compressed backups to different endianess is not supported")

I will revisit this blog post if I find the answer to the question -> why does XTTS (+incremental) require the source tablespaces to be in READ-WRITE mode? Why is there such a restriction? What about you? If you have an idea, please comment.

Sunday, September 16, 2018

EBS R12 (12.1) -- interesting behaviour of adpatch -- HOTPATCH Error -> "You must be in Maintenance Mode to apply patches"

There is an interesting behaviour of adpatch that I wanted to share with you.
This behaviour was observed in an EBS 12.1 environment, during a hot-patching attempt.
What I mean by this interesting behaviour is actually the exception that adpatch throws during an ordinary hotpatching session; I mean the error that adpatch returned -> "You must be in Maintenance Mode to apply patches"..

As you already know, in EBS 12.1 we can apply patches without enabling maintenance mode.
All we have to do is take the risk :) and execute the adpatch command with the options=hotpatch argument.
This is a very clear thing that you already know. But what if we try to apply a regular patch (non-hotpatch), it fails, and then we try to apply our hotpatch?

As you may guess, adpatch will ask us the question "Do you wish to continue with your previous AutoPatch session [Yes] ?"
So if we answer Yes, and if our previous patch attempt wasn't a hotpatch one (I mean, if the previous patch was attempted without the options=hotpatch argument), then the "options=hotpatch" will make adpatch confused.

At this point, adpatch will basically say: "you are trying to apply a patch with options=hotpatch, but you didn't use options=hotpatch in your previous patching attempt. As you wanted to continue with your previous AutoPatch session, I will take the value of the options argument from your previous patching attempt."

Just after saying that, adpatch will check the previous patching attempt and see that the command you used there was plain "adpatch" (no options argument was specified)..
However, now you are supplying "options" as an argument..

At this point, adpatch will replace your options argument with "NoOptionsSpecified", because you didn't use the options argument in your previous patching attempt/session.
So the adpatch command will effectively become "adpatch NoOptionsSpecified".. Weird, right? :) but true.. And I think this is a bug.. adpatch should handle this situation properly, but unfortunately it is not able to do so.. Anyways; I won't go into the details..

Then, adpatch will try to apply the patch in question, and it will see the NoOptionsSpecified.

Then guess what? :)

adpatch will report a warning -> "Ignoring unrecognized option: "NoOptionsSpecified"."

So, it will ignore the NoOptionsSpecified argument (options=hotpatch was already replaced before) and it will just stop and say -> "You must be in Maintenance Mode to apply patches. You can use the AD Administration Utility to set Maintenance Mode."

What is the lesson learned here? :)
-> After a failed regular (non-hotpatch) adpatch session, don't say YES to the question ("Do you wish to continue with your previous AutoPatch session") if you want to apply a hotpatch right after that failed session.

Here is a demo for you ->

[applr12@ermanappsrv  17603319]$ adpatch options=hotpatch
Your previous AutoPatch session did not run to completion.
Do you wish to continue with your previous AutoPatch session [Yes] ?
AutoPatch warning:
The 'options' command-line argument was not specified originally,
but is now set to:
"hotpatch"
AutoPatch will use the original value for 'options'.
AutoPatch warning:
Ignoring unrecognized option: "NoOptionsSpecified".

AutoPatch error:
You must be in Maintenance Mode to apply patches.
You can use the AD Administration Utility to set Maintenance Mode.

Tuesday, August 14, 2018

EBS R12 -- REQAPPRV ORA-24033 error after 12C DB upgrade /rulesets & queues

We encountered ORA-24033 in an EBS 12.1.3 environment.
Actually, this error started to be produced in workflow, just after upgrading the database of this environment from 11gR2 to 12cR1.

The database upgrade (running dbua and other stuff) was done by a different company, so we were not able to check whether it was done properly..
However, we were the ones who needed to solve this issue when it appeared :)

Anyways, the functional team encountered this error while checking the workflows in Workflow Administrator Web Applications -> Status Monitor, and reported it as follows;


ORA-24033 was basically telling us that there was a queue/subscriber problem in the environment, so we started working with the queues, subscribers and the rulesets.

The analysis showed that, we had 1 ruleset and 1 rule missing in this environment..

select * from dba_objects
where object_name like 'WF_DEFERRED_QUEUE%';

The following output was produced in a reference environment, on which workflow REQAPPRV was running without any problems.


The following output, on the other hand; was produced in this problematic environment.


As seen above, we had 1 ruleset named WF_DEFERRED_QUEUE_M$1 and 1 rule named WF_DEFERRED_QUEUE_M$1 missing in this problematic environment..

In addition to that, WF_DEFERRED related rulesets were invalid in this problematic environment.

In order to create (validate) these rulesets, we followed 2 MOS documents and executed our action plan accordingly.

Fixing Invalid Workflow Rule Sets such as WF_DEFERRED_R and Related Errors on Workflow Queues:ORA-24033 (Doc ID 337294.1)
Contracts Clause Pending Approval with Error in Workflow ORA-25455 ORA-25447 ORA-00911 invalid character (Doc ID 1538730.1)

So what we executed in this context was as follows;

declare
l_wf_schema varchar2(200);
lagent sys.aq$_agent;
l_new_queue varchar2(30);

begin
l_wf_schema := wf_core.translate('WF_SCHEMA');
l_new_queue := l_wf_schema||'.WF_DEFERRED';
lagent := sys.aq$_agent('WF_DEFERRED',null,0);
dbms_aqadm.remove_subscriber(queue_name=>l_new_queue, subscriber=>lagent);
end;
/
commit;

declare
l_wf_schema varchar2(200);
lagent sys.aq$_agent;
l_new_queue varchar2(30);

begin
l_wf_schema := wf_core.translate('WF_SCHEMA');
l_new_queue := l_wf_schema||'.WF_DEFERRED';
lagent := sys.aq$_agent('WF_DEFERRED',null,0);
dbms_aqadm.add_subscriber(queue_name=>l_new_queue, subscriber=>lagent,rule=>'1=1');
end;
/
commit;

declare

lagent sys.aq$_agent;
begin
lagent := sys.aq$_agent('APPS','',0);
dbms_aqadm.add_subscriber(queue_name=>'APPLSYS.WF_DEFERRED_QUEUE_M',
subscriber=>lagent,
rule=>'CORRID like '''||'APPS'||'%''');
end;
/

So what we did was to;

Remove and add back the subscriber/rules to the WF_DEFERRED queue 
+
Add the subscriber and rule back into the WF_DEFERRED_QUEUE_M queue. (if needed, we could remove the subscriber before adding it)

By taking these actions, the ruleset named WF_DEFERRED_QUEUE_M$1 and the rule named WF_DEFERRED_QUEUE_M$1 were automatically created, and this actually fixed the ORA-24033 error in REQAPPRV :)
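By the way, if you want to verify the fix from SQL, a quick check like the following (the same dba_objects query as above, just narrowed down with an object_type filter of my own) should now list both the rule set and the rule;

sqlplus -s / as sysdba <<'EOF'
col object_name format a40
select owner, object_name, object_type, status
from   dba_objects
where  object_name like 'WF_DEFERRED%'
and    object_type in ('RULE','RULE SET');
EOF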

Monday, August 13, 2018

EBS -- MIGRATION // 2 interesting problems & 2 facts -- autoconfig rule (2n-1) & APPL_SERVER_ID in the plan.xml of ebsauth

Recently, we migrated a production EBS from one Exadata to another Exadata. That was an advanced operation, as it involved Oracle Access Manager (OAM), Oracle Internet Directory (OID) and 2 EBS disaster environments.

This was a very critical operation, because it was not tested.. We needed to do this work without any tests and we needed to start working immediately..

The environment was as follows;

PROD : 1 Load Balancer, 2 Apps Nodes, 1 OAM/OID node and 2 Database nodes (Exadata)
-- Parallel Concurrent Processing involved as well..
Local Standby : 1 Apps Node, 2 Database nodes (Exadata)
Remote Standby: 1 Apps Nodes, 2 Database nodes (Exadata)

What we needed to do was to migrate the DB nodes of PROD to the Local Standby.
In order to do this; we followed the action plan below;

Note: actually we did much more than this, but this action plan should give you the idea :) 
  • stopped OAM+OID+EBSAccessGate + Webgate etc..
  • stopped EBS apps services which were running on both of the Prod Apps nodes.
  • Switched over the EBS Prod database to be primary in Local Standby.
  • Reconfigured local standby to be the new primary and configured it as the primary for the remote standby as well.
  • After switching the database over to the standby site, we cleaned up the apps-specific configuration which was stored in the database (fnd_conc_clone.setup_clean)
  • We built context files (adbldxml.pl) and executed autoconfig on the new db nodes. 
  • Once db nodes were configured properly; we manually edited the apps tier context files and executed autoconfig on each of the apps tier nodes. (note that ; apps services were not migrated to any other servers)
  • We started the apps tier services.
  • We reconfigured the workflow mailer (its configuration was overwritten by autoconfig)
  • We logged in locally (without OAM) , checked the OAF , Forms and concurrent managers.
  • Everything was running except the concurrent managers that were configured to run on the second apps node. No matter what we did from the command line and from the concurrent manager administration screens, we couldn't fix it.. There was nothing written in the internal manager log, but the concurrent managers of node 2 could not be started..
    • The first fact : If you have a multi node EBS apps tier, AutoConfig has to be run '2n - 1' times. In other words; for an application which has 'n' number of application nodes, AutoConfig has to be run '2n - 1' times so that the tnsnames.ora file on each node has FNDSM entries for all the other nodes. So, as for the solution, we executed autoconfig once more on the second node, and the problem disappeared.
Reference: AutoConfig Does Not Populate tnsnames.ora With FNDSM Entries For All The Nodes In A Multi-Node Environment (Doc ID 1358073.1)
  • After fixing the conc managers, we continued with the OAM and OID.. We changed the datasource of the SSO (in weblogic) to the new db url and also changed the dbc file there.. Then, we started Access Gate, Webgate, OAM and OID and checked the EBS login using the SSO-enabled url. But the login was throwing http 404..
  • All the components of SSO (OAM, OID and everything) were running.. But only the deployment named ebsauth_prod was stopped, and it could not be started (it was getting errors)
    • The second fact : if you changed the host of the EBS database and your APPL_SERVER_ID has changed, then you need to redeploy ebsauth by modifying its Plan.xml with the new APPL_SERVER_ID. Actually, you have 2 choices; 1) Set the app_APPL_SERVER_ID to a valid value in the Plan.xml file for the AccessGate deployment and then restart the EAG Servers. The Plan.xml file location is specified on the Overview Tab for the AccessGate Deployment within the Weblogic Console where AccessGate is deployed. 2) Undeploy and redeploy AccessGate. (a hedged check for this is sketched after this list)
Reference: EBS Users Unable To Sign In using SSO After Upgrading To EBSAccessGate 1234 With OAM 11GR2PS2 (Doc ID 2013855.1)
  • Well, after this move, the SSO-enabled EBS login started to work as well. The operation was completed, and we deserved a good night's sleep :)
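By the way, a quick way to see whether you are hitting this second issue is comparing the APPL_SERVER_ID known to EBS with the one hard-coded in the AccessGate Plan.xml. The sketch below assumes a sourced apps environment; the Plan.xml path is hypothetical (take the real one from the deployment's Overview tab in the Weblogic Console);

# APPL_SERVER_ID according to EBS (dbc file under $FND_SECURE)
grep APPL_SERVER_ID $FND_SECURE/*.dbc

# APPL_SERVER_ID hard-coded in the AccessGate deployment plan (path is illustrative)
grep -i APPL_SERVER_ID /u01/oam/plans/ebsauth_prod/Plan.xml

# if the two values differ, update Plan.xml (or undeploy/redeploy AccessGate) as described above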

Saturday, August 4, 2018

Oracle VM Server -- Guest VM in blocked state, VM console connection(VNC), Linux boot (init=/bin/bash)

This is an interesting one.. It involves an interesting way of booting Linux, and dealing with Oracle VM Server and its hypervisor.

Last week, after a power failure in a critical datacenter, one of the production EBS application nodes couldn't be started.. That EBS application node was a VM running on an Oracle VM Server, and although the Oracle VM Server itself could be started without any problem, that application node couldn't.

As I like to administer Oracle VM Server using xm commands, I directly jumped into the Oracle VM Server by connecting to it using ssh (as root).

The repositories were there.. They were all accessible, and the xm list command was showing that EBS node, but its state was "b".. (blocked)

I restarted the EBS Guest VM a couple of times, but it didn't help.. The EBS Guest VM was going into the blocked state just after starting.

The customer was afraid of this, as the status "blocked" didn't sound good...

However, the fact is that it is normal for a Guest VM to be in blocked status if it isn't doing anything, or let's say, if it has nothing actively running on the CPU.

This fact made me think that there should be a problem during the boot process of this EBS Guest VM.

The OS installed on this VM was Oracle Linux, and I thought that probably Oracle Linux wasn't doing anything during its boot process.. Maybe it was asking for something during the boot, or maybe it was waiting for an input..

In order to understand that, we needed to have a console connection to this EBS Guest VM..

To have a console connection, I modified the vm.cfg of this EBS Guest VM -- actually, I added VNC-specific parameters to it.

Note that, in Oracle VM Server we can use VNC to connect to the Guest machines even during their boot process.
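For reference, the line I added was similar to the following (this is for a PV guest; the listen address and password are illustrative, and HVM guests use the vnc=1 / vnclisten form instead);

# appended to /OVS/Repositories/<repo>/VirtualMachines/<vm>/vm.cfg -- path and values are illustrative
vfb = ['type=vnc,vncunused=1,vnclisten=0.0.0.0,vncpasswd=welcome1']

# after restarting the guest with xm, find the assigned VNC port with something like:
#   xm list -l <vm> | grep -i vnc
# and connect a VNC client to the OVM Server host on that port (59xx)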

After modifying the vm.cfg file of the EBS Guest VM, I restarted the guest machine using xm commands and directly connected to its console using VNC.

I started to watch the Linux boot process of this EBS Guest VM, and then I saw it stop..
It stopped because it was reporting a filesystem corruption and asking us to run fsck manually..

So far so good.. It was as I expected..

Oracle Linux was normally asking for the root password in order to give us a terminal for running fsck manually. However, we just couldn't get the password.

So we were stuck..

We tried to ignore the fsck message of Oracle Linux, but then it couldn't boot..

We needed to find a way.

At that time, I put my Linux admin hat on and did the following;

During the boot, I opened the GRUB (GRand Unified Bootloader) menu.
Selected the appropriate boot entry (the uek kernel in our case) in the GRUB menu and pressed e to edit it.
Selected the kernel line and pressed e again to edit it.
Appended init=/bin/bash at the end of the line.
Booted it.

By using the init=/bin/bash, I basically told the Linux kernel to run /bin/bash as init, rather than the system init.

As you may guess, by using init=/bin/bash, I booted the Linux and obtained a terminal without supplying the root password.

After this point, running fsck was a piece of cake :)
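For the curious, it looked roughly like this (the kernel version and LVM device names below are made up; yours will obviously differ);

# the edited kernel line in GRUB, with init=/bin/bash appended at the end
kernel /vmlinuz-2.6.39-400.17.1.el6uek.x86_64 ro root=/dev/VolGroup00/LogVol00 init=/bin/bash

# ...and from the bash shell we landed in (root is still mounted read-only at this point):
fsck -y /dev/VolGroup00/LogVol00     # the root filesystem that was reported as corrupted
fsck -y /dev/VolGroup00/LogVol01     # repeated for the other affected filesystems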

So I executed fsck for the root filesystem, and actually for the other ones as well.. Repaired all of them and rebooted Linux once again..

This time, the Linux OS of that virtualized EBS application node booted perfectly, and the EBS application services on it could be started without any problems..

It was a stressful piece of work, but it gave me this interesting story :)

Thursday, July 26, 2018

Exadata -- Image & GRID 12.2 upgrade

You may remember my article on upgrading Exadata software versions. ->

Exadata Patching-- Upgrading Exadata Software versions / Image upgrade

This time, I'm extending this upgrade-related topic.
So, in this post, I'm writing about an Exadata Image upgrade + a 12.2 GRID infrastructure upgrade.

Well... Recently we needed to upgrade Exadata software and GRID infrastructure versions of an Exadata environment.

We divided this work into 2 parts. First we upgraded Exadata images and then we upgraded the GRID version.

Both of these upgrades were rolling upgrades, so the databases remained up and running during these upgrade activities.

Let's take a look at how we do these upgrades.

Exadata Image Upgrades:

We upgraded the image version of a production Exadata environment from 12.1.2.1.2.150617.1 to 12.2.1.1.4.171128. We did this work by executing the 3 main phases given below;
  • Analysis and gathering info about the environment.
  • Pre-check
  • Upgrading the Images in order of ->
    • Exadata Storage Servers(Cell nodes) 
    • Infiniband Switches
    • Compute Nodes (Database nodes)
So, we execute the 3 main phases above, and while executing them, we actually take the following 8 actions;

1) Gathering info and checking the current environment :

Image info, DB Home & GRID Home patch levels (opatch lsinventory outputs), SSH equivalency check, ASM diskgroup repair time check, NFS shares, crontab outputs, .bash_profile contents, spfile/pfile backups, controlfile traces

Approx. duration : 3 hours (done before the operation day)
2) Running the Exachk:

Downloading the up-to-date exachk and running it with the -a argument.
After running the exachk -> analyzing its output and taking the necessary actions if there are any.

Approx. duration : 2 hours (done before the operation day) 

3) Downloading the new Exadata images and uploading them to the nodes.

Approx. duration : 2 hours (done before the operation day)

4) Creating the necessary group files for the Patchmgr . (cell_group, dbs_group, ibswitches.lst)

Approx. duration : 0.5 hours (done before the operation day)

5) Running the Patchmgr precheck. After analyzing its output -> taking the necessary actions (if there are any). For example: if there are 3rd party rpms, we may decide to remove them manually before the upgrade.

Approx. duration : 0.5 hours (done before the operation day)

6) Running Patchmgr and upgrading the images. (we do the upgrade in rolling mode; a hedged command sketch is given after the order list below)

Before running patchmgr, we kill all the ILOM sessions.. (active ILOM sessions may increase the duration of the upgrade)

Note: Upgrade is done in the following order;

Exadata Storage Servers(Cell nodes)  (1 hour per node)
Infiniband Switches (1 hour per switch )
Compute Nodes (Database nodes) ( 1.5 hours per node)
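As promised above, here is a rough sketch of the patchmgr invocations (run from the unzipped patch directories on the driving node, as root). Flags can differ between patchmgr releases, and the iso path / target version below are just examples, so always cross-check with the patch README;

# storage cells: precheck, then rolling patch
./patchmgr -cells cell_group -patch_check_prereq -rolling
./patchmgr -cells cell_group -patch -rolling

# infiniband switches
./patchmgr -ibswitches ibswitches.lst -upgrade

# compute (database) nodes: precheck, then rolling upgrade (iso path and version are illustrative)
./patchmgr -dbnodes dbs_group -precheck -iso_repo /u01/stage/dbnode_update.zip -target_version 12.2.1.1.4.171128
./patchmgr -dbnodes dbs_group -upgrade -iso_repo /u01/stage/dbnode_update.zip -target_version 12.2.1.1.4.171128 -rolling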
  
7) As the post-upgrade actions: reconfiguring NFS & crontabs. Also reinstalling the 3rd party rpms (if they were removed before the upgrade)

Approx. duration : 0.5 hours

8) Post check: checking the databases, their connectivity and their alert log files..
Note that we also run exachk once again and analyze its output to ensure that everything is fine after the Image upgrade.

Approx. duration : 1 hour

GRID 12.2 Upgrade:

As for  the GRID 12.2 upgrade, we basically follow the MOS document below;

"12.2 Grid Infrastructure and Database Upgrade steps for Exadata Database Machine running 11.2.0.3 and later on Oracle Linux (Doc ID 2111010.1)"

First, we analyze our environment in conjunction with the document above, to determine the patches and prerequisite patches required for our environment.

Here is the list of patches that we used during our last GRID 12.2 upgrade work;

GI JAN 2018 RELEASE UPDATE 12.2.0.1.180116 Patch 27100009 
Oracle Database 12c Release 2 Grid Infrastructure (12.2.0.1.0) for Linux x86-64 V840012-01.zip 
OPatch 12.2.0.1.0 for Linux x86-64 Patch 6880880 
Opatch 11.2.0.0.0 for Linux x86-64 Patch 6880880 
CSSD : DUPLICATE RESPONSE IN GROUP DATA UPDATE Patch 21255373

Once all the required files/patches are in place, we do the GRID upgrade by following the steps below;
  1. Creating the new GRID Home directories.
  2. Unzipping the new GRID software into the relevant directories.
  3. Unzipping up-to-date opatch and GRID patches.
  4. If needed, configuring the ssh equivalencies.
  5. Running runcluvfy.sh and doing the cluster verification. (In case of an error, we fix the error and rerun it. A hedged example invocation is given after this list.)
  6. Patching our current GRID home with the prereq patches (in our last upgrade work, we needed to apply the patch 21255373)
  7. Increasing the sga_max_size and sga_target values of the ASM instances.
  8. Configuring VNC (we do the actual upgrade using VNC)
  9. Starting the GRID upgrade using the unzipped new GRID Software (on VNC)
  10. Running rootupgrade.sh on all the nodes.
  11. Controlling/Checking the cluster services.
  12. Configuring the ASM compatibility levels.
  13. Lastly, as a post-upgrade step, we add the new GRID home into the inventory.
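As mentioned in step 5, the runcluvfy.sh invocation we use looks roughly like the following (the home paths below reflect a typical layout and are illustrative; adjust them to your own environment);

# run from the new (unzipped) 12.2 GRID home, as the GRID software owner
cd /u01/app/12.2.0.1/grid
./runcluvfy.sh stage -pre crsinst -upgrade -rolling \
    -src_crshome /u01/app/12.1.0.2/grid \
    -dest_crshome /u01/app/12.2.0.1/grid \
    -dest_version 12.2.0.1.0 -fixup -verbose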
As you may guess, the most critical steps in the list above are step 9 and step 10.. (as the actual upgrade is done while executing those steps)

Approx Duration : 4 hours.. (for a 2 node Exadata GRID upgrade)

That's it :) I hope you find this blog post useful :)

Friday, July 20, 2018

Exadata Cloud Machine -- first look, quick info and important facts

Recently, we started an ECM (Exadata Cloud Machine) migration project, or maybe I should say an ECC (Exadata Cloud at Customer) migration project.

This is a big migration project, including migration of the Core Banking databases.
It is a long run, but it is very enjoyable.
We have 2 ECCs to migrate to..

Finally, last week, initial deployment of the machines was completed by Oracle.
This week, we connected to the machines and started to play with them :)

I think I will write several blog posts about these new toys in the coming months, but here is some quick info and some important facts about the ECC environments.

First of all, ECC is an Exadata :) Exadata hardware + Exadata software..

Technically, it is a virtualized Exadata RAC environment in which we (consultants) and customers cannot access the cells, ILOMs, switches, or the hypervisor.

  • It is a Cloud Machine, but it is behind the firewall of the customer.
  • It has a Cloud Control Plane application, a GUI to manage the database services, and this application is hosted in OCC (Oracle Cloud Machine), which can be thought of as the satellite of the ECC.
  • We do lots of stuff using this GUI: Database Service Creation (11.2.0.4, 12c, 18c), Patching, etc..


  • Database service creations and Grid operations are automated. According to the version of the database created using the GUI, the GRID is automatically created (cloud operations).. For ex: If we create a 12.2 database and it is the first 12.2 database that we create in the ECC, GRID 12.2 is also automatically created.. For ex: If we have GRID 12.1 and some 12.1 DBs residing in the ECC and we want to create our first 12.2 database, then the GRID is automatically upgraded to 12.2 as well.
  • The minimum supported DB version in ECC is 11.2.0.4. So we need to have our db compatible parameter set to 11.2.0.4 (minimum) in order to have a database on ECC -- this is related to the migration operations.
  • We can install Enterprise Manager agents on ECC. So our customer can manage and monitor ECC nodes and databases using its current Enterprise Manager Cloud or Grid control.
  • ECCs are virtualized. Only Oracle can access the hypervisor level. We and the customer can only access the DOMu. On the DOMu RAC nodes, we and the customer do the OS administration.. Backups, patching, rpm installation and everything.. The customer is responsible for the DOMu machines, where GRID and the databases run. The customer has root access on the DOMu nodes. (This means DB administration + OS administration still continues :))
  • So the customer can't access the cell servers, or even the ILOM consoles..
  • Administration of everything that resides below the DomU layer is done by Oracle.
  • Responsibility for everything that resides below the DomU layer is on Oracle.
  • Currently, for every physical node, we have a VM node. For ex: If we have a 1/2 ECC, we have 4 physical nodes and 4 VMs (DOMu nodes) -- 1 to 1.
  • We can create RAC multi-node or single node databases on ECC.
  • We can also create databases manually on ECC (without using the GUI).. Using scripts or runInstaller, everything can be done just like in the old days. (as long as the versions are compatible with ECC)
  • If we create a 12C database using the GUI, it comes as a Pluggable Database.. So if we want to have a non-CDB 12C database, we need to create it manually.
  • Customer can connect to the RAC nodes (DOMu nodes) using SSH keys. (without password).. This is a must.
  • Customer can install backup agents to ECC.. So without changing the current backup method and technology, customer can backup the databases running on ECC.
  • There is no external infiniband connection to ECC.. External connection can be max 10Gbit.
  • Enterprise Manager Express comes with ECC. We have direct links to Enterprise Manager Express in the Control plane.
  • IORM is also available on GUI. Using GUI, we can do all the IORM configuration.. 
  • In ECC, we can use In-memory and Active Dataguard .. Actually, we can use all the database options without paying any licenses.
  • If we create 12.2 Databases, they are created with TDE.. So TDE is a must for 12.2 databases on ECC.
  • However, we are not required to use TDE, if we are using 11G databases on ECC.
  • The ASM diskgroups on ECC are High Redundancy diskgroups. This is the default and cannot be changed!
  • Exadata Image upgrade operations on the ECC environments are done by Oracle.

That's all for now :) In my next blog post, I will show you how we can create database services on ECC. (using the GUI)

Monday, July 16, 2018

RDBMS -- Be careful while activating a standby database (especially in cascaded configurations)

Recently, a customer reported an issue with a standby database, which was out of sync with the primary. This standby database was the endpoint of a cascaded configuration.

The cascaded dataguard configuration in that customer environment, was as follows;

Primary -> Standby1 -> Standby2

So, the customer's requirement was to activate standby1 and continue applying the redo logs of the primary directly to standby2.

However, while activating -- actually, after activating -- the standby database named standby1, the customer accidentally made standby2 apply the redo logs generated by standby1.

When standby2 received and applied the archivelogs from standby1, standby2 became a new standby database of standby1, and it went out of sync with the initial production database.

Interesting, right?

In order to bring the database standby2 back in sync with its original primary database, we did the following;

We used the flashback database option to flash back standby2 to the point before it applied the archivelogs from standby1.

Then, we deleted the archivelogs received from standby1 and made sure that standby1 was not sending any archivelogs to standby2 until it was converted back to a physical standby. (this way we could ensure that standby2 was applying the redo logs only from the production database.)
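Just to make the flashback part concrete, it was roughly the following (the SCN here is of course hypothetical; it has to be a point before the logs coming from standby1 were applied, and flashback database must have been enabled with enough flashback retention to cover it);

sqlplus / as sysdba <<'EOF'
-- on standby2: stop redo apply, flash the database back, then restart apply from the original primary
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SHUTDOWN IMMEDIATE
STARTUP MOUNT
-- hypothetical SCN, from a point before the standby1 archivelogs were applied
FLASHBACK DATABASE TO SCN 123456789;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;
EOF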

Note that, if we hadn't had the possibility to use the flashback option, we would have had to recreate the standby database named standby2...

So, be careful while playing with the dataguard configuration.. Especially in cascaded environments... First check the configuration, then take the action.. In this real-life case, the dataguard configuration was from the primary to standby1 and from standby1 to standby2.. So when standby1 was activated, the path "from standby1 to standby2" kept working, and standby1 became the new primary for standby2.. The incarnation changed, and standby2 went out of sync with the original primary.
In order to prevent this from happening, the dataguard flow (configuration) should have been changed before activating standby1.