Monday, March 12, 2018

Exadata Patching -- Upgrading Exadata Software Versions / Image Upgrade

Recently, we completed upgrade work in a critical Exadata environment.
The platform was an Exadata X6-2 quarter rack, and our job was to upgrade the image versions of the InfiniBand switches, Cell nodes and Database nodes. (This is actually called patching Exadata.)

We did this work in 2 iterations: first in DR, then in PROD.
The upgrade was done with the rolling method.

We needed to upgrade the Exadata image version to 12.2.1.1.4. (It was 12.1.2.3.2 before the upgrade.)


Well.. Our action plan was to upgrade the nodes in the following order:

InfiniBand Switches
Exadata Storage Servers(Cell nodes)
Database nodes (Compute nodes)

We started the work by gathering info about the environment.

Gathering INFO about the environment:
------------------------------------------

Current image info: We gathered this info by running imageinfo -v on each node, including the cells. We expected to see the same image version on all nodes.

Example command:

root>dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root "imageinfo | grep 'Image version'"   --> for db nodes
root>dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "imageinfo | grep 'Image version'"  --> for cell nodes

In addition, we could check the image history using the imagehistory command as well..
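For example, the image history can be pulled from all nodes with the same dcli pattern used above:

root>dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root "imagehistory"   --> for db nodes
root>dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "imagehistory"  --> for cell nodes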

DB Home and GRID Home patch levels: We gathered opatch lsinventory outputs. (just in case)
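A minimal sketch of what we collected (the output file names below are just illustrative):

oracle> $ORACLE_HOME/OPatch/opatch lsinventory > /tmp/opatch_lsinv_dbhome.txt    --> run as the DB home owner
grid> $ORACLE_HOME/OPatch/opatch lsinventory > /tmp/opatch_lsinv_gridhome.txt    --> run as the GRID home owner, with ORACLE_HOME set to the GRID home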

SSH equivalency: We checked the ssh equivalency from db node 1 to all the cells, from db node 1 to all the InfiniBand switches, and from db node 2 to db node 1. (we used dcli to check this)

Example check:

with root user>
dcli -g cell_group -l root 'hostname -i'
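A similar dcli check can be run against the db nodes and the InfiniBand switches (the ib_group file here is just an illustrative list of the switch hostnames):

dcli -g /opt/oracle.SupportTools/onecommand/dbs_group -l root 'hostname -i'
dcli -g ib_group -l root 'hostname'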

ASM Diskgroup repair times: We checked whether the repair times were lower than 24h and noted the ones to be increased to 24h. (just before the upgrade of the cell nodes)

We used v$asm_diskgroup & v$asm_attribute.

Query for checking:
SELECT dg.name,a.value FROM v$asm_diskgroup dg, v$asm_attribute a WHERE dg.group_number=a.group_number AND a.name='disk_repair_time';

Setting the attributes: before the upgrade:
ALTER DISKGROUP diskgroup_name SET ATTRIBUTE 'disk_repair_time'='24h';
Setting the attributes back to their original values: after the upgrade:
ALTER DISKGROUP diskgroup_name SET ATTRIBUTE 'disk_repair_time'='3.6h';

ILOM connectivity: We checked ILOM connectivity using ssh from the db nodes to the ILOMs, and we verified console access using start /SP/console. (again, not web based; over SSH)
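For example (the ILOM hostname below is just illustrative):

# ssh root@dbadm01-ilom
-> start /SP/console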

profile files (.bash_profile etc..): We checked the .bash_profile and .profile files and removed the custom lines from those files. (before the upgrade)

After gathering the necessary info, we executed Exachk and concentrated on its findings:

Running EXACHK:
------------------------------------------

We first checked our exachk version using "exachk -v" and checked whether it was the most up-to-date version.. In our case, it wasn't. So we downloaded the latest exachk using the link given in the document named: "Oracle Exadata Database Machine exachk or HealthCheck (Doc ID 1070954.1)"

In order to run exachk, we unzipped the downloaded exachk.zip file and put it under the /opt/oracle.SupportTools/exachk directory.

After downloading and unzipping, we ran exachk using the "exachk -a" command as the root user. ("-a" means perform the best practice check and the recommended patch check. This is the default option; if no options are specified, exachk runs with -a.)
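The commands we used were along these lines:

# cd /opt/oracle.SupportTools/exachk
# ./exachk -v    --> shows the exachk version
# ./exachk -a    --> runs the best practice and recommended patch checks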


Then we checked the output of exachk and took the corrective actions where necessary.
After the exachk, we continued with downloading the image files.

Downloading the new Image files:
------------------------------------------
All the image versions and links to the patches were documented in "Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1)"

So we opened the document 888828.1 and checked the table for "Exadata 12.2" (as our Target Image Version was 12.2.1.1.4).
We downloaded the patches documented there..

In our case, following patches were downloaded;

Patch 27032747 - Storage server and InfiniBand switch software (12.2.1.1.4.171128): This is for the Cells and InfiniBand switches.

--note that, in some image versions, the InfiniBand switch software is delivered separately from the storage server patches.. For example, this is the case for 19.3.1.0.0..
--also note that, in some cases, you need to upgrade your InfiniBand software version to a supported release before upgrading it to your target release.

Patch 27103625 - x86-64 Database server bare metal / domU ULN exadata_dbserver_12.2.1.1.4_x86_64_base OL6 channel ISO image (12.2.1.1.4.171128)  : This is for DB nodes.

The Cell & InfiniBand patch was downloaded to DB node 1 and unzipped there. (SSH equivalency is required between DB node 1 and all the cells + all the InfiniBand switches) (it can be unzipped in any location)

The Database Server patch was downloaded to DB node 1 and DB node 2 (if the Exadata is a 1/4 or 1/8 rack) and unzipped there. (it can be unzipped in any location)


Note: the downloaded and unzipped patch files should be owned by the root user..
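If needed, the ownership can be fixed with chown (the staging directory below is just an example):

# chown -R root:root /u01/stage/exa_patches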

After downloading and unzipping the Image patches, we created the group files..

Creating the group files specifically for the image upgrade:
------------------------------------------

In order to execute patchmgr, which is the tool that performs the image upgrade, we created the files dbs_group, cell_group and ibswitches.lst.
We placed these files on db node1 and db node2.

cell_group file: contains the hostnames of all the cells.
ibswitches.lst file: contains the hostnames of all the InfiniBand switches.
dbs_group file on DB node 1: contains the hostname of only DB node 2.
dbs_group file on DB node 2: contains the hostname of only DB node 1.
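As an illustration (the hostnames are made up), the files on DB node 1 could look like this:

# cat cell_group
celadm01
celadm02
celadm03
# cat ibswitches.lst
ibsw01
ibsw02
# cat dbs_group     --> on DB node 1, this contains only DB node 2
dbadm02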

At this point, we were at an important stage, as our upgrade was about to begin.. However, we still had an important thing to do, and that was the precheck..

Running Patchmgr Precheck (first for Cells, then for Dbs, lastly for InfiniBand Switches -- actually, there was no need to follow an exact sequence for this): 
------------------------------------------

In this phase, we ran the patchmgr utility with the precheck argument to check the environment before the patchmgr-based image upgrade.
We used the patchmgr utility that comes with the downloaded patches.
We ran these checks using the root account.

Cell Storage Precheck: (we ran it from db node 1; it then connects to all the cells and does the check..)

Approx Duration : 5 mins total

# df -h (check free disk space; at least 5 GB free on / is okay)
# unzip p27032747_122110_Linux-x86-64.zip
# cd patch_12.2.1.1.4.171128/
# ./patchmgr -cells cell_group -reset_force
# ./patchmgr -cells cell_group -cleanup

# ./patchmgr -cells cell_group -patch_check_prereq -rolling

Database Nodes Precheck: (we ran it from db node 1 and db node 2, so each db node was checked separately.. This was because our dbs_group files contained only one db node name..)

Approx Duration : 10 mins per db.

# df -h (check free disk space; at least 5 GB free on / is okay)
# unzip p27032747_122110_Linux-x86-64.zip
# cd patch_12.2.1.1.4.171128/
# ./patchmgr -dbnodes dbs_group -precheck -nomodify_at_prereq -log_dir auto -target_version 12.2.1.1.4.171128 -iso_repo <patch>.zip

InfiniBand Switches Precheck: # ./patchmgr -ibswitches ibswitches.lst -upgrade -ibswitch_precheck

Note that, while doing the database precheck, we used the -nomodify_at_prereq argument so that patchmgr would not delete the custom rpms automatically during its run.

So, when we used -nomodify_at_prereq, patchmgr created a script to delete the custom rpms.. This script was named /var/log/cellos/nomodify*.. We could later (just before the upgrade) run this script to delete the custom rpms. (we actually didn't use this script, but deleted the rpms manually one by one :)
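For the manual removal, we simply used rpm (the package names are environment specific, so the one below is just a placeholder):

# rpm -qa > /tmp/rpm_list_before_upgrade.txt   --> keep a list of the installed packages for the reinstall step
# rpm -e <custom_rpm_name>                     --> remove a custom rpm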

Well.. We reviewed the patchmgr precheck logs. (note that we ignored custom rpm related errors, as we planned to remove them just before the upgrade)

Cell precheck output files were all clean.. We only saw an LVM-related error in the database node precheck outputs.

In the precheck.log file of db node 1, we had ->

ERROR: Inactive lvm (/dev/mapper/VGExaDb-LVDbSys2) (30G) not equal to active lvm /dev/mapper/VGExaDb-LVDbSys1 (36G). Backups will fail. Re-create it with proper size.

As for the solution: we implemented the actions documented in the following note. (we simply resized the lvm)

Exadata YUM Pre-Checks Fails with ERROR: Inactive lvm not equal to active lvm. Backups will fail. (Doc ID 1988429.1)

So, after the precheck, we were almost there :) we just had to do one more thing;

Discovering additional environment-specific configurations and taking notes for disabling them before the DB image upgrade:
------------------------------------------

We checked the existence of the customer's NFS shares and disabled them before the db image upgrade.
We also checked the existence of the customer's crontab settings and disabled them before the db image upgrade.
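Typical checks for this (run as root on each db node) look like the following:

# crontab -l             --> list root's crontab entries
# mount | grep nfs       --> currently mounted NFS shares
# grep nfs /etc/fstab    --> NFS mounts configured in fstab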

These were the final things to do before the upgrade commands..
So, at this point, we actually started executing the upgrade commands;

Running "Patchmgr for the upgrade" (first for infiniband switches, then for Cells,  lastly for Dbs)
------------------------------------------

Upgrading InfiniBand switches: (we ran it from db node 1; it then connects to all the InfiniBand switches and does the upgrade; the job is done in a rolling fashion)

Note: InfiniBand image versions are normally different from the cell & db image versions. This is because an InfiniBand switch is a switch, and its versioning is different from that of the cells and db nodes.

Note: We could get a list of the InfiniBand switches using the ibswitches command (we ran it from the db nodes as root).

We connected to DB node 1's ILOM (using ssh).
We ran the command start /SP/console.
Then, with the root user, we changed our current working directory to the directory where we unzipped the Cell Image patch.

Lastly, we ran (with root) -> # ./patchmgr -ibswitches ibswitches.lst -upgrade (approx: 50 mins total)

Upgrading Cells/Storage Servers: (we ran it from db node 1; it then connects to all the cell nodes and does the upgrade.. The job was done in a rolling fashion)

We connected to DB node 1's ILOM (using ssh).
We ran the command start /SP/console.
Then, with the root user, we changed our current working directory to the directory where we unzipped the Cell Image patch.
Lastly, we ran (using the root account) ->

# ./patchmgr -cells cell_group -patch -rolling   (approx : 90 mins total)

Alternatively, we can directly run the command without connecting to the ILOM.. In this case, we use nohup..

# nohup  ./patchmgr -cells cell_group -patch -rolling &

This command was run from DB node 1, and it upgraded all the cells in one go, rebooting them one by one, etc.. There was no downtime at the database layer.. All the databases were running during this operation.

After this command completed successfully, we cleaned up the temporary files with the command:
# ./patchmgr -cells cell_group -cleanup

We checked the new image version using the imageinfo & imagehistory commands on the cells and continued with upgrading the database nodes.
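The post-upgrade check can be done with the same dcli pattern used earlier (the grep is made case-insensitive here, as the cells report the version in an "Active image version" line):

root>dcli -g /opt/oracle.SupportTools/onecommand/cell_group -l root "imageinfo | grep -i 'image version'"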

Upgrading Database Nodes: (must be executed from node 1 for upgrading node 2 and from node 2 for upgrading node 1, so it is done in 2 iterations -- we actually chose this method..)

During these upgrades, the database nodes are rebooted automatically. In our case, once the upgrade was done, the databases and all other services were started automatically.

We first deleted the custom rpms. (note that we needed to reinstall them after the upgrade)

We disabled the custom crontab settings.
We unmounted the custom NFS shares. (we also disabled the nfs-mount-related lines in the relevant configuration files, for ex: /etc/fstab, /etc/auto.direct)
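The disabling steps were along these lines (the mount point and backup file name are just illustrative):

# crontab -l > /root/crontab.backup   --> back up the current crontab
# crontab -r                          --> remove the crontab entries
# umount /my_nfs_share                --> unmount the custom NFS share (after commenting out its /etc/fstab line)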

--upgrading image of db node 2

We connected to DB node 1's ILOM (using ssh).
We ran the command start /SP/console.
Then, with the root user, we changed our current working directory to the directory where we unzipped the Database Image patch.

Important note: Before running the below command, we modified the dbs_group file.. At this phase, dbs_group should include only db node 2's hostname. (as we were upgrading the nodes one by one, and we were upgrading db node 2 first -- rolling)

Next, we ran (with root) ->

# ./patchmgr -dbnodes dbs_group -upgrade -log_dir auto -target_version 12.2.1.1.4.171128 -iso_repo <patch>.zip   (approx: 1 hour)

Once this command completed successfully, we could say that the image upgrade of db node 2 was finished.

--upgrading image of db node 1

We connected to DB node 2's ILOM (using ssh).
We ran the command start /SP/console.
Then, with the root user, we changed our current working directory to the directory where we unzipped the Database Image patch.

Important note: Before running the below command, we modified the dbs_group file.. At this phase, dbs_group should include only db node 1's hostname. (as we were upgrading the nodes one by one; we had already upgraded db node 2, and this time we were upgrading db node 1 -- rolling)

Next, we ran (with root) ->

# ./patchmgr -dbnodes dbs_group -upgrade -log_dir auto -target_version 12.2.1.1.4.171128 -iso_repo <patch>.zip (approx: 1 hour)

Once this command completed successfully, we could say that the image upgrade of db node 1 was finished.

At this point, our upgrade was finished!!

We re-enabled the crontabs, remounted the NFS shares, reinstalled the custom rpms and started testing our databases.
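The restore side was the mirror image of the disable steps (file and package names are just illustrative):

# crontab /root/crontab.backup    --> restore the crontab from the backup
# mount -a                        --> remount the NFS shares after re-enabling the /etc/fstab lines
# rpm -ivh <custom_rpm>.rpm       --> reinstall the custom rpms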

Some good references:
Oracle Exadata Database Machine Maintenance Guide, Oracle.
Exadata Patching Deep Dive, Enkitec.
