Tuesday, February 9, 2021

OVM - Changing Cluster Heartbeat network, Questions related to High Availability & DR, Comments on bond device problems - LACP activity unknown, LACP suspend-individual

Recently I was asked 4 interesting questions about OVM. In this blog post, I will share my answers and comments on these questions. This might help OVM users.

The first question was about OVM clusters, specifically about the cluster heartbeat network: changing the heartbeat network and having more than one cluster heartbeat network.

This operation is a little costly; it requires VM shutdown, etc. If there is an available port in the environment, it can be added to the corresponding heartbeat bond. What is important here is having both port and switch redundancy in the relevant network path.

We shared the Oracle Support notes below. These notes may be followed to change the Cluster Heartbeat network:

  • OVM - How to Migrate Cluster HeartBeat Network Channel To A Different Bonded Interface Of An Existing Deployment (Doc ID 2408148.1)
  • How to Move the Heartbeat from One Network to Another in Oracle VM (Doc ID 1995619.1)
  • How to Move the Heartbeat from One Network to Another in OVM3 (Doc ID 1504140.1) – This method may not work on OVM 3.4. The other documents above seem more promising.

About having a heartbeat channel with more than one network:

We shared that there was a change in OVM 3.3.x that prevents having multiple networks with the cluster heartbeat role on a single server. So adding an additional network to the Cluster Heartbeat will probably not work. The add job will get an error like the following:

“Cannot add Ethernet device: eth1 on oraclevm, to network: hearbeat, because server: oraclevm, already has cluster network”. 

Even if VM Manager shows it as added, we think it will not actually be added in the background. Of course, you can test.

The second question was related to the general steps for having DR for OVM environments.

As for implementing DR, we shared the following A-Team document. The method explained in that document is applicable for this task (of course, if you have the necessary infrastructure to support the methods used in that document).

https://www.ateam-oracle.com/oracle-vm-storage-repository-replication-for-on-premise-fusion-applications-disaster-recovery

The third question was about why automatic switchover did not work. I mean, when one of the nodes in an OVM cluster crashes or fails, the relevant VM guests do not switch to the other surviving node (they don't start on the other node, although we see that the server pool is clustered).

Analysis showed that this was expected behavior. Although there was an OVM cluster, the Enable High Availability check box was not selected for the relevant VMs.

As for the solution, we recommended selecting that check box for the relevant VMs.

The following reference describes that decision mechanism clearly.

Ref:  https://docs.oracle.com/cd/E27300_01/E27309/html/vmusg-svrpool-ha.html

To automatically configure the server pool cluster and enable HA in a server pool, select the Clustered Server Pool check box when you create or edit a server pool. See Section 6.7, “Creating a Server Pool” and Section 6.8.3, “Editing a Server Pool” for more information on creating and editing a server pool.

To enable HA on a virtual machine, select the Enable High Availability check box when you create or edit a virtual machine. See Section 7.7, “Creating a Virtual Machine” and Section 7.9.2, “Editing a Virtual Machine” for more information on creating and editing a virtual machine.
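
The rule behind this behavior can be written down as a tiny sketch. This is only an illustration of the two check boxes described above, not Oracle VM code; the function name and its parameters are hypothetical.

```python
def vm_restarts_on_other_server(pool_clustered: bool, vm_ha_enabled: bool) -> bool:
    """Illustrative only: after a server failure, a VM is restarted on another
    server only when the pool is clustered AND the VM itself has HA enabled."""
    return pool_clustered and vm_ha_enabled


# The case from the question: clustered pool, HA check box not selected on the VM.
print(vm_restarts_on_other_server(pool_clustered=True, vm_ha_enabled=False))  # prints False
```

So a clustered server pool alone is not enough; the per-VM setting has to be enabled as well.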


The last question was about bonding-LACP configuration. The problem was a suspended port: the switch port connected to one of the slaves of a bond with a Bonding-LACP configuration was suspended.

Error messages in the logs on the switch side -> "LACP activity unknown", "LACP suspend-individual".

The bonding configuration and the syslogd messages of OVM were checked. There was no misconfiguration, and there wasn't any log record that pointed directly to the cause of this problem.

The problem was solved by restarting the eth devices of the related bonds whose ports were suspended (example commands: ifdown eth7, ifup eth7).

After these actions, we observed that the ports of those interfaces were not suspended again.
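
To verify this from the server side as well, the bond status can be inspected. The following is a minimal sketch, assuming the bond runs in 802.3ad mode and that its status is available in the usual /proc/net/bonding/<bond> format of the Linux bonding driver. The sample text, the interface names and the function name are made up for illustration; they are not taken from the environment in question.

```python
# Hypothetical sample in the style of /proc/net/bonding/bond0 (802.3ad mode).
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

802.3ad info
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 1

Slave Interface: eth6
MII Status: up
Aggregator ID: 1

Slave Interface: eth7
MII Status: up
Aggregator ID: 2
"""


def find_suspect_slaves(bonding_text):
    """Return slaves whose link is down or that are outside the active aggregator."""
    active_id = None   # Aggregator ID seen before the first slave section
    slaves = {}        # slave name -> {"mii": ..., "agg": ...}
    current = None     # slave section currently being parsed
    for raw in bonding_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue   # blank lines and headings without a colon
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
            slaves[current] = {}
        elif key == "MII Status" and current is not None:
            slaves[current]["mii"] = value
        elif key == "Aggregator ID":
            if current is None:
                active_id = value
            else:
                slaves[current]["agg"] = value
    suspects = []
    for name, info in slaves.items():
        link_down = info.get("mii") != "up"
        outside = active_id is not None and info.get("agg") != active_id
        if link_down or outside:
            suspects.append(name)
    return suspects


print(find_suspect_slaves(SAMPLE))  # prints ['eth7']
```

On a real server, the text would be read from the bond's file under /proc/net/bonding instead of the sample string. A slave that is up but carries a different Aggregator ID than the active one is a hint that LACP negotiation with the switch did not complete for that port.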

We recommended the following:

In case the problem reoccurs, a detailed analysis on the switch side should be done. Disabling the "LACP suspend-individual" setting should also be considered.

This setting is already shown as disabled in some MOS notes related to Exadata:

* Configure Exadata X8M Backup 40gbe ZS5-2 and ZS5-4 (Doc ID 2698913.1)
* Set Up and Configure Exadata X8M Backup with ZFS Storage ZS7-2 (Doc ID 2635423.1)

This environment was not an Exadata, but OVM is similar to Oracle Linux, which is what we have in Exadata.

Also, Red Hat has some documents on the same subject, recommending the same solution:

* LACP Linux Bonds not working properly on Cisco Nexus 9000-series switches

https://access.redhat.com/solutions/3702541

* Resolution: Disable LACP suspend-individual on the Cisco access port

As far as we can see, some switches also have these suspend-individual bugs.

At the end of the day, those suspend-individual messages didn't appear again. We suggested the setting "no lacp suspend-individual" on the switch side (if the problem reappears in the future).
