Monday, October 28, 2019

ODA X6-2 HA -- Virtualized // Network Architecture / Internal NFS and fault tolerant cabling

Recently I dealt with a strange ODA issue.. The environment was a virtualized ODA X6-2 HA, and the customer was complaining about the strange networking behaviour of the machine..
The issue appeared during a network H/A switch test...
That is, this ODA machine was connected to 2 network switches (good ones) to provide full network path fault tolerance.

As you may already know, these types of machines, I mean ODA X6-2 HA machines, have 4 ethernet ports attached to each of their compute nodes. This makes a total of 8 ports (2 nodes x 4 ports), and when cabled correctly to the customer switches, it should supply at least some level of fault tolerance..

Okay, what do we expect from this kind of a configuration?

Actually, we expect fault tolerance against port failures, against cable failures and even against a switch failure..


Well.. as you may guess, this environment wasn't behaving as expected, and that's why this issue was escalated to me...

Before giving you the solution (which is actually a basic one), I will try to give you the big picture...
What I mean by the big picture is the network architecture of a virtualized ODA.
I will give you that info because I see there is a misconception there.

This kind of misconception is actually to be expected.. Because these machines have an architecture that contains lots of things.. Not only the hardware-related things, but also the software-related things (installations and configurations).

Although ODA environments can be installed and configured easily, there are Infiniband cards, private networks, NFS shares between domains, ASM, ACFS, OVM, shared repositories, bonding, eth devices, virtual networks and virtual bridges installed and configured on them.

Let's start with the interconnect and with the internal NFS which is exported from the ODA_BASE to DOM 0.

The interconnect is based on Infiniband.. (it depends on your order of course, but most of the time it is based on Infiniband.)

So the RAC interconnect is based on Infiniband, and the cluster_interconnects parameter should be set accordingly for the RAC databases running on the privileged ODA_BASE domain.
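Just to make this concrete, here is a minimal sketch of how that parameter would be set (the instance names and the 192.168.16.x addresses below are placeholders/assumptions, not values taken from this particular environment):

    # On ODA_BASE: check which network is registered as the cluster interconnect
    # (assuming the Grid Infrastructure environment/PATH is already set)
    oifcfg getif

    # Then set cluster_interconnects per instance (illustrative values only)
    sqlplus / as sysdba <<'EOF'
    ALTER SYSTEM SET cluster_interconnects='192.168.16.24' SCOPE=SPFILE SID='MYDB1';
    ALTER SYSTEM SET cluster_interconnects='192.168.16.25' SCOPE=SPFILE SID='MYDB2';
    EOF

    # A database restart is needed for the SPFILE-only change to take effect.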

The NFS shares that store the shared repositories and the virtual machine files are a little interesting.
ODA_BASE reaches the storage shelf (actually the disks), creates ACFS filesystems on top of ASM, and then shares these filesystems to DOM 0 over NFS.

So, yes these NFS shares are exported from a child domain (ODA_BASE) to the main domain/hypervisor level (DOM 0).

The role of these shares is very important.. They basically store the virtual machine files, and this means that when they are down, the virtual machines are down.
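If you want to see these exports with your own eyes, the standard NFS tooling is enough. A small sketch (the hostname/VIP below is just an example name, use your own):

    # On ODA_BASE: list what is exported over NFS
    exportfs -v

    # On DOM 0: query the exports published by ODA_BASE and check where they are mounted
    showmount -e oda1base-vip     # "oda1base-vip" is a placeholder name
    mount | grep -i nfs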

By the way, these NFS shares are not based on the Infiniband!
These NFS shares are not based on the ethernet either..
These NFS shares are actually based on the virtual bridges, a logical network between ODA_BASE and DOM 0.. So this means the network traffic on these NFS shares is based on memory copy. No physical devices are in the path..

Note that the virtual bridges are in memory; they are code, not physical network hardware.
There is no physical medium (i.e. an Ethernet cable) when we talk about these types of bridges, so their capabilities are bound by the limits of the CPUs moving the packets between the virtual NICs. This configuration should give the best performance.. However, we need to take NFS and TCP traffic-related bottlenecks into account as well..
In other words, even though this traffic is based on TCP and NFS in the upper layers, it is actually based on memory operations between the VM domains in the background.
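A quick way to observe this from DOM 0 is to list the bridges themselves. This is just a sketch of what to look at (the bridge names net1/net2 are the defaults mentioned in this post, anything else depends on your deployment):

    # On DOM 0: list the virtual bridges and the interfaces plugged into them
    brctl show

    # The bond devices of DOM 0 and the vif interfaces of ODA_BASE should appear
    # under the same bridges, which is why ODA_BASE <-> DOM 0 traffic never leaves the box.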

There is another thing to be aware of..
These NFS shares are served on virtual IP addresses (VIPs). These VIP addresses are dynamic cluster resources, as you know.

So, if ODA_BASE fails on node 1, the VIP of ODA_BASE node 1 is transferred to ODA_BASE node 2. However, at this point the routes change.. In this case, the NFS traffic between ODA_BASE node 2 and DOM 0 of node 1 goes through the Infiniband. (The route outputs of the DOM 0 and ODA_BASE machines support this info as well.)
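You can verify which path this NFS traffic takes (before and after a VIP failover) with the ordinary routing tools. A sketch, with the VIP address left as a placeholder:

    # On DOM 0: which server address are the shared repositories mounted from? (it is a VIP)
    mount | grep -i nfs

    # On DOM 0 and on ODA_BASE: which interface/route is used to reach that VIP?
    route -n
    ip route get <VIP-address-of-ODA_BASE>    # replace the placeholder with the actual VIP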



Yes, knowing this, we can say that even when there is a problem in the Infiniband or ethernet-based network of the ODA, the NFS shares (which are crucial) will continue to work! The virtual machines created on the ODA will keep running..
  
Okay, let's take a look at the ethernet side of this.

As mentioned at the beginning of this blog post, there are 4 ethernet ports in each one of the ODA nodes.

When we look from the DOM 0 perspective, we see that in each ODA node we have eth0, eth1, eth2 and eth3.

These devices are paired and bonded (in active-backup mode in DOM 0).

eth0 and eth1 are the slaves of bond0.
eth2 and eth3 are the slaves of bond1.
bond0 is mapped to the virtual bridge named net1, and net1 is mapped to ODA_BASE.
bond1 is mapped to the virtual bridge named net2, and net2 is mapped to ODA_BASE.

ODA_BASE machines by default use net1, so in the background they use bond0 in DOM 0, i.e. eth0 and eth1 in DOM 0.
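If you want to verify this mapping on your own DOM 0, the bonding and bridge state can be read directly. A small sketch (interface and bridge names are the defaults described above):

    # On DOM 0: check the bonding mode and which slave is currently active
    cat /proc/net/bonding/bond0
    cat /proc/net/bonding/bond1

    # On DOM 0: check which bond sits under which virtual bridge
    brctl show net1
    brctl show net2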

net2 is not configured by default, so we must do an additional configuration to make use of it.

In order to do this configuration, we can use "oakcli configure additionalnet", or we can use the manual method: editing the ifcfg files by hand.
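For reference, the oakcli way is just running the command below from ODA_BASE and answering its prompts. The ifcfg lines that follow are only an illustration of what the manual method ends up with (the IP values are placeholders, not taken from a real system):

    # On ODA_BASE (interactive; it asks for the interface, IP, netmask, and so on)
    oakcli configure additionalnet

    # Manual alternative (illustrative): /etc/sysconfig/network-scripts/ifcfg-eth1 on ODA_BASE
    #   DEVICE=eth1
    #   ONBOOT=yes
    #   BOOTPROTO=static
    #   IPADDR=10.10.10.11        # placeholder
    #   NETMASK=255.255.255.0     # placeholder
    # ...and then restart the networking: service network restart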

Anyways, suppose we have all 4 ports cabled and eth1 is enabled on ODA_BASE. Suppose we have 2 switches in front of the ODA and we connect our cables to these switches randomly (2 cables to each switch per ODA node)... So now, are we on the safe side? Can we survive a port failure? Can we survive a switch failure?

Well, that depends on the cabling.. This is for sure, as this is the thing that made me write this blog post..

That is, we need to have some sort of cross cabling...

We need to ensure that eth0 (DOM 0) of a node and eth1 (DOM 0) of that same node are connected to different switches.

If we cable them into the same switch, then we can't survive a switch failure, because our bond will be down.. And if it is bond0, we will lose all the database traffic in ODA_BASE of the relevant node..

There is RAC, yes! RAC lets us survive even this kind of disaster, but why create extra work for RAC? Why lose our active resource because of a network problem? :)
By the way, even with RAC, the application design (TAF and so on) should be done properly in order to have zero downtime in the application layer.

Anyways, the cause that made me write this blog post was an incorrect cabling, and here I am giving you the correct cabling for the eth devices of the ODA nodes.
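The original picture is not reproduced here, but the rule above translates into a cabling layout like the following sketch (per ODA node, using the DOM 0 interface names):

    eth0 (bond0 slave)  -> Switch A
    eth1 (bond0 slave)  -> Switch B
    eth2 (bond1 slave)  -> Switch A
    eth3 (bond1 slave)  -> Switch B

This way each bond always has one slave on each switch, so a port failure, a cable failure or even a complete switch failure still leaves both bond0 and bond1 with a working slave.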



If you want to ask a question, please don't ask it here in the comments..

For your questions, please create an issue in my forum.

Forum Link: http://ermanarslan.blogspot.com.tr/p/forum.html

Register and create an issue in the related category.
I will support you from there.