Monday, March 1, 2021

Oracle Linux - KVM -- VM network on Broadcom bond devices fails -- actually, the OS fails to add a Broadcom bnxt_en bond to a bridge

I recently dealt with a problem on Oracle Linux KVM. The customer was trying to implement Oracle Linux KVM using Oracle Linux Virtualization Manager (OLVM), but the network configuration was failing. The OS was Oracle Linux 7.9, 64-bit.

The issue was with the virtual machine network: it could not be assigned to the relevant bond device using OLVM. The bond device was configured with 2 slaves, the configuration was correct, the bonding mode was appropriate, and both the slaves and the master (bond) were active at the OS layer. But somehow OLVM could not assign the VM network (created by the customer using OLVM) to the relevant bond device.

No errors were seen in OLVM, and there were no errors in the OLVM logs (for instance, in engine.log), but I saw the following messages in the Oracle Linux syslog (/var/log/messages):

server01 kernel: VLAN2: port 1(bond1.10) entered blocking state

server01 kernel: VLAN2: port 1(bond1.10) entered disabled state

The VLAN2 shown in the logs above was actually a bridge. As you may already know, a VM network relies on bridges at the Linux layer. So it was clear that we had a bridge problem: the kernel was disabling the relevant bridge port.
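To spot this pattern quickly in a large /var/log/messages, a small script can pull out the bridge port state changes and report the ports whose last reported state is "disabled". This is only a minimal sketch: the regular expression is my assumption based on the two log lines above, and the function names are made up for illustration.

```python
import re

# Matches kernel bridge messages such as:
#   "server01 kernel: VLAN2: port 1(bond1.10) entered blocking state"
# The pattern is an assumption derived from the two lines quoted in this post.
PORT_STATE = re.compile(
    r"(?P<bridge>\S+): port (?P<num>\d+)\((?P<port>[^)]+)\) "
    r"entered (?P<state>\w+) state"
)

def bridge_port_events(lines):
    """Return (bridge, port, state) tuples for every bridge port state change."""
    events = []
    for line in lines:
        m = PORT_STATE.search(line)
        if m:
            events.append((m.group("bridge"), m.group("port"), m.group("state")))
    return events

def ports_left_disabled(lines):
    """Return the (bridge, port) pairs whose last reported state is 'disabled'."""
    last = {}
    for bridge, port, state in bridge_port_events(lines):
        last[(bridge, port)] = state  # later lines overwrite earlier ones
    return {key for key, state in last.items() if state == "disabled"}

sample = [
    "server01 kernel: VLAN2: port 1(bond1.10) entered blocking state",
    "server01 kernel: VLAN2: port 1(bond1.10) entered disabled state",
]
print(ports_left_disabled(sample))
```

On a real host, the lines of /var/log/messages could be passed in place of `sample`.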

This was what prevented OLVM from assigning the VM network to the bond device.

When we tried to add that bond (through its VLAN interface, bond1.10) to that bridge, the following error was shown in the log:

server01 network: Bringing up interface bond1.10: can't add bond1.10 to bridge VLAN2: No data available
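A quick way to confirm whether the interface really ended up in the bridge is to look at sysfs, where the kernel lists each bridge's ports under /sys/class/net/<bridge>/brif/. The sketch below does just that. The helper names and the `sysfs_root` parameter are made up for illustration (the parameter only exists so the check can be tried against a test directory).

```python
import os

def bridge_ports(bridge, sysfs_root="/sys/class/net"):
    """List the interfaces currently attached to `bridge` (empty list if none)."""
    brif = os.path.join(sysfs_root, bridge, "brif")
    if not os.path.isdir(brif):
        # Either the device does not exist or it is not a bridge.
        return []
    return sorted(os.listdir(brif))

def is_enslaved(iface, bridge, sysfs_root="/sys/class/net"):
    """True if `iface` is listed as a port of `bridge`."""
    return iface in bridge_ports(bridge, sysfs_root)

# Names taken from the log lines in this post.
print(is_enslaved("bond1.10", "VLAN2"))
```

In the failing state described here, one would expect this to report that bond1.10 is not a port of VLAN2.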

After some more analysis, I concluded that the problem wasn't in Oracle Linux KVM. It had to be in the Oracle Linux kernel or in the device driver associated with the Ethernet devices (in this case, Broadcom bnxt_en).

With this in mind, I did more targeted research and found similar bugs on the Red Hat side.

On the Red Hat support site, I found a bug with exactly the same symptoms: Bug 1860479 - Unable to attach VLAN-based logical networks to a bond.

The bug was recorded for Red Hat 8, but it seemed we had the same bug on Oracle Linux 7.9. Actually, the kernel version was the key, rather than the OS version.

The fix was upgrading the kernel, but the workaround was downgrading it (according to Red Hat).
I was after a quick win in this case, so I needed a lower-version kernel, and I decided to use the Red Hat Compatible Kernel instead of the UEK kernel. The server was rebooted with the Red Hat Compatible Kernel (installed as an alternative kernel in Oracle Linux), and the problem was solved! After booting with that kernel (a lower version than the UEK kernel), the customer could assign the VM network to the relevant bond device using OLVM.
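After the reboot, it is worth confirming which kernel the server is actually running. UEK release strings carry a "uek" tag (for example in the "el7uek" part), so a simple heuristic can tell the two flavors apart. This is a rough sketch: the function name is made up, and the two release strings in the comments are illustrative examples, not the exact versions from this case.

```python
import platform

def kernel_flavor(release):
    """Heuristic: 'UEK' if the release string carries the uek tag, else 'non-UEK'."""
    return "UEK" if "uek" in release.lower() else "non-UEK"

# Illustrative release strings (not the versions from this case):
#   "5.4.17-2036.100.6.1.el7uek.x86_64"  -> UEK
#   "3.10.0-1160.el7.x86_64"             -> non-UEK (Red Hat Compatible Kernel)
running = platform.release()
print(running, "->", kernel_flavor(running))
```

If this still reports UEK after the reboot, the default boot entry was probably not switched to the Red Hat Compatible Kernel.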

Note that this bug appears when the bond slaves are configured on 2 network ports belonging to 2 different Broadcom network cards. The bug doesn't appear when the bond slaves are on the same network card.
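To check whether a given bond falls into this category, the slave names can be read from /proc/net/bonding/<bond> and mapped to their PCI addresses through /sys/class/net/<iface>/device. The sketch below assumes that the ports of one physical card show up as different functions of the same PCI device (same domain:bus:device, different function number). That is common for multi-port cards but not guaranteed, so treat the result as a hint. All helper names and the sample interface names are made up for illustration.

```python
import os
import re

def bond_slaves(bonding_text):
    """Extract slave interface names from /proc/net/bonding/<bond> content."""
    return re.findall(r"^Slave Interface:\s*(\S+)", bonding_text, flags=re.M)

def pci_address(iface, sysfs_root="/sys/class/net"):
    """Return the PCI address behind `iface`, e.g. '0000:3b:00.0'."""
    link = os.path.join(sysfs_root, iface, "device")
    return os.path.basename(os.path.realpath(link))

def card_of(pci_addr):
    """Drop the function number: '0000:3b:00.1' -> '0000:3b:00'."""
    return pci_addr.rsplit(".", 1)[0]

def spans_multiple_cards(pci_addrs):
    """True if the addresses belong to more than one PCI device (assumed: card)."""
    return len({card_of(addr) for addr in pci_addrs}) > 1

# Hypothetical excerpt of /proc/net/bonding/bond1 with made-up interface names.
sample = "Slave Interface: ens1f0\nMII Status: up\nSlave Interface: ens2f0\n"
print(bond_slaves(sample))

# Hypothetical PCI addresses: two different devices, then two functions of one.
print(spans_multiple_cards(["0000:3b:00.0", "0000:5e:00.0"]))
print(spans_multiple_cards(["0000:3b:00.0", "0000:3b:00.1"]))
```

On a real host, the addresses would come from calling `pci_address()` on each name returned by `bond_slaves()`.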

That's it. I hope you find this article useful.
