
Hyper-V Cluster Intermittent Issues

Last week we started getting some very weird issues on the Hyper-V cluster at work: some VMs would randomly lose connectivity to other VMs. To put this in context, we have around 130 active VMs at any one time, and connectivity was intermittent on around 20 of them. Of those 20, each would lose connectivity to five or six other VMs.

These issues started to look like a general network problem, as they only affected traffic between VMs on different hosts. A job was logged with Juniper, our networking provider, to take a look.

Juniper’s investigation led to the cabling we’d installed about a month ago. These are direct attach copper (DAC) cables from fs.com with the Juniper specification programmed onto them. Juniper decided that the passive type wasn’t supported by the EX4650s we use as core switches; we needed the active ones. I couldn’t believe that Juniper didn’t support its “own” cabling on its flagship Mist-managed switching. We replaced the cables, and the intermittent issue got progressively worse as the day went on.

With my workplace being a college, our busiest time of the year is right around the corner: enrolment. The director wanted a fix, and wanted it now.

I started re-investigating the logs on the Hyper-V hosts and noticed this event appearing about 10 times a second on two of our four hosts:

The MAC address 00-15-5D-AE-25-18 has moved from port 53A6AB28-EBD3-4F72-B690-69406077B8A7 (Friendly Name:.) to port 9A8780F9-D931-4F85-B087-397A2F718C01 (Friendly Name:.).
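If you want to pull these events out programmatically rather than scrolling through Event Viewer, something along these lines should work. I’m assuming the events land in the System log under the Microsoft-Windows-Hyper-V-VmSwitch provider; adjust the filter if yours are written to a different log.

# Pull the most recent Hyper-V-VmSwitch event ID 25 entries (MAC address moved between ports).
# Log and provider names are assumptions; adjust if your events are written elsewhere.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-Hyper-V-VmSwitch'
    Id           = 25
} -MaxEvents 50 | Select-Object TimeCreated, Message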

This made no sense, as the LBFO team has the Hyper-V Port load balancing algorithm configured on it, so VMs shouldn’t change ports. I ran the following command to get the MAC addresses of the VMs:

Get-VM | Get-VMNetworkAdapter | Format-Table VMName, MacAddress

Once I found the VM, I moved it across to another host. Monitoring the logs on the original host, I noticed the events were still being generated five minutes later.

Now we were getting somewhere: we’d moved the VM associated with that MAC address, but the MAC address was still showing up on the original host. My spidey sense started tingling and I began to think duplicate MAC addresses. Running the Get-VM command above again, there we go: another VM with the same MAC address. Changing it reduced the problems but didn’t resolve them, so it was time to go hunting.
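Rather than hunting host by host, a quick way to find every duplicate across the cluster is to group all the adapters by MAC address. A rough sketch, assuming the FailoverClusters and Hyper-V PowerShell modules are available wherever you run it:

# List every MAC address that appears on more than one VM network adapter in the cluster.
Get-ClusterNode |
    ForEach-Object { Get-VM -ComputerName $_.Name | Get-VMNetworkAdapter } |
    Group-Object MacAddress |
    Where-Object Count -gt 1 |
    Select-Object -ExpandProperty Group |
    Format-Table VMName, ComputerName, MacAddress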

I found another couple of VMs with the same MAC address as each other but different to the original I was looking at. Changing one of those completely solved the problem.
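For reference, changing the offending adapter is quick once you know which VM it is. This is only a sketch with a placeholder VM name: with the VM shut down you can either assign a static MAC from your VMM pool, or flip the adapter back to dynamic so a fresh address is handed out on the next start.

# Placeholder VM name; shut down, switch the adapter back to a dynamic MAC, then start it again.
Stop-VM -Name 'DUPLICATE-VM'
Get-VMNetworkAdapter -VMName 'DUPLICATE-VM' | Set-VMNetworkAdapter -DynamicMacAddress
Start-VM -Name 'DUPLICATE-VM'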

So what happened? We use Microsoft’s Virtual Machine Manager (VMM) to set up and manage all of the networking for the cluster, and that management also makes use of MAC address pools to allocate addresses to VMs. The VMM server had been broken for a couple of days, and I think Hyper-V just decided to start assigning MAC addresses itself while we were moving VMs about trying to fix the problem.

I hope this helps someone else who runs into this issue, as researching that event (Hyper-V-VmSwitch event ID 25) yields practically no results relating to intermittent connectivity failure.

Update

After some time we started to notice that other VMs were getting duplicate MAC addresses again, which is rather annoying. Investigating the issue, we found that the duplicate MAC addresses being assigned all started with 00:15:5D. As we believed MAC addresses were allocated from the VMM pool, we started investigating that pool to see what was going on.

On review, we found that VMM had a MAC pool starting with 00:1D:D8, which clearly didn’t match the addresses being allocated to the VMs identified as duplicates. Looking into Hyper-V on each cluster node, we found that every node had a MAC pool matching 00:15:5D; in fact, every Hyper-V host in the cluster had the exact same pool, 00:15:5D:1E:1B:00-FF. So we now knew the MAC addresses were being allocated through Hyper-V instead of VMM. The question was why?
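Checking both sides yourself is straightforward. A sketch, assuming the Hyper-V and FailoverClusters modules for the per-host ranges and the VMM PowerShell module for the pool (the VMM property names are matched with a wildcard rather than named explicitly):

# Hyper-V's own dynamic MAC range on every cluster node.
Get-ClusterNode | ForEach-Object {
    Get-VMHost -ComputerName $_.Name |
        Select-Object ComputerName, MacAddressMinimum, MacAddressMaximum
}

# The pool VMM allocates from.
Get-SCMACAddressPool | Format-List Name, *MACAddress*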

Simply put, it was our backup system, Arcserve. As this system is not properly cluster-aware, what it does to restore a VM is connect to the cluster name, pull back a host within the cluster, then connect to that Hyper-V host directly and perform the restore there. This in turn drops the VM out of VMM management, meaning Hyper-V was handling the MAC address allocation. We confirmed this by loading Failover Cluster Manager and checking a couple of VMs: one which had SCVMM in its name and one which didn’t. Sure enough, the SCVMM VM had a MAC address starting with 00:1D:D8 and the other had an address starting with 00:15:5D.
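If you’d rather not click through Failover Cluster Manager VM by VM, grouping adapters by the first three octets of their MAC gives a quick picture of how many VMs are on Hyper-V-assigned addresses versus VMM-assigned ones (same assumptions as the duplicate-hunting sketch above):

# Count adapters by MAC prefix: 00155D is Hyper-V's default range, 001DD8 was our VMM pool.
Get-ClusterNode |
    ForEach-Object { Get-VM -ComputerName $_.Name | Get-VMNetworkAdapter } |
    Group-Object { $_.MacAddress.Substring(0, 6) } |
    Sort-Object Count -Descending |
    Format-Table Name, Count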

To understand why, here is a link to another blog post explaining how MAC addresses are handled in VMM and Hyper-V environments: https://www.darrylvanderpeijl.com/hyper-v-vmm-mac-addresses/

Once we had identified that, we ended up modifying each host to have a slightly different pool. Each host’s internal duplicate check only covers the VMs it knows about, so giving every host a non-overlapping range means any VM restored in the future shouldn’t get a duplicate address from Hyper-V.
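The change itself is just two properties per host via Set-VMHost. The ranges below are made-up examples to show the shape of it, so substitute your own non-overlapping values:

# Example only: give each host its own non-overlapping dynamic MAC range.
Set-VMHost -ComputerName 'HV01' -MacAddressMinimum '00155D1E1000' -MacAddressMaximum '00155D1E10FF'
Set-VMHost -ComputerName 'HV02' -MacAddressMinimum '00155D1E1100' -MacAddressMaximum '00155D1E11FF'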
