How It Started
I screwed up a vCenter instance. Actually, it is pretty easy to screw up the state-of-the-art hypervisor controller from its beautifully designed web UI, using the appealing buttons that have always been there. The process only requires 2 simple steps:
- Enable vCenter HA
- Replace the machine SSL certificate
The vCenter HA documentation does state, in the smallest font size possible, that “if you want to use custom certificates, you have to remove the vCenter HA configuration”, but the warning appears nowhere in the documentation about replacing SSL certificates, where it actually belongs. The UI won’t stop you from playing with fire, either.
If you have enough time and a lab environment, give it a try. The vCenter VM will reboot a few times before it completely stops working. It will still spin up, but you won’t be able to log in anymore. You’ll see a very unhelpful error message on the login screen:
An error occurred when processing the metadata during vCenter Single Sign-On setup - Failed to connect to VMware Lookup Service
By the way, don’t bother trying the vSphere Certificate Manager command-line tool to unscrew the situation; it will refuse to do anything once it detects that it is running in a vCenter HA cluster. So, if you don’t have any backup or snapshot to revert to, your vCenter is dead.
Things were a little more complicated in my case: the dead vCenter VM ran on a 3-node hyperconverged cluster with HA, DRS and vSAN. With vCenter down, I had a problem.
How It’s Going
Luckily, the ESXi hypervisor is largely independent from vCenter, so I could still log in to the individual hypervisors and work from there. Now I had to do something to (hopefully) make the situation better.
Preparing
The first obvious thing I did was to shut down the old vCenter VMs. They no longer worked and might interfere with the recovery process.
Next, I backed up all important data on the cluster. Backing up an ESXi hypervisor is easy: mount some NFS storage on each hypervisor and manually move/copy the VMs over. vMotion wouldn’t be available, so everything had to be done by hand, with the VMs shut down.
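A minimal sketch of what this can look like from a host shell. The NFS server, export path, datastore label and VM name below are placeholders, not my actual environment; also keep in mind that on vSAN the virtual disks are objects, so a plain cp only grabs the small descriptor/config files and vmkfstools has to clone the actual disks:
# Mount temporary NFS storage on the host (server, export and label are examples)
esxcli storage nfs add -H nas01.corp.contoso.com -s /export/vmbackup -v backupNFS
# Copy the small VM files, then clone each virtual disk out of the vSAN datastore
mkdir /vmfs/volumes/backupNFS/somevm
cp /vmfs/volumes/vsanDatastore/somevm/*.vmx /vmfs/volumes/backupNFS/somevm/
vmkfstools -i /vmfs/volumes/vsanDatastore/somevm/somevm.vmdk /vmfs/volumes/backupNFS/somevm/somevm.vmdk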
Then I shut down as many VMs as I could. Although it might be possible to rebuild the cluster while keeping some VMs running, I recommend against it.
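If you’d rather script this from each host’s shell than click through the host web UI, vim-cmd can do it; a quick sketch:
# List registered VMs and their numeric IDs on this host
vim-cmd vmsvc/getallvms
# Gracefully shut down a VM by ID (requires VMware Tools); power.off forces it
vim-cmd vmsvc/power.shutdown <vmid>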
Prepare a vCenter installer ISO on the workstation, and let’s get into the recovery process.
The First (Unsuccessful) Attempt
Being rather unfamiliar with the new vSphere 7.0, my initial strategy was to simply reinstall vCenter directly onto the vSAN storage, take over the hosts, rebuild the distributed switch by hand, and re-configure the cluster. That did not work: while adding the first host, vCenter reported “Found host(s) esxi02.corp.contoso.com, esxi03.corp.contoso.com participating in the vSAN service which is not a member of this host’s vCenter cluster”, and after a few seconds, vCenter froze. Later investigation showed that vCenter had detected vSAN configured on the host and pushed a single-node vSAN configuration onto it, breaking the very storage it was running on.
Now I had 2 problems: a dead vCenter, and a 3-node vSAN cluster in a split-brain situation.
The Second (Successful) Attempt
Knowing that vSAN won’t automatically delete any inaccessible/broken object, I was confident that all my data was still there; it was just the vSAN configuration that needed to be fixed to at least keep the storage running. After some searching on the Internet, I found out that you can actually manage the entire vSAN configuration from the ESXi hypervisor host! The official documentation on the esxcli vsan subcommand is not very helpful, but it was enough to get me on the right track.
I enabled SSH on all the hosts, and issued this command on every host:
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
This essentially tells the vSAN agent on every host to ignore everything sent by any vCenter. With the “manual transmission” mode engaged, I started to recover the vSAN.
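It’s worth double-checking that the flag actually took on each host before going further:
# Should report a value of 1 on every host
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates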
First let’s confirm the status:
[root@esxi01:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
Enabled: true
Current Local Time: 2020-10-29T07:05:18Z
Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Member HostNames: esxi01.corp.contoso.com
Sub-Cluster Membership UUID: 665dbc18-5bde-4cb6-a510-7c5185c78f3d
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: dadf3e7c-8162-4815-9d02-08af4d8c4c7b 2 2020-10-29T06:29:11.652
[root@esxi03:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
Enabled: true
Current Local Time: 2020-10-29T07:09:48Z
Local Node UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
Sub-Cluster Backup UUID: 04e3bd93-2846-4474-bae7-e16b602e316f
Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 2
Sub-Cluster Member UUIDs: 67874ba3-8fd5-463f-80fb-6a82910c5ff2, 04e3bd93-2846-4474-bae7-e16b602e316f
Sub-Cluster Member HostNames: esxi03.corp.contoso.com, esxi02.corp.contoso.com
Sub-Cluster Membership UUID: 3b5c9a5f-3063-68bb-eafc-0c42a1719576
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: dd0af2e3-d7e0-4407-9a50-d87be61513b3 9 2020-10-22T08:59:00.661
We indeed had a split brain. Next, kick esxi01 out of its imaginary one-node cluster (a very slow process; have some patience), and re-join it using the correct sub-cluster UUID taken from the other hosts' config:
[root@esxi01:~] esxcli vsan cluster leave
[root@esxi01:~] esxcli vsan cluster join -u 3a02d572-728d-482b-a94d-2245a6ec99d1
[root@esxi01:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
Enabled: true
Current Local Time: 2020-10-29T07:09:55Z
Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Member HostNames: esxi01.corp.contoso.com
Sub-Cluster Membership UUID: ab6a9a5f-2401-89af-99aa-0c42a171e24e
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: None 0 0.0
A vCenter-configured vSAN cluster runs in unicast mode (i.e. peer discovery depends on the IP list pushed by the control plane), so we also need to synchronize the cluster’s IP address list on every host. First, verify that the VMkernel adapter for vSAN is set up on esxi01:
[root@esxi01:~] esxcli vsan network list
Interface
VmkNic Name: vmk2
IP Protocol: IP
Interface UUID: 699fe1e6-eaba-49db-9d04-8859ed2b066f
Agent Group Multicast Address: 224.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 224.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Data-in-Transit Encryption Key Exchange Port: 0
Multicast TTL: 5
Traffic Type: vsan
If you don’t see the “vsan” traffic type in the output, reconfigure your VMkernel adapter. Since esxi02 and esxi03 already know each other, we can collect the list from those 2 hosts…
[root@esxi02:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- ----------------------------------------------------------- --------------
67874ba3-8fd5-463f-80fb-6a82910c5ff2 0 true 192.168.1.201 12321 73:F4:93:D8:D8:2A:C0:D3:4F:A6:DF:4D:3D:BE:34:8C:15:D9:45:52 3a02d572-728d-482b-a94d-2245a6ec99d1
9f7326ad-f815-45b1-a809-ece25fddc7ec 0 true 192.168.1.215 12321 05:B1:CF:D5:09:6A:05:7C:D7:C4:69:69:7A:85:04:90:51:D4:9A:D6 3a02d572-728d-482b-a94d-2245a6ec99d1
[root@esxi03:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- ----------------------------------------------------------- --------------
9f7326ad-f815-45b1-a809-ece25fddc7ec 0 true 192.168.1.215 12321 05:B1:CF:D5:09:6A:05:7C:D7:C4:69:69:7A:85:04:90:51:D4:9A:D6 3a02d572-728d-482b-a94d-2245a6ec99d1
04e3bd93-2846-4474-bae7-e16b602e316f 0 true 192.168.1.160 12321 6D:E4:62:CA:FB:17:96:41:97:F4:22:B9:8F:D8:B2:5E:93:0F:79:0D 3a02d572-728d-482b-a94d-2245a6ec99d1
then play them back onto esxi01 (if you have a vSAN witness appliance, the arguments change slightly; see the sketch after the listing below):
[root@esxi01:~] esxcli vsan cluster unicastagent add -a 192.168.1.201 -U true -u 67874ba3-8fd5-463f-80fb-6a82910c5ff2 -t node
[root@esxi01:~] esxcli vsan cluster unicastagent add -a 192.168.1.160 -U true -u 04e3bd93-2846-4474-bae7-e16b602e316f -t node
[root@esxi01:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
67874ba3-8fd5-463f-80fb-6a82910c5ff2 0 true 192.168.1.201 12321 3a02d572-728d-482b-a94d-2245a6ec99d1
04e3bd93-2846-4474-bae7-e16b602e316f 0 true 192.168.1.160 12321 3a02d572-728d-482b-a94d-2245a6ec99d1
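For completeness, the witness variant I alluded to would look roughly like this; I had no witness appliance in this cluster, so treat it as an untested sketch based on the command's documented options:
# Witness entries use -t witness instead of -t node (untested in my environment)
esxcli vsan cluster unicastagent add -t witness -u <witness_node_uuid> -U true -a <witness_ip> -p 12321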
As esxi01’s IP addresses have not changed, no changes are needed on the other 2 hosts. Let’s verify that vSAN is up and running again.
[root@esxi01:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-10-29T07:15:40Z
Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Local Node Type: NORMAL
Local Node State: AGENT
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
Sub-Cluster Backup UUID: 04e3bd93-2846-4474-bae7-e16b602e316f
Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
Sub-Cluster Membership Entry Revision: 3
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 67874ba3-8fd5-463f-80fb-6a82910c5ff2, 04e3bd93-2846-4474-bae7-e16b602e316f, 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Member HostNames: esxi03.corp.contoso.com, esxi02.corp.contoso.com, esxi01.corp.contoso.com
Sub-Cluster Membership UUID: 3b5c9a5f-3063-68bb-eafc-0c42a1719576
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 9f7326ad-f815-45b1-a809-ece25fddc7ec 2 2020-10-29T07:15:25.0
Yay!
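Before touching anything else, it doesn’t hurt to ask vSAN how the objects themselves look. Assuming your build has the esxcli vsan debug namespace (which I believe exists on vSAN 6.6 and later), a quick check from any host:
# Summarize object health across the cluster (healthy / reduced availability / inaccessible)
esxcli vsan debug object health summary get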
The rest of the steps are pretty straightforward. The key takeaway here is: to join a host to a cluster, it must either be in maintenance mode (i.e. with all VMs shut off) or have only vCenter running on it. All the other steps are there to solve the chicken-and-egg problem.
- Shut down all the VMs running on the hosts, if you haven’t already done so
- Find the node with the oldest CPU (assume it is esxi01), and if possible, connect a temporary non-vSAN datastore (NFS or a local storage device)
- Install vCenter onto esxi01 using the temporary datastore
- Set up vCenter (networking, admin user, certificate)
- Add esxi01 to the vCenter, put it in a new cluster, you can use the cluster quickstart wizard but do not let it configure networking for you
- Enable VMware EVC on the new cluster
- If you have a backup for distributed switch config, restore it; otherwise configure a new distributed switch
- Add another host (say, esxi02) to the vCenter, but do not add it to a cluster yet
- Add esxi02 to the distributed switch and migrate all adapters
- vMotion the vCenter VM to esxi02
- Add esxi01 and esxi03 to the distributed switch and migrate all adapters
- Go to the web portal of esxi02 and esxi03, put them into maintenance mode, and set the vSAN migration mode to “no data migration” (do not use vCenter to put them into maintenance mode, as that will cause vSAN to evict data; also, this will temporarily block all requests to the vSAN datastore, so make sure nothing is running on it; a CLI sketch follows this list)
- Add esxi02 and esxi03 to the cluster and configure the cluster in the quickstart wizard
- If this caused vSAN to move some data back and forth, wait for the migration to finish
- Verify that all objects in vSAN are readable, and try restarting the VMs
- vMotion the vCenter back onto the vSAN datastore
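As for the maintenance-mode step above, the host shell can do the same thing as the host web portal; a rough equivalent, assuming the --vsanmode option present on these builds:
# Enter maintenance mode without moving any vSAN data (same as "no data migration" in the UI)
esxcli system maintenanceMode set -e true --vsanmode noAction
# Leave maintenance mode once the host has been added to the new cluster
esxcli system maintenanceMode set -e false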
Now we have a new vCenter server and a new cluster good to go.
Cleaning Up
If you still want to configure vSAN from vCenter later, first execute the following command on every ESXi host:
esxcfg-advcfg -d /VSAN/IgnoreClusterMemberListUpdates
This allows the vSAN agent to receive further configuration from the vCenter. Then let vCenter synchronize once with all the hosts: Cluster -> Monitor -> Skyline Health -> vCenter state is authoritative -> click on “UPDATE ESXI CONFIGURATION”.
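Equivalently, setting the value back to 0 by hand does the same thing as the -d form above:
# Stop ignoring vCenter's membership updates (0 is the default)
esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates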
If you have custom storage policies, you can restore them from the vCenter appliance shell using the Ruby vSphere Console (RVC):
Command> rvc administrator@vsphere.local@localhost
vsan.recover_spbm /localhost/<datacenter_name>/computers/<cluster_name>
The vSAN default policy will be created automatically.
If you have any inaccessible objects, SSH into one of the hosts containing that object and delete it manually:
/usr/lib/vmware/osfs/bin/objtool delete -f -v 10 -u <object_uuid>
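If you are not sure which objects are inaccessible in the first place, the debug namespace mentioned earlier can list them along with their health; I believe the invocation is roughly:
# List vSAN objects and their health; note the UUIDs of anything reported as inaccessible
esxcli vsan debug object list --all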
The following things will need to be rebuilt by hand in the new vCenter:
- users, groups, permissions
- content libraries
- host profiles
- HA & DRS
- VM rules
If you have vSAN file services configured, you might need to re-enable them from vCenter. You will need to re-upload the OVAs, and you won’t be able to change the configuration. Note that vSAN file services in 7.0U1 is extremely buggy and locked itself up on my cluster (I couldn’t enable, disable, configure, or use it), so I currently do not recommend using it in production.
If you get an “Unable to connect to MKS” error when connecting to VM consoles on the new vCenter, see the VMware KB article “Unable to connect to MKS” error in vSphere Web Client (2115126).
Final Thoughts
One thing I like about vSphere is its ability to continue functioning without a centralized control plane. HA, multiple-access datastores, and vSAN are all designed around this basic assumption, and it has saved me many times. On the other hand, vCenter is a fragile thing, and vCenter 7.0, with a lot of legacy Java components rewritten in Python, is more fragile than ever.
Always export and back up your distributed switch config, even if you have automated backups for vCenter. This will save you a lot of time if you ever have to set up a new vCenter. If you have vSAN file services configured, failing to restore the old distributed switch after a vCenter rebuild might render the entire file service inaccessible. (If you can’t re-enable it from the vSphere UI, try calling the vCenter API vim.vsan.ReconfigSpec with a different port group; there is a chance it works, but your mileage may vary.)
References
- The Resiliency of vSAN - Recovering my 2-Node Direct Connect While Preserving vSAN Datastore
- Configure 2-Node VSAN on ESXi Free Using CLI Without VCenter
- VMware vSAN cache disk failed and how to recover from it
- vSAN question: Restore VCSA on vSAN
- Administering VMware vSAN (PDF)
- Purge inaccessible objects in VMware vSAN
- Fixing these dratted Unknown vSAN Objects
- Fix orphaned vSAN objects
- VMware VSAN delete/purge inaccessible objects
- VMware® Ruby vSphere Console Command Reference for Virtual SAN (PDF)