How It Started
I screwed up a vCenter instance. Actually, it is pretty easy to screw up the state-of-the-art hypervisor controller from its beautifully designed web UI, using the appealing buttons that have always been there. The process only requires 2 simple steps:
- Enable vCenter HA
- Replace the machine SSL certificate
The vCenter HA documentation does state, in the smallest font size possible, that “if you want to use custom certificates, you have to remove the vCenter HA configuration”, but the warning appears nowhere in the documentation about replacing SSL certificates, where it actually belongs. The UI won’t stop you from playing with fire, either.
If you have enough time and a lab environment, give it a try. The vCenter VM will reboot a few times before it completely stops working. It will still spin up, but you won’t be able to log in anymore. You’ll see a very unhelpful error message on the login screen:
An error occurred when processing the metadata during vCenter Single Sign-On setup - Failed to connect to VMware Lookup Service
By the way, don’t bother trying the vSphere Certificate Manager command-line tool to unscrew the situation; it will refuse to do anything once it detects that it is running in a vCenter HA cluster. So, if you don’t have any backup or snapshot to revert to, your vCenter is dead.
Things were a little more complicated in my case: the dead vCenter VM ran on a 3-node hyperconverged cluster with HA, DRS and vSAN. With vCenter down, I had a problem.
How It’s Going
Luckily, the ESXi hypervisor is largely independent from vCenter, so I could still log in to the individual hypervisors and work from there. Now I had to do something to (hopefully) make the situation better.
Preparing
The first obvious thing I did was to shut down the old vCenter VMs. They no longer worked and might interfere with the recovery process.
Next, I backed up all important data on the cluster. Backing up an ESXi hypervisor is easy: mount some NFS storage on each hypervisor and manually move/copy the VMs over. vMotion wouldn’t be available, so everything had to be done by hand, with the VMs shut down.
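A minimal sketch of what this can look like from a host shell. The NFS server, export path, datastore label and VM name below are placeholders, not my actual environment; also keep in mind that on vSAN the virtual disks are objects, so a plain cp only grabs the small descriptor/config files and vmkfstools has to clone the actual disks:
# Mount temporary NFS storage on the host (server, export and label are examples)
esxcli storage nfs add -H nas01.corp.contoso.com -s /export/vmbackup -v backupNFS
# Copy the small VM files, then clone each virtual disk out of the vSAN datastore
mkdir /vmfs/volumes/backupNFS/somevm
cp /vmfs/volumes/vsanDatastore/somevm/*.vmx /vmfs/volumes/backupNFS/somevm/
vmkfstools -i /vmfs/volumes/vsanDatastore/somevm/somevm.vmdk /vmfs/volumes/backupNFS/somevm/somevm.vmdk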
Then I shut down as many VMs as I could. Although it might be possible to rebuild the cluster while keeping some VMs running, I recommend against it.
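If you’d rather script this from each host’s shell than click through the host web UI, vim-cmd can do it; a quick sketch:
# List registered VMs and their numeric IDs on this host
vim-cmd vmsvc/getallvms
# Gracefully shut down a VM by ID (requires VMware Tools); power.off forces it
vim-cmd vmsvc/power.shutdown <vmid>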
Prepare a vCenter installer ISO on the workstation, and let’s get into the recovery process.
The First (Unsuccessful) Attempt
Being rather unfamiliar with the new vSphere 7.0, my initial strategy was to simply reinstall vCenter directly onto the vSAN storage, take over the hosts, rebuild the distributed switch by hand, and re-configure the cluster. That did not work: while adding the first host, vCenter reported “Found host(s) esxi02.corp.contoso.com, esxi03.corp.contoso.com participating in the vSAN service which is not a member of this host’s vCenter cluster”, and after a few seconds, vCenter froze. Later investigation showed that vCenter had detected vSAN configured on the host and pushed a single-node vSAN configuration onto it, breaking the very storage it was running on.
Now I had 2 problems: a dead vCenter, and a 3-node vSAN cluster in a split-brain situation.
The Second (Successful) Attempt
Knowing that vSAN won’t automatically delete any inaccessible/broken object, I was confident that all my data was still there; it was just the vSAN configuration that needed to be fixed to at least keep the storage running. After some searching on the Internet, I found out that you can actually manage the entire vSAN configuration from the ESXi hypervisor host! The official documentation on the esxcli vsan subcommand is not very helpful, but it was enough to get me on the right track.
I enabled SSH on all the hosts, and issued this command on every host:
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
This essentially tells the vSAN agent on every host to ignore everything sent by any vCenter. With the “manual transmission” mode engaged, I started to recover the vSAN.
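It’s worth double-checking that the flag actually took on each host before going further:
# Should report a value of 1 on every host
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates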
First let’s confirm the status:
[root@esxi01:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
Enabled: true
Current Local Time: 2020-10-29T07:05:18Z
Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Member HostNames: esxi01.corp.contoso.com
Sub-Cluster Membership UUID: 665dbc18-5bde-4cb6-a510-7c5185c78f3d
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: dadf3e7c-8162-4815-9d02-08af4d8c4c7b 2 2020-10-29T06:29:11.652
[root@esxi03:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
Enabled: true
Current Local Time: 2020-10-29T07:09:48Z
Local Node UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
Sub-Cluster Backup UUID: 04e3bd93-2846-4474-bae7-e16b602e316f
Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
Sub-Cluster Membership Entry Revision: 2
Sub-Cluster Member Count: 2
Sub-Cluster Member UUIDs: 67874ba3-8fd5-463f-80fb-6a82910c5ff2, 04e3bd93-2846-4474-bae7-e16b602e316f
Sub-Cluster Member HostNames: esxi03.corp.contoso.com, esxi02.corp.contoso.com
Sub-Cluster Membership UUID: 3b5c9a5f-3063-68bb-eafc-0c42a1719576
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: dd0af2e3-d7e0-4407-9a50-d87be61513b3 9 2020-10-22T08:59:00.661
We indeed had a split brain. Next, kick esxi01 out of its imaginary one-node cluster (a very slow process; have some patience), and re-join it using the correct sub-cluster UUID taken from the other hosts' config:
[root@esxi01:~] esxcli vsan cluster leave
[root@esxi01:~] esxcli vsan cluster join -u 3a02d572-728d-482b-a94d-2245a6ec99d1
[root@esxi01:~] esxcli vsan cluster list
Cluster Information of 3a02d572-728d-482b-a94d-2245a6ec99d1
Enabled: true
Current Local Time: 2020-10-29T07:09:55Z
Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Member HostNames: esxi01.corp.contoso.com
Sub-Cluster Membership UUID: ab6a9a5f-2401-89af-99aa-0c42a171e24e
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: None 0 0.0
A vCenter-configured vSAN cluster runs in unicast mode (i.e. peer discovery depends on the IP list pushed by the control plane), so we also need to synchronize the cluster’s IP address list on every host. First, verify that the VMkernel adapter for vSAN is set up on esxi01:
[root@esxi01:~] esxcli vsan network list
Interface
VmkNic Name: vmk2
IP Protocol: IP
Interface UUID: 699fe1e6-eaba-49db-9d04-8859ed2b066f
Agent Group Multicast Address: 224.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 224.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Data-in-Transit Encryption Key Exchange Port: 0
Multicast TTL: 5
Traffic Type: vsan
If you don’t see the “vsan” traffic type in the output, reconfigure your VMkernel adapter. Since esxi02 and esxi03 already know each other, we can collect the list from those 2 hosts…
[root@esxi02:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- ----------------------------------------------------------- --------------
67874ba3-8fd5-463f-80fb-6a82910c5ff2 0 true 192.168.1.201 12321 73:F4:93:D8:D8:2A:C0:D3:4F:A6:DF:4D:3D:BE:34:8C:15:D9:45:52 3a02d572-728d-482b-a94d-2245a6ec99d1
9f7326ad-f815-45b1-a809-ece25fddc7ec 0 true 192.168.1.215 12321 05:B1:CF:D5:09:6A:05:7C:D7:C4:69:69:7A:85:04:90:51:D4:9A:D6 3a02d572-728d-482b-a94d-2245a6ec99d1
[root@esxi03:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- ----------------------------------------------------------- --------------
9f7326ad-f815-45b1-a809-ece25fddc7ec 0 true 192.168.1.215 12321 05:B1:CF:D5:09:6A:05:7C:D7:C4:69:69:7A:85:04:90:51:D4:9A:D6 3a02d572-728d-482b-a94d-2245a6ec99d1
04e3bd93-2846-4474-bae7-e16b602e316f 0 true 192.168.1.160 12321 6D:E4:62:CA:FB:17:96:41:97:F4:22:B9:8F:D8:B2:5E:93:0F:79:0D 3a02d572-728d-482b-a94d-2245a6ec99d1
then play them back onto esxi01 (if you have a vSAN witness appliance, the arguments change slightly; see the sketch after the listing below):
[root@esxi01:~] esxcli vsan cluster unicastagent add -a 192.168.1.201 -U true -u 67874ba3-8fd5-463f-80fb-6a82910c5ff2 -t node
[root@esxi01:~] esxcli vsan cluster unicastagent add -a 192.168.1.160 -U true -u 04e3bd93-2846-4474-bae7-e16b602e316f -t node
[root@esxi01:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name Cert Thumbprint SubClusterUuid
------------------------------------ --------- ---------------- -------------- ----- ---------- --------------- --------------
67874ba3-8fd5-463f-80fb-6a82910c5ff2 0 true 192.168.1.201 12321 3a02d572-728d-482b-a94d-2245a6ec99d1
04e3bd93-2846-4474-bae7-e16b602e316f 0 true 192.168.1.160 12321 3a02d572-728d-482b-a94d-2245a6ec99d1
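For completeness, the witness variant I alluded to would look roughly like this; I had no witness appliance in this cluster, so treat it as an untested sketch based on the command's documented options:
# Witness entries use -t witness instead of -t node (untested in my environment)
esxcli vsan cluster unicastagent add -t witness -u <witness_node_uuid> -U true -a <witness_ip> -p 12321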
As esxi01’s IP addresses have not changed, no changes are needed on the other 2 hosts. Let’s verify that vSAN is up and running again.
[root@esxi01:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2020-10-29T07:15:40Z
Local Node UUID: 9f7326ad-f815-45b1-a809-ece25fddc7ec
Local Node Type: NORMAL
Local Node State: AGENT
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 67874ba3-8fd5-463f-80fb-6a82910c5ff2
Sub-Cluster Backup UUID: 04e3bd93-2846-4474-bae7-e16b602e316f
Sub-Cluster UUID: 3a02d572-728d-482b-a94d-2245a6ec99d1
Sub-Cluster Membership Entry Revision: 3
Sub-Cluster Member Count: 3
Sub-Cluster Member UUIDs: 67874ba3-8fd5-463f-80fb-6a82910c5ff2, 04e3bd93-2846-4474-bae7-e16b602e316f, 9f7326ad-f815-45b1-a809-ece25fddc7ec
Sub-Cluster Member HostNames: esxi03.corp.contoso.com, esxi02.corp.contoso.com, esxi01.corp.contoso.com
Sub-Cluster Membership UUID: 3b5c9a5f-3063-68bb-eafc-0c42a1719576
Unicast Mode Enabled: true
Maintenance Mode State: OFF
Config Generation: 9f7326ad-f815-45b1-a809-ece25fddc7ec 2 2020-10-29T07:15:25.0
Yay!
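Before touching anything else, it doesn’t hurt to ask vSAN how the objects themselves look. Assuming your build has the esxcli vsan debug namespace (which I believe exists on vSAN 6.6 and later), a quick check from any host:
# Summarize object health across the cluster (healthy / reduced availability / inaccessible)
esxcli vsan debug object health summary get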
The rest of the steps are pretty straightforward. The key takeaway here is: to join a host to a cluster, it must either be in maintenance mode (i.e. with all VMs shut off) or have only vCenter running on it. All the other steps are there to solve the chicken-and-egg problem.
- Shut down all the VMs running on the hosts, if you haven’t already done so
- Find the node with the oldest CPU (assume it is esxi01), and if possible, connect a temporary non-vSAN datastore (NFS or a local storage device)
- Install vCenter onto esxi01 using the temporary datastore
- Set up vCenter (networking, admin user, certificate)
- Add esxi01 to the vCenter, put it in a new cluster, you can use the cluster quickstart wizard but do not let it configure networking for you
- Enable VMware EVC on the new cluster
- If you have a backup for distributed switch config, restore it; otherwise configure a new distributed switch
- Add another host (say, esxi02) to the vCenter, but do not add it to a cluster yet
- Add esxi02 to the distributed switch and migrate all adapters
- vMotion the vCenter VM to esxi02
- Add esxi01 and esxi03 to the distributed switch and migrate all adapters
- Go to the web portal of esxi02 and esxi03, put them into maintenance mode, and set the vSAN migration mode to “no data migration” (do not use vCenter to put them into maintenance mode, as that will cause vSAN to evict data; also, this will temporarily block all requests to the vSAN datastore, so make sure nothing is running on it; a CLI sketch follows this list)
- Add esxi02 and esxi03 to the cluster and configure the cluster in the quickstart wizard
- If this caused vSAN to move some data back and forth, wait for the migration to finish
- Verify that all objects in vSAN are readable, and try restarting the VMs
- vMotion the vCenter back onto the vSAN datastore
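As for the maintenance-mode step above, the host shell can do the same thing as the host web portal; a rough equivalent, assuming the --vsanmode option present on these builds:
# Enter maintenance mode without moving any vSAN data (same as "no data migration" in the UI)
esxcli system maintenanceMode set -e true --vsanmode noAction
# Leave maintenance mode once the host has been added to the new cluster
esxcli system maintenanceMode set -e false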
Now we have a new vCenter server and a new cluster good to go.
Cleaning Up
If you still want to configure vSAN from vCenter later, first execute the following command on every ESXi host:
esxcfg-advcfg -d /VSAN/IgnoreClusterMemberListUpdates
This allows the vSAN agent to receive further configuration from the vCenter. Then let vCenter synchronize once with all the hosts: Cluster -> Monitor -> Skyline Health -> vCenter state is authoritative -> click on “UPDATE ESXI CONFIGURATION”.
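Equivalently, setting the value back to 0 by hand does the same thing as the -d form above:
# Stop ignoring vCenter's membership updates (0 is the default)
esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates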
If you have custom storage policies, you can restore them from the vCenter appliance shell using the Ruby vSphere Console (RVC):
Command> rvc administrator@vsphere.local@localhost
vsan.recover_spbm /localhost/<datacenter_name>/computers/<cluster_name>
The vSAN default policy will be created automatically.
If you have any inaccessible objects, SSH into one of the hosts containing that object and delete it manually:
/usr/lib/vmware/osfs/bin/objtool delete -f -v 10 -u <object_uuid>
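If you are not sure which objects are inaccessible in the first place, the debug namespace mentioned earlier can list them along with their health; I believe the invocation is roughly:
# List vSAN objects and their health; note the UUIDs of anything reported as inaccessible
esxcli vsan debug object list --all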
The following things will need to be rebuilt by hand in the new vCenter:
- users, groups, permissions
- content libraries
- host profiles
- HA & DRS
- VM rules
If you have vSAN file services configured, you might need to re-enable them from vCenter. You will need to re-upload the OVAs, and you won’t be able to change the configuration. Note that vSAN file services in 7.0U1 is extremely buggy and locked itself up on my cluster (I couldn’t enable, disable, configure, or use it), so I currently do not recommend using it in production.
If you get an “Unable to connect to MKS” error when connecting to VM consoles on the new vCenter, see the VMware KB article “Unable to connect to MKS” error in vSphere Web Client (2115126).
Final Thoughts
One thing I like about vSphere is its ability to continue functioning without a centralized control plane. HA, multiple-access datastores, and vSAN are all designed around this basic assumption, and it has saved me many times. On the other hand, vCenter is a fragile thing, and vCenter 7.0, with a lot of legacy Java components rewritten in Python, is more fragile than ever.
Always export and back up your distributed switch config, even if you have automated backups for vCenter. This will save you a lot of time if you ever have to set up a new vCenter. If you have vSAN file services configured, failing to restore the old distributed switch after a vCenter rebuild might render the entire file service inaccessible. (If you can’t re-enable it from the vSphere UI, try calling the vCenter API vim.vsan.ReconfigSpec with a different port group; there is a chance it works, but your mileage may vary.)
References
- The Resiliency of vSAN - Recovering my 2-Node Direct Connect While Preserving vSAN Datastore
- Configure 2-Node VSAN on ESXi Free Using CLI Without VCenter
- VMware vSAN cache disk failed and how to recover from it
- vSAN question: Restore VCSA on vSAN
- Administering VMware vSAN (PDF)
- Purge inaccessible objects in VMware vSAN
- Fixing these dratted Unknown vSAN Objects
- Fix orphaned vSAN objects
- VMware VSAN delete/purge inaccessible objects
- VMware® Ruby vSphere Console Command Reference for Virtual SAN (PDF)