VSAN Lab issues due to InfiniBand OpenSM failover

This isn’t really a blog post where you will get a recipe on how to implement VMware Virtual SAN (VSAN) or InfiniBand technologies, but more a short account of the troubles I experienced yesterday with my infrastructure. I published a picture on Twitter yesterday that didn’t look too good.

VSAN Infrastructure in bad shape

Cause: The network infrastructure transporting the VSAN traffic became unavailable for 5-6 minutes.

Issue: All VMs became frozen, as all reads and writes were blocked. I powered off all the VMs. Each VM became an Unidentified object, as seen above.

Remediation: Restarted all VSAN hosts at the same time, and let the infrastructure stabilize for about 10 minutes before restarting the first VM.

I got myself into this state because I was messing with the core networking infrastructure in my lab. This was not a VSAN product error, but a side effect of the network loss. After publishing this tweet and picture, I had a dinner that lasted a few hours. When I got home, I simply decided to restart the four VSAN nodes at the same time, let the infrastructure simmer for 10 minutes while looking at the host logs, and then restarted my VMs.

 

Preamble.

Since the beginning of December 2013, I have been running all my VMs directly from my VSAN datastore; no other iSCSI/NFS repository is used. If VSAN goes down, everything goes down (including Domain Controllers, SQL Server and vCenter).

 

Network Issue.

As some of you know, the VSAN traffic in my lab is transported by InfiniBand. Each host has two 20Gbps connections to the InfiniBand switches. My InfiniBand switches are described in my LonVMUG presentation about using InfiniBand in the Lab. An InfiniBand fabric needs a Subnet Manager to control the various entries. I got lucky with my first InfiniBand switch purchase: I got myself a Silverstorm 9024-CU24-ST2 model from 2005.

Silverstorm 9024 chassis

Yet the latest firmware can still be found on Intel’s 9000 Edge Managed Series website, and that firmware, 4.2.5.5.1 from July 2012, now adds a hardware Subnet Manager. This is simply awesome for a switch created in 2005.

Silverstorm 9024

Okay, I digress here… bear with me. Not all InfiniBand switches come with a Subnet Manager; actually only a select few, more expensive switches have this feature. What can you do when you have an InfiniBand switch without a management stack? You run the software version, the Open Subnet Manager (OpenSM), directly on the ESXi host or on a dedicated Linux node.

Yesterday, I was validating a new build of the OpenSM daemon compiled by Raphael Schitz (@Hypervisor_fr) that has some improvements. I had placed the new code on each of my VSAN nodes, and shut down the hardware Subnet Manager to use only the software Subnet Manager. It worked well enough, and I only saw a simple 2 second RDP interruption to the vCenter.

It was only when I attempted to fake the death of the Master OpenSM on my esx13.ebk.lab host that I created enough fluctuation in the InfiniBand fabric to cause an outage, which I estimate lasted between 3 and 5 minutes. But as the InfiniBand fabric transports all my VSAN traffic at high speed, all my VMs became frozen and all I/O was suspended, leaving me only the option to connect directly to the hosts with the vSphere C# Client and wait to see if things would stabilize. Unfortunately, that did not seem to be the case after 10 minutes, so I powered off the running VMs.

Yet each of my hosts was now disconnected from the other VSAN nodes, and the vsanDatastore was not showing its usual 24TB, but only 8TB. A bit of panic set in, and I tweeted about a Shattered VSAN Cluster.

When I came home a few hours later, I simply restarted all four of my VSAN nodes (3 Storage+Compute and 1 Compute-Only), let some synchronization take place, and I was able to restart my VMs.

 

Recommendations

These recommendations only apply if you use VSAN with an InfiniBand backbone to replicate the storage objects across nodes. If you have an InfiniBand switch that supports a hardware Subnet Manager, use it. If you have an unmanaged InfiniBand switch, you need to ensure that the Subnet Manager is kept stable and always available.
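One modest way to keep an eye on Subnet Manager stability is to poll the fabric and notice when the master changes. The sketch below is only an illustration in Python, assuming a Linux node where the `sminfo` tool from infiniband-diags is available. The line format in the sample is written from memory of that tool and should be checked against your own output, the GUIDs are made up, and the function names `parse_sminfo` and `master_changed` are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed layout of one line of `sminfo` output (verify on your own fabric).
SMINFO_RE = re.compile(
    r"sm lid (\d+) sm guid (0x[0-9a-fA-F]+), "
    r"activity count (\d+) priority (\d+) state (\d+)"
)

SM_STATE_MASTER = 3  # InfiniBand SM states: 0 not active, 1 discovering, 2 standby, 3 master


def parse_sminfo(line: str) -> Optional[Dict[str, object]]:
    """Extract the Subnet Manager fields from one line, or None if it does not match."""
    match = SMINFO_RE.search(line)
    if match is None:
        return None
    return {
        "lid": int(match.group(1)),
        "guid": match.group(2).lower(),
        "activity": int(match.group(3)),
        "priority": int(match.group(4)),
        "state": int(match.group(5)),
    }


def master_changed(previous: Optional[dict], current: Optional[dict]) -> bool:
    """Return True when no master answered or the master GUID differs from the last poll."""
    if current is None or current["state"] != SM_STATE_MASTER:
        return True
    return previous is not None and previous["guid"] != current["guid"]


if __name__ == "__main__":
    # Two made-up polls: the master GUID differs between them.
    first = parse_sminfo(
        "sminfo: sm lid 1 sm guid 0x2c9020023a7b1, activity count 4821 priority 0 state 3 SMINFO_MASTER"
    )
    second = parse_sminfo(
        "sminfo: sm lid 4 sm guid 0x2c9020023b0c5, activity count 12 priority 0 state 3 SMINFO_MASTER"
    )
    print(master_changed(first, second))
```

Run from a scheduled job, a check like this would at least tell you that an election took place before the storage traffic starts to suffer.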

If you use InfiniBand as the network backbone for vMotion or other IP over IB traffic, the impact of a software Subnet Manager election is not the same (HA reactivity).

I don’t have a better answer yet, but I know Raphael Schitz (@Hypervisor_fr) has some ideas, and we will test new OpenSM builds for this kind of issue.

 

Your comments are welcome…

 

 

 

vCenter VM Hardware Upgrade results in Hung vCenter services

Yesterday, while moving a new vCenter virtual machine that was created on an ESX 3.5 host to a new ESXi 5.0 host and upgrading it, we found ourselves with a VM that was refusing to start any services.

The virtual machine is running:

  • Windows Server 2008 R2 SP1
  • vCenter 5.0 Update 1
  • SQL Server 2008 R2 SP1 (10.50.2792)
  • and the whole suite of vCenter services (vum, syslog, dump, web service).

The virtual machine was created on an ESX 3.5 host (Build 604481) and was configured as a VM Version 4. The target platform was a new ESXi 5.0 Update 1 host (Build 623860). So we cold migrated the vCenter to the new system, via a shared VMFS3 datastore.

At this point, the virtual machine was running fine as a VM Version 4 on the ESXi 5.0 Update 1.

I then started the upgrade process, with the installation of the VMware Tools, to ensure I had all the proper drivers in the VM. I then powered off the virtual machine, and upgraded the hardware to VM Version 8.

vCenter - VM Version 8

The system restarted, but there was an issue with the various services. I could not open the network settings, and I could not uninstall the VMware Tools as the Windows Installer service was not running. My data and database log disks were not visible, and I could not open the disk management control panel.

After much troubleshooting, restarting the virtual machine in safe mode and various other tests, my colleague found this very interesting Microsoft article, Windows Server 2008 computer hang during startup while “applying computer settings” and services configured to start automatically fail to start: http://support.microsoft.com/kb/2004121

The following two paragraphs are taken from the Microsoft Support Article.

Cause

The problems described in the symptoms section occur because of a lock on the Service Control Manager (SCM) database. As a result of the lock, none of the services can access the SCM database to initialize their service start requests. To verify that a Windows computer is affected by the problem discussed in this article, run the following command from the Command Prompt:

sc querylock

The output below would indicate that the SCM database is locked:

QueryServiceLockstatus - Success
IsLocked : True
LockOwner : .\NT Service Control Manager
LockDuration : 1090 (seconds since acquired)
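If you need to run this check on more than one server, the test can be scripted. The following is a minimal Python sketch and is not part of the Microsoft article: the helper names `scm_is_locked` and `read_querylock` are hypothetical, and the sample text simply mirrors the output quoted above. It only looks for the IsLocked field, so small differences in spacing or letter case should not matter.

```python
import re
import subprocess

# Sample text mirroring the output quoted from the Microsoft article.
SAMPLE_OUTPUT = """QueryServiceLockstatus - Success
IsLocked : True
LockOwner : .\\NT Service Control Manager
LockDuration : 1090 (seconds since acquired)
"""


def scm_is_locked(querylock_output: str) -> bool:
    """Return True when the text reports the SCM database as locked."""
    match = re.search(r"IsLocked\s*:\s*(\w+)", querylock_output, re.IGNORECASE)
    if match is None:
        raise ValueError("no IsLocked field found in the output")
    return match.group(1).lower() == "true"


def read_querylock() -> str:
    """Run `sc querylock` and return its text output (Windows only)."""
    result = subprocess.run(["sc", "querylock"], capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Parse the sample; on a real server, pass read_querylock() instead.
    print(scm_is_locked(SAMPLE_OUTPUT))
```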

Let me fix it myself

You can modify the behavior of HTTP.SYS to depend on another service being started first. To do this, perform the following steps:

  • Open Registry Editor
  • Navigate to HKLM\SYSTEM\CurrentControlSet\Services\HTTP and create the following Multi-string value: DependOnService
  • Double click the new DependOnService entry
  • Type CRYPTSVC in the Value Data field and click OK.
  • Reboot the server

NOTE: Please ensure that you make a backup of the registry / affected keys before making any changes to your system.

After making the registry modification and a final restart, the virtual machine was working again as expected. This was a very strange and bizarre error that I have never heard of anyone else running into. So here it is summarized, and may it be useful someday to someone else…

 

 

 

Create vCenter database quickly with Transact-SQL

Creating new databases for VMware vCenter is something I have to do over and over again. I mostly use Microsoft SQL Server 2008 R2, so here are six quick procedures to simplify the creation and build all your vCenter databases to the same standard. I keep my Transact-SQL scripts in Evernote, so I just need six copy-and-paste operations and my vCenter database is created within 3 minutes. You can find the Transact-SQL to download at the bottom of this post.

My general rule when I create the VMware vCenter database is to have my user database on a separate disk from the operating system. This disk is formatted with a 64K block size. SQL Server generally works with two specific I/O request sizes, 8K and 64K, so a 64K block size is optimum for SQL Server databases (see Disk partition alignment Best Practice for SQL Server). I usually create a directory path for my SQL databases, D:\Microsoft SQL Server, in which I create two directories for the vCenter databases: vcenter-server and vcenter-update-manager.

Microsoft SQL Server directory structure for User Databases

Using the Microsoft SQL Server Management Studio interface we can start a New Query, in which we will add the Transact-SQL code.

SQL Server Management Studio – Open a New Query

Now let’s insert the Transact-SQL script to create the new vcenter-server database. My database settings cap the database at 16GB and grow it in increments of 512MB. The initial size is 1GB. The code below is a bit wide for this blog, but you can find the full Transact-SQL code at the bottom.

USE [master]
GO
-- Data file: 1GB initial size, grows in 512MB steps up to 16GB.
-- Log file: 512MB initial size, grows in 256MB steps up to 2GB.
CREATE DATABASE [vcenter-server] ON PRIMARY
(NAME = N'vcenter-server', FILENAME = N'D:\Microsoft SQL Server\vcenter-server\vcenter-server.mdf', SIZE = 1024MB, MAXSIZE = 16384MB, FILEGROWTH = 512MB)
LOG ON
(NAME = N'vcenter-server_log', FILENAME = N'D:\Microsoft SQL Server\vcenter-server\vcenter-server.ldf', SIZE = 512MB, MAXSIZE = 2048MB, FILEGROWTH = 256MB)
COLLATE SQL_Latin1_General_CP1_CI_AS
GO
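Since the goal is to build every vCenter database to the same standard, the statement can also be generated from a template rather than edited by hand. The Python sketch below is one possible approach, not the script from the downloadable text files: the function name `create_database_sql` and its parameters are hypothetical, the default sizes are the ones used in this post, and reusing those same sizes for the vcenter-update-manager database is an assumption you may want to change.

```python
from typing import Tuple

Sizes = Tuple[int, int, int]  # (initial size, maximum size, growth step) in MB


def create_database_sql(
    name: str,
    root: str = r"D:\Microsoft SQL Server",
    data_mb: Sizes = (1024, 16384, 512),
    log_mb: Sizes = (512, 2048, 256),
) -> str:
    """Return a CREATE DATABASE batch that follows the layout used in this post."""

    def filespec(logical_name: str, file_name: str, sizes: Sizes) -> str:
        size, max_size, growth = sizes
        return (
            f"(NAME = N'{logical_name}', "
            f"FILENAME = N'{root}\\{name}\\{file_name}', "
            f"SIZE = {size}MB, MAXSIZE = {max_size}MB, FILEGROWTH = {growth}MB)"
        )

    return "\n".join([
        "USE [master]",
        "GO",
        f"CREATE DATABASE [{name}] ON PRIMARY",
        filespec(name, f"{name}.mdf", data_mb),
        "LOG ON",
        filespec(f"{name}_log", f"{name}.ldf", log_mb),
        "COLLATE SQL_Latin1_General_CP1_CI_AS",
        "GO",
    ])


if __name__ == "__main__":
    # Emit one batch per database directory created earlier.
    for database in ("vcenter-server", "vcenter-update-manager"):
        print(create_database_sql(database))
        print()
```

The printed text can be pasted into a New Query window in SQL Server Management Studio just like the hand-written version.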

vCenter SQL Database creation with settings

Let’s now change the recovery model of our database to Simple, which suits our needs.

USE [vcenter-server]
GO
ALTER DATABASE [vcenter-server] SET RECOVERY SIMPLE;
GO

vCenter SQL Database alter recovery mode to Simple

Let’s create a dedicated vCenter database user such as vpxdb.

USE [vcenter-server]
GO
CREATE LOGIN [vpxdb] WITH PASSWORD = 'insert-a-password-here', DEFAULT_DATABASE = [vcenter-server], DEFAULT_LANGUAGE = [us_english], CHECK_POLICY = OFF
GO
CREATE USER [vpxdb] FOR LOGIN [vpxdb]
GO
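One detail to watch when replacing the password placeholder: a single quote inside the password ends the Transact-SQL string early unless it is doubled. The Python sketch below shows that quoting rule applied to the login script above. It is only an illustration, the names `quote_literal` and `create_login_sql` are hypothetical, and it assumes the login and database names contain no closing square bracket.

```python
def quote_literal(value: str) -> str:
    """Wrap a value as a T-SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def create_login_sql(login: str, password: str, database: str) -> str:
    """Return the login and user creation batch used in this post."""
    # Assumption: names contain no ']' character, so plain [brackets] are enough.
    return "\n".join([
        f"USE [{database}]",
        "GO",
        f"CREATE LOGIN [{login}] WITH PASSWORD = {quote_literal(password)}, "
        f"DEFAULT_DATABASE = [{database}], DEFAULT_LANGUAGE = [us_english], "
        "CHECK_POLICY = OFF",
        "GO",
        f"CREATE USER [{login}] FOR LOGIN [{login}]",
        "GO",
    ])


if __name__ == "__main__":
    # The example password is made up and contains a quote on purpose.
    print(create_login_sql("vpxdb", "it's-a-secret", "vcenter-server"))
```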

SQL Database vpxdb user creation

Now we let the newly created login connect to the msdb database as well, by creating a matching user there.

USE [msdb]
GO
CREATE USER [vpxdb] FOR LOGIN [vpxdb]
GO

SQL Database vpxdb user login for vCenter Database

We give the newly created vpxdb database user db_owner rights on the [MSDB] database, so that the user can create the SQL Agent jobs in SQL.

USE [msdb]
GO
EXEC sp_addrolemember N'db_owner', N'vpxdb'
GO

SQL Database user vpxdb db_owner rights to MSDB

And last, we give the vpxdb user db_owner rights on the vCenter database.

USE [vcenter-server]
GO
EXEC sp_addrolemember N'db_owner', N'vpxdb'
GO

SQL Database user vpxdb db_owner rights to vcenter-database

You can find all the Transact-SQL code in this simple text file: vCenter-SQL-TransactSQL-database.txt. If you want the same type of Transact-SQL script to help you set up the vCenter Update Manager database, check out this text file: vCenter-Update-Manager-SQL-TransactSQL-database.txt