Archive for the ‘VMWare’ Category

Quick one today:

Rarely, but every once in a while, we see this entry in the logs of our production virtual database servers. At first it seems benign, but upon further investigation we learned that it is a recurring issue and there is an explanation for it.

Error Message:

spid125     I/O is frozen on database master. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup

The same message could appear for all databases on this drive.


In short, this occurs when VSS backups are taken through VDI.

Long Story:

Sql Server exposes several APIs that enable independent software vendors to integrate Sql Server into their own products. Through these APIs, they carry out the necessary functionality while maintaining the reliability, feature-functionality and performance of Sql Server. One such API is VDI (Virtual Device Interface). It works with the VSS (Volume Shadow Copy Service), through which 3rd party tools can perform backup & restore operations using BACKUP ... WITH SNAPSHOT.

Internals of VSS in Sql Server (from Microsoft blog)


When such a command (BACKUP ... WITH SNAPSHOT) is executed through the VDI interface, Sql Server needs to make sure that the backup is reliable and consistent. Since the data files are under constant lock by Sql Server, with continuous updates going on, Sql Server has to temporarily freeze all I/O to the data files so a volume snapshot can be secured.

Once that is complete, the same 3rd party tool tells Sql Server through VDI that I/O can resume. The freeze is usually short, but depending on the size of the file and the throughput of the drive, it can sometimes take longer.

In our environment, we see I/O frozen for a few seconds. This is not good, so we schedule these snapshot backups during the daily scheduled down times.

2016-02-19 18:36:36.37 spid125 I/O is frozen on database master. No user action is required. However, if I/O is not resumed promptly, you could cancel the backup.
2016-02-19 18:36:47.91 spid125 I/O was resumed on database master. No user action is required.
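To see how long each freeze actually lasts, the frozen/resumed pairs in the errorlog can be matched up. Here is a minimal Python sketch of the idea; the log lines are the samples from above, and on a real server you would read them from the Sql Server ERRORLOG file instead:

```python
import re
from datetime import datetime

# Sample lines taken from this post; on a real server, read these
# from the Sql Server ERRORLOG file instead.
log_lines = [
    "2016-02-19 18:36:36.37 spid125 I/O is frozen on database master. "
    "No user action is required. However, if I/O is not resumed promptly, "
    "you could cancel the backup.",
    "2016-02-19 18:36:47.91 spid125 I/O was resumed on database master. "
    "No user action is required.",
]

pattern = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2}) \S+ "
    r"I/O (?P<event>is frozen|was resumed) on database (?P<db>\w+)\."
)

frozen_at = {}  # database name -> timestamp of the freeze message
for line in log_lines:
    m = pattern.match(line)
    if not m:
        continue
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    if m.group("event") == "is frozen":
        frozen_at[m.group("db")] = ts
    elif m.group("db") in frozen_at:
        duration = (ts - frozen_at.pop(m.group("db"))).total_seconds()
        print(f"{m.group('db')}: I/O frozen for {duration:.2f} seconds")
        # → master: I/O frozen for 11.54 seconds
```

For the sample above, the freeze on master lasted about 11.5 seconds, which is exactly the kind of pause we want to keep out of business hours.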

Since this is internal Sql Server functionality that Microsoft controls, it looks like they are working on making it more efficient in future releases. Some Microsoft tools that rely on this are:

  • Microsoft Backup (ntbackup.exe)
  • Volume Shadow Copy (VSS)
  • Data Protection Manager (DPM)

Some 3rd party tools that use VSS to perform such backup and restore operations are:

  • Symantec Business Continuance Volume
  • Veritas NetBackup


Hope this helps,

Read Full Post »

Quick one today:

A few months ago, in our lab we tried to enable SSH access to our ESXi hosts and these are the steps we followed.

  1. First, if you have access to the ESXi console, hit F2 and log in.
  2. From the System Customization screen, go to ‘Troubleshooting Options’, and hit <Enter>.
  3. In the ‘Troubleshooting Mode Options’ screen, you see a bunch of toggle options.
  4. Go to ‘Enable SSH’ and hit <Enter> (when you hit <Enter>, the option toggles to ‘Disable SSH’).
  5. Exit from each screen.


Hope this helps,

Read Full Post »

Quick one today:

Earlier in the lab, during setup of a new ESXi host, we ran into this cryptic error.

"HostStorageSystem.ComputeDiskPartitionInfo" for object "storageSystem" on ESXi "<ip address>" failed

It looks like the ESXi host was not able to correctly wipe the disk, so we need to manually delete the partition to make it available to the ESXi host. The following steps worked for us:

Step 1: Make sure SSH access is enabled on the ESXi host.

Step 2: Connect to the ESXi host using PuTTY.

Putty to connect to ESXi host through SSH



Step 3: Enter root and password to log in.

Login using Putty


Step 4: Run the command esxcfg-scsidevs -l to list the attached disks and find the name of the disk that you want to wipe.

~ # esxcfg-scsidevs -l
Device Type: Direct-Access
Size: 237824 MB
Display Name: Local DELL Disk (naa.6842b2b0189ae90019c1b00d08f81f9f)
Multipath Plugin: NMP
Console Device: /vmfs/devices/disks/naa.6842b2b0189ae90019c1b00d08f81f9f
Devfs Path: /vmfs/devices/disks/naa.6842b2b0189ae90019c1b00d08f81f9f
Vendor: DELL Model: PERC H700 Revis: 2.30
SCSI Level: 5 Is Pseudo: false Status: on
Is RDM Capable: false Is Removable: false
Is Local: true Is SSD: false
Other Names:
VAAI Status: unsupported

Step 5: Run the “partedUtil get” command to get the partition details of the disk:

partedUtil get /dev/disks/naa.6842b2b0189ae90019c1b00d08f81f9f

66987 255 63 9876546456
1 20 9876546456 165 128
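For reference, here is how we read that output. The first line is the disk geometry and each remaining line is one partition; the field names in this little Python sketch are our own labels, based on our understanding of partedUtil's output, not official terms:

```python
# Sample "partedUtil get" output from above: first line is the disk
# geometry, remaining lines describe one partition each. The field
# names below are our own labels, not official partedUtil terms.
raw = """66987 255 63 9876546456
1 20 9876546456 165 128"""

geometry_line, *partition_lines = raw.splitlines()

# Geometry line: cylinders, heads, sectors per track, total sectors
cylinders, heads, sectors, total_sectors = map(int, geometry_line.split())
print(f"{cylinders} cylinders, {heads} heads, "
      f"{sectors} sectors/track, {total_sectors} total sectors")

# Partition lines: number, start sector, end sector, type, attributes
for part in partition_lines:
    num, start, end, ptype, attrs = map(int, part.split())
    print(f"Partition {num}: sectors {start}-{end}, type {ptype}, attrs {attrs}")
```

The partition number from the first field (here, 1) is what gets passed to “partedUtil delete” in the next step.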

Step 6: Now let's delete this partition using the “partedUtil delete” command.

partedUtil delete /dev/disks/naa.6842b2b0189ae90019c1b00d08f81f9f 1

After running this command successfully, you can now go back to the vSphere Client and recreate the datastore.

Create New Disk/LUN ESXi Host


Hope this helps,

Read Full Post »

Quick one today:

For one of our lab servers, we needed to reconfigure an ESXi host from a dynamic (default) to a static IP address. Following are the steps that worked for us. We are not sure how screenshots could be taken from an ESXi console, so this post has text instructions only.

Go to the physical ESXi machine and perform the following operations.

  • At the ESXi console screen press F2 to go to Customize System/View Logs
  • Login with your credentials (root/password)
  • Using up/down arrows, select “Configure Management Network” and hit enter
  • Select “Network Adapters” and hit enter. Make sure the status says “Connected”.
  • Go to IP configuration and change from dynamic to static IP and hit enter
  • Using up/down arrows and enter key:
    • Enter a valid static IP address in your network.
    • Subnet mask
    • Default Gateway
    • Hit enter for <OK>
  • Select DNS configuration and hit enter to change its settings.
    • Enter Primary DNS Server
    • Alternate DNS Server
    • Enter valid HostName in your network
    • Hit Enter for <OK> (accepting the settings)
  • Press Esc to return to the main screen.
  • Hit “Y” to accept/apply the changes and restart the host for changes to take effect.

Helpful hints: For the Subnet mask, DNS, Default Gateway, etc., I ran ipconfig /all on a different physical machine next to this ESXi host (in the same network) and used the same values.

After restart, the ESXi host shows the new static IP address with Hostname.

Thanks to this post for teaching me.

Hope this helps,

Read Full Post »

Quick one today:

Recently, while troubleshooting an issue with one of our customers, some patterns emerged that pointed to resource contention as the possible cause of their repeated timeout issues. Upon further digging, the CPU usage turned out to be abnormal. The database server was a Virtual Machine with 8 processors allocated, but CPU usage was on only one processor; the rest were idle at 0% usage.

Some lessons were gathered from that experience, out of which today's post on the CPU Affinity mask emerges.

CPU Affinity Mask:

Sql Server uses all CPUs available through the Operating System. It creates schedulers on all the CPUs to make the best use of the resources for any given workload. The Operating System, as part of its multitasking, needs to switch process threads from one processor to another. This is usually not detrimental for normal operations, but for resource-intensive software like Sql Server this context switching undercuts performance, so it is not always a welcome scenario. To minimize it, we can configure the processors so that all the Sql Server load is directed to a pre-selected group of processors. This is achieved using the CPU Affinity Mask.

As you might imagine, this is a hex bit mask. We convert the hex value into an integer and configure Sql Server to lean towards (affinitize to) a group of CPUs.

Let’s imagine a machine with 16 processors. When you assign all the processors to Sql Server and allow it to optimally utilize all the resources, the settings look like this:

CPU Affinity Mask (default)


The options on the top convey that we are enabling Sql Server to fully manage the resource allocation with all the CPUs.

  • Automatically set processor affinity mask for all processors
  • Automatically set I/O affinity mask for all processors

I/O tasks like the Lazywriter, CHECKPOINT, and Log Writer are directly affected by the CPUs dedicated to I/O activities.

NUMA Node 0 and NUMA Node 1 are CPU groups (configured in either hardware or software) that enhance hardware resource utilization. Each NUMA node has multiple CPUs with dedicated memory.

The ‘Processor Affinity’ check box allows you to assign individual CPUs, from each NUMA node, to Sql Server. Since the check box is not checked, none of the CPUs (from that NUMA node) have their affinity set (they are not allocated primarily for Sql Server).

Now, if you want to assign all the processors from NUMA Node 0 to Sql Server, select that check box. To accomplish the same with T-SQL, use a script like this:

-- Since these are advanced options, we first need to enable 'show advanced options' and RECONFIGURE
EXEC sp_configure 'show advanced options', 1
RECONFIGURE

-- Integer value 15 translates to 0x000F (the last 4 bits set to 1); similarly, 240 translates to 0x00F0
EXEC sp_configure 'affinity mask', 15
RECONFIGURE
EXEC sp_configure 'affinity I/O mask', 240
RECONFIGURE

-- Once completed, reset the advanced options
EXEC sp_configure 'show advanced options', 0
RECONFIGURE

The affinity for each processor is set using a 1 or 0 bit flag, where 1 signifies the affinity being set. So each processor has a bit in the mask that controls its affinity.

Let’s say you want to allocate all 8 CPUs in NUMA Node 0 to Sql Server: the first 4 for CPU affinity and the last 4 for I/O affinity. These are the steps to follow:

Imagine setting 1 (bit) for each allocated CPU; Then convert that value into INTEGER.

In binary the result looks like this: 00001111. It translates to 0x0F in hex, which is 15 in integer. This is the value to set for ‘affinity mask’.

Now, let’s apply the same logic to the I/O affinity mask and set the bits for the remaining 4 CPUs to 1.

So, 11110000 in binary translates to 0xF0 in HEX and 240 in integer.
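The bit arithmetic above is easy to verify. Here is a quick Python sketch (our own helper, purely to illustrate the conversion):

```python
def affinity_mask(cpus):
    """Build an affinity mask by setting one bit per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

# CPUs 0-3 for the CPU affinity mask: binary 00001111 = 0x0F = 15
cpu_mask = affinity_mask(range(0, 4))
print(cpu_mask, hex(cpu_mask), bin(cpu_mask))   # → 15 0xf 0b1111

# CPUs 4-7 for the I/O affinity mask: binary 11110000 = 0xF0 = 240
io_mask = affinity_mask(range(4, 8))
print(io_mask, hex(io_mask), bin(io_mask))      # → 240 0xf0 0b11110000
```

The two integers printed (15 and 240) are exactly the values passed to sp_configure below.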

Putting it all together it becomes:

EXEC sp_configure 'affinity mask', 15
EXEC sp_configure 'affinity I/O mask', 240

After setting CPU Affinity Mask


Following is a table from MSDN that shows more examples:

These are affinity mask values for an 8-CPU system.

Decimal value | Binary bit mask | Allow SQL Server threads on processors
            1 | 00000001        | 0
            3 | 00000011        | 0 and 1
            7 | 00000111        | 0, 1, and 2
           15 | 00001111        | 0, 1, 2, and 3
           31 | 00011111        | 0, 1, 2, 3, and 4
           63 | 00111111        | 0, 1, 2, 3, 4, and 5
          127 | 01111111        | 0, 1, 2, 3, 4, 5, and 6
          255 | 11111111        | 0, 1, 2, 3, 4, 5, 6, and 7

Some more points:

  • One byte (8 bits) covers the affinity mask for 8 CPUs; two bytes cover 16 CPUs, 3 bytes 24, and 4 bytes 32 CPUs.
  • For more than 32 processors, use the affinity64 options for the next 32 processors.
  • It is recommended not to use the same CPU for both CPU affinity and I/O affinity.
  • Changing CPU affinity mask does not require restart.
  • Any schedulers assigned to those CPUs will be shut down once the current workload is completed. Any new workload is assigned to the new schedulers (on the assigned CPUs).

For more information read MSDN BoL article.

Hope this helps,

Read Full Post »

More and more database servers are being migrated into virtual machines. The momentum towards virtual machines is strong, with good reason, and as DBAs we need to come up with ways to make sure that our database servers are correctly configured to fully utilize the capabilities of virtual environments. Database servers are usually the last ones to be converted into virtual machines, for obvious reasons, but in general, not all database servers need to be physical machines, or even have a dedicated machine, per se.

A good percentage of the database servers we have are small (to medium) in size, so it would not make much difference if we virtualized them. Though the migration effort has become simpler with better tools available, the task of coming up with appropriate database configurations to fully utilize the resources of a virtual environment while maintaining the same performance (SLA) is non-trivial. More on this subject later, but today, the topic is vCPUs.

What are vCPUs?

Hypervisor software takes full control of the hardware on the host machine and allows system administrators to present it to the virtual machines as needed. This capability allows us to improve hardware resource utilization and get around some license agreement restrictions. For example, Windows 2003 Standard Edition is limited to 4 CPUs, so if you have a machine with 6 or 8 CPUs, the Operating System will only see 4 CPUs. With VMware there is a way around this: the cpuid.coresPerSocket/sockets option.

In VMware, a vCPU appears to the operating system as a single-core, single-socket CPU (over-simplified). vCPUs are scheduled on the logical CPUs, which are hardware execution contexts. So, when you assign 4 vCPUs to a virtual machine, the Operating System on the VM sees them as 4 CPU sockets with a single core each. With the cpuid.coresPerSocket/sockets option, we can instead present the 8 CPUs from the host machine as 4 CPU sockets with 2 cores each (4*2=8 vCPUs) or 2 CPU sockets with 4 cores each (2*4=8 vCPUs).

Example 1:

If you want a virtual machine with 1 socket with 8 cores, then you use the following configuration:

  • vCPU = 8
  • cpuid.coresPerSocket/sockets = 8

Example 2:

If you want a virtual machine with 4 sockets with 2 cores each, then use the following setting:

  • vCPU = 8
  • cpuid.coresPerSocket/sockets = 2
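The arithmetic behind the two examples can be sketched like this (our own illustration of the sockets-times-cores relationship, not a VMware API):

```python
def topology(vcpus, cores_per_socket):
    """Return the (sockets, cores per socket) layout the guest OS will see."""
    if vcpus % cores_per_socket != 0:
        raise ValueError("vCPU count must be a multiple of cores per socket")
    return vcpus // cores_per_socket, cores_per_socket

# Example 1: 8 vCPUs at 8 cores per socket -> 1 socket x 8 cores
print(topology(8, 8))   # → (1, 8)

# Example 2: 8 vCPUs at 2 cores per socket -> 4 sockets x 2 cores
print(topology(8, 2))   # → (4, 2)
```

In both cases the guest sees 8 vCPUs total; only the socket/core split changes, which is what matters for per-socket licensing.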

This is an over-simplified summary of the underlying details, for easy consumption. For a more detailed explanation, please refer to this article on VMWare.

Other considerations:

A physical machine with 4 sockets of 4 cores each is not the same as a Virtual machine with 4 sockets of 4 cores each (using the settings above). A vCPU is close to a core, but not exactly a core; it is a logical execution context (thread) on a core, and a core can have 1 or more logical CPUs (threads). This difference matters for a database server, as they are very resource-intensive machines.

When the database server is processing 3000 or more transactions per second, having available CPUs to process the requests is important. A shared CPU (logical CPU, a.k.a. CPU thread) could become a CPU bottleneck, which might show up as heavy SOS_SCHEDULER_YIELD waits. There are some techniques to overcome these situations, but they need close coordination with System Administrators. As a rule of thumb, small and medium database servers (in terms of traffic, size, payload, etc.) usually make better candidates for virtual machines than VLDBs.


Hope this helps,

Read Full Post »