Top Support Issues and How to Solve Them
Sriram Rajendran
Escalation Engineer
Nov 6 2008
Agenda
Support issues covered
VMFS volumes missing / lost and LUN re-signaturing
ESX host not-responding / disconnected in VirtualCenter
Expanding the size of a VMDK with existing snapshots
Virtualization performance misconceptions
Cannot See My VMFS Volumes
A common support issue – “My VMFS volumes have disappeared!”
In fact, it is often the case that the volumes are seen as snapshots
which, by default, are not mounted.
Why are VMFS volumes seen as snapshots when they are not?
ESX server A is presented with a LUN on ID 0.
The same LUN is presented to ESX server B on ID 1.
VMFS-3 volume created on LUN ID 0 from server A.
Volume on server B will not be mounted when SAN is rescanned.
Server B will state that the volume is a snapshot because of LUN ID
mismatch.
LUNs must be presented with the same LUN IDs to all ESX hosts.
In the next release of ESX, LUN ID is no longer compared if the target
exports NAA type IDs.
How Does ESX Determine If Volume Is A Snapshot?
When a VMFS-3 volume is created, the SCSI Disk ID data from the
LUN/storage array is stored in the volume’s LVM header.
This contains, along with other information, the LUN ID.
When another ESX server finds a LUN with a VMFS-3 filesystem, the
SCSI Disk ID information returned from the LUN/storage array is
compared with the LVM header metadata.
The VMkernel treats a volume as a snapshot if there is a mismatch in
this information.
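The comparison described above can be sketched as a small shell function.
This is a simplified model, not the VMkernel's actual logic: it compares only
the LUN ID, whereas the real SCSI Disk ID data contains other information as
well. The function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Simplified model of the snapshot check: compare the LUN ID recorded in
# the LVM header with the LUN ID the host currently sees.
# Assumption: only the LUN ID is compared; the real check covers more fields.
classify_volume() {
    header_lun_id="$1"     # LUN ID stored when the VMFS-3 volume was created
    presented_lun_id="$2"  # LUN ID reported to this host on rescan
    if [ "$header_lun_id" = "$presented_lun_id" ]; then
        echo "original"
    else
        echo "snapshot"
    fi
}

# Server A created the volume on LUN ID 0 and still sees it as LUN ID 0.
classify_volume 0 0
# Server B sees the same LUN as LUN ID 1, so the volume looks like a snapshot.
classify_volume 0 1
```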
How Should You Handle Snapshots?
First of all, determine if it is really a snapshot:
If it is a mismatch of LUN IDs across different ESX hosts, fix the LUN ID
through the array management software so that the same LUN ID is
presented to all hosts for a shared volume.
Other reasons a volume might appear as a snapshot include changes
in the way the LUN is presented to the ESX host:
• HDS Host Mode setting
• EMC Symmetrix SPC-2 director flag
• Change from A/P firmware to A/A firmware on array
If it is definitely a snapshot, you have two options:
Set LVM.EnableResignature to 1
or
Set LVM.DisallowSnapshotLUN to 0
LVM.EnableResignature
Used when mounting the original and the snapshot VMFS Volumes
on the same ESX.
Set LVM.EnableResignature to 1 and issue a rescan of the SAN.
This updates the LVM header with:
• new SCSI Disk ID information
• a new VMFS-3 UUID
• a new label
Label format will be snap-<generation number>-<label>, or
snap-<generation number>-<uuid> if there is no label, e.g.
• Before resignature: /vmfs/volumes/lun2
• After resignature: /vmfs/volumes/snap-00000008-lun2
Remember to set LVM.EnableResignature back to 0.
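A minimal sketch of the sequence above, assuming the ESX 3.x service console
tools esxcfg-advcfg and esxcfg-rescan. The function only prints the commands
so they can be reviewed before being run by hand; the function name and the
adapter name vmhba1 are placeholders, not values from this deck.

```shell
#!/bin/sh
# Print the resignature sequence for review instead of executing it.
# Assumption: "vmhba1" is a placeholder adapter name; substitute your own.
resignature_plan() {
    adapter="$1"
    echo "esxcfg-advcfg -s 1 /LVM/EnableResignature"   # enable resignaturing
    echo "esxcfg-rescan $adapter"                      # rescan the SAN
    echo "esxcfg-advcfg -s 0 /LVM/EnableResignature"   # set it back to 0
}

resignature_plan vmhba1
```

Printing the plan first makes it harder to forget the last step, which sets
LVM.EnableResignature back to 0.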
LVM.DisallowSnapshotLUN
DisallowSnapshotLUN will not modify any part of the LVM header.
To allow the mounting of snapshot LUNs, set:
EnableResignature to 0 (disable)
and
DisallowSnapshotLUN to 0 (disable)
Do not use DisallowSnapshotLUN to present snapshots back to
the same ESX server.
LVM.EnableResignature overrides LVM.DisallowSnapshotLUN
LVM.EnableResignature
[Diagram: Server A (HBA 1, HBA 2) connects through FC Switch 1 and FC Switch 2
to storage processors SPA and SPB of one array, which presents LUN 0 and its
clone, LUN 1.]
LVM.EnableResignature will have to be used to make the volume located on the
cloned LUN, LUN 1, visible to the same ESX server after a rescan.
Two volumes with the same UUID must not be presented to the same
ESX server. Issues with data integrity will occur.
LVM.EnableResignature OR LVM.DisallowSnapshotLUN
[Diagram: one array (SPA, SPB) holds LUN 0 and its snapshot, LUN 1. Server A
and Server B each connect with HBA 1 and HBA 2 through FC Switch 1 and
FC Switch 2. The snapshot LUN is presented to a different ESX server.]
We can present the snapshot LUN, LUN 1, using DisallowSnapshotLUN = 0 on
Server B as long as Server B cannot see LUN 0.
If Server B can also see LUN 0, then we must use resignaturing since we cannot present
two LUNs with the same UUID to the same ESX server.
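The rule above can be written down as a small decision helper. This is only a
sketch: the function name and the yes/no argument are hypothetical, and where
the host cannot see the original LUN, resignaturing remains a valid
alternative to the setting printed here.

```shell
#!/bin/sh
# Decide how to mount a snapshot LUN on a given host.
# Argument: "yes" if the host can also see the original LUN, otherwise "no".
choose_snapshot_option() {
    sees_original="$1"
    if [ "$sees_original" = "yes" ]; then
        # Two volumes with the same UUID must never reach one host,
        # so the snapshot has to be resignatured.
        echo "LVM.EnableResignature=1"
    else
        # The original is not visible, so the snapshot can be mounted as-is.
        echo "LVM.EnableResignature=0 LVM.DisallowSnapshotLUN=0"
    fi
}

choose_snapshot_option yes
choose_snapshot_option no
```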
LVM.DisallowSnapshotLUN
[Diagram: Production LUN 0 on Storage A is replicated as a remote snapshot
(e.g. SRDF) to LUN 1 on Storage B at the DR site. Server A reaches Storage A
through FC Switch 1 and FC Switch 2; Server B reaches Storage B through
FC Switch 3 and FC Switch 4.]
Since there is not going to be a LUN with the same UUID at the remote site, one can allow snapshots.
ESX host not-responding/disconnected in VirtualCenter
Components involved in the communication between the ESX and VC servers:
VC Server: VPXD
ESX Server: VC agent (VPXA), Host Agent (HostD)
Back to the issue
Customer complaint: the ESX server is seen in a Disconnected or
Not-Responding state in the VC server.
If the ESX host is seen in the Disconnected state, reconnecting the
host will solve the issue.
However, in most cases the ESX host is seen in the “Not-Responding”
state.
• What do we do in this case?
List of things we can do
Verify that network connectivity exists from the VirtualCenter Server to
the ESX Server.
• Use ping to check the connectivity.
Verify that you can connect from the VirtualCenter Server to the ESX
Server on port 902 (if the ESX Server was upgraded from version 2.x,
verify that you can connect on port 905).
• Use telnet to connect to the specified port.
Verify that the hostd agent is running.
Verify that the vpxa agent is running.
Check system resources.
Checking if Hostd is alive
Verify that the ESX Server management service, hostd, is still running on
the ESX server.
How to check whether hostd is alive:
• Connect to the ESX server using SSH from another Windows/Linux box.
• Execute the command vmware-cmd -l
• If the command succeeds, hostd is working fine.
• If the command fails, hostd is not working or has probably stopped.
• Restarting the mgmt-vmware service (service mgmt-vmware restart) will
get the ESX host back online in VirtualCenter Server.
• Once this is done, execute the same command again to confirm that
hostd is working.
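The check above can be wrapped in a small helper that reports on the exit
status of the probe command. This is a sketch: the function name is
hypothetical, and the probe defaults to the vmware-cmd -l call from this
slide. On a machine without vmware-cmd the probe simply fails.

```shell
#!/bin/sh
# Report whether hostd answers, based on the exit status of a probe command.
# With no arguments the probe is "vmware-cmd -l", as on the slide.
hostd_status() {
    if [ "$#" -eq 0 ]; then
        set -- vmware-cmd -l
    fi
    if "$@" >/dev/null 2>&1; then
        echo "hostd OK"
    else
        echo "hostd not responding"
    fi
}

hostd_status
# If this prints "hostd not responding", restart the management service
# (service mgmt-vmware restart) and run hostd_status again.
```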
Checking if Hostd is alive
If hostd fails to start, check the following log for hints to identify
the issue: /var/log/vmware/hostd.log
Note: In most cases, the causes we have seen are:
The / (root) filesystem is full.
There are some rogue VMs registered.
Some of the presented LUNs are either corrupted or do not have a valid
partition table.
If you are not able to proceed any further, file a ticket with VMware
Technical Support.
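The first cause in the list above, a full root filesystem, is quick to rule
out. A minimal sketch using df; the function name is hypothetical and the 95%
threshold is an arbitrary choice for illustration, not a VMware figure.

```shell
#!/bin/sh
# Print the percentage of space used on a filesystem (default: /).
fs_used_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

used=$(fs_used_percent /)
# Assumption: 95% is an arbitrary warning threshold, not a VMware figure.
if [ "$used" -ge 95 ]; then
    echo "WARNING: / is ${used}% full; hostd may fail to start"
else
    echo "/ is ${used}% full"
fi
```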
Checking if VPXA is alive
To verify that the VirtualCenter Agent Service (vmware-vpxa) is running:
Log in to your ESX Server as root, from an SSH session or directly from
the console of the server.
The output appears similar to the following if vmware-vpxa is running:
# ps -ef | grep vpxa
root 24663 1 0 15:44 ? 00:00:00 /bin/sh /opt/vmware/vpxa/bin/vmware-watchdog -s vpxa -u 30 -q 5 /opt/vmware/vpxa/sbin/vpxa
root 26639 24663 0 21:03 ? 00:00:00 /opt/vmware/vpxa/vpx/vpxa
root 26668 26396 0 21:23 pts/3 00:00:00 grep vpxa
The output appears similar to the following if vmware-vpxa is not running:
# ps -ef | grep vpxa
root 26709 26396 0 21:24 pts/3 00:00:00 grep vpxa
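Note that the grep process itself always appears in this output, so a match
alone does not mean vpxa is running. A small sketch that filters it out; the
function name is hypothetical and the sample lines are copied from the output
above.

```shell
#!/bin/sh
# Read `ps -ef` output on stdin and report whether vpxa is running.
# The grep process itself is filtered out so it is not counted as a match.
vpxa_state() {
    if grep vpxa | grep -v 'grep vpxa' | grep -q .; then
        echo "running"
    else
        echo "not running"
    fi
}

# Sample lines taken from the output shown above.
echo 'root 26639 24663 0 21:03 ? 00:00:00 /opt/vmware/vpxa/vpx/vpxa' | vpxa_state
echo 'root 26709 26396 0 21:24 pts/3 00:00:00 grep vpxa' | vpxa_state

# On a live host: ps -ef | vpxa_state
```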
Checking if VPXA is alive
Sometimes the vpxa process becomes orphaned. Restarting the
vmware-vpxa service helps.
How to restart the service:
/etc/init.d/vmware-vpxa restart
If the service fails to start, check the following log to identify the issue:
/var/log/vmware/vpx/vpxa.log
If you are not able to proceed any further, file a ticket with VMware
Technical Support.
Checking System Resources
Sometimes hostd or vpxa fails to start due to a lack of system
resources.
High CPU utilization on an ESX Server: check with esxtop
High memory utilization on an ESX Server: check /proc/vmware/mem
Slow response when administering an ESX Server
Expanding the size of a VMDK with existing Snapshots
You CANNOT expand a VM’s VMDK file while it still has snapshots.
e.g.
#ls *
important.vmdk
important-000001-delta.vmdk
#vmkfstools -X 20G important.vmdk
[Diagram: the data in important.vmdk and important-000001-delta.vmdk, before
and after the expansion.]
Expanding VM With a Snapshot
If you do, you will now have a VM that won’t boot
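A pre-check before running vmkfstools -X can catch this mistake. The sketch
below looks for delta files next to the base disk; the function name is
hypothetical, and it assumes deltas follow the <name>-NNNNNN-delta.vmdk
pattern shown in the listing above.

```shell
#!/bin/sh
# Report whether a base VMDK has snapshot delta files next to it.
# Assumption: deltas follow the <name>-NNNNNN-delta.vmdk pattern shown above.
snapshot_state() {
    base="${1%.vmdk}"
    for f in "$base"-[0-9]*-delta.vmdk; do
        if [ -e "$f" ]; then
            echo "has snapshots"
            return 0
        fi
    done
    echo "no snapshots"
}

# Demonstration in a scratch directory.
dir=$(mktemp -d)
touch "$dir/important.vmdk" "$dir/important-000001-delta.vmdk" "$dir/clean.vmdk"
snapshot_state "$dir/important.vmdk"
snapshot_state "$dir/clean.vmdk"
rm -rf "$dir"
```

Only expand the disk when the helper reports no snapshots.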
Expanding VM With a Snapshot
Tricking ESX into seeing the expanded VMDK as the original size.
In this example we have a test.vmdk that we expand from 5GB to 6GB
#vmkfstools -X 6G test.vmdk
Expanding VM With a Snapshot
If we check test.vmdk we see
# Disk DescriptorFile
version=1
CID=3f24a1b3
parentCID=ffffffff
createType="vmfs"
# Extent description
RW 12582912 VMFS "test-flat.vmdk"
# The Disk Data Base
#DDB
ddb.virtualHWVersion = "4"
ddb.geometry.cylinders = "783"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.adapterType = "buslogic"
Expanding VM With a Snapshot
Original - RW 10485760 VMFS "test-flat.vmdk"
New - RW 12582912 VMFS "test-flat.vmdk"
If we have no backups, how do we get the original value?
#grep -i rw test-000001.vmdk
RW 10485760 VMFSSPARSE "test-000001-delta.vmdk"
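The RW values above are sector counts. Assuming 512-byte sectors, a quick
conversion confirms that 10485760 sectors is the original 5GB and 12582912
sectors is the expanded 6GB. The helper names below are hypothetical.

```shell
#!/bin/sh
# Extract the extent size (in sectors) from a VMDK descriptor on stdin.
rw_sectors() {
    awk '$1 == "RW" { print $2; exit }'
}

# Convert a sector count to GiB, assuming 512-byte sectors.
sectors_to_gib() {
    echo $(( $1 * 512 / 1073741824 ))
}

sectors_to_gib 10485760   # original size, prints 5
sectors_to_gib 12582912   # size after vmkfstools -X 6G, prints 6

# On the host: rw_sectors < test-000001.vmdk
```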
Expanding VM With a Snapshot
We change test.vmdk RW value.
# Disk DescriptorFile
version=1
CID=3f24a1b3
parentCID=ffffffff
createType="vmfs"
# Extent description
RW 10485760 VMFS "test-flat.vmdk"
# The Disk Data Base
#DDB
ddb.virtualHWVersion = "4"
ddb.geometry.cylinders = "783"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.adapterType = "buslogic"
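The edit to the RW line can be done with sed rather than by hand. This is a
sketch with a hypothetical function name; it reads the descriptor on stdin
and writes the edited copy to stdout, so the original file is never modified
in place.

```shell
#!/bin/sh
# Rewrite the sector count on the RW extent line of a descriptor.
# Reads the descriptor on stdin and writes the edited copy to stdout.
set_rw_sectors() {
    sed "s/^RW [0-9][0-9]* /RW $1 /"
}

echo 'RW 12582912 VMFS "test-flat.vmdk"' | set_rw_sectors 10485760
```

On a host, one way to use it is to copy test.vmdk to a backup name of your
choice first, then redirect the helper's output from that copy back into
test.vmdk.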
Expanding VM With a Snapshot
Commit the snapshot(s)
#vmware-cmd /pathtovmx/test.vmx removesnapshots
Grow the VMDK file
#vmkfstools -X 6G test.vmdk
If needed, add a snapshot
#vmware-cmd /pathtovmx/test.vmx createsnapshot <name> <description>
Virtualization Performance Myths
CPU affinity
Virtual SMP performance
Ready time
Transparent page sharing
Memory over-commitment
Memory Ballooning
NIC Teaming
Hyperthreading
CPU affinity
Myth: Set CPU affinity to improve VM performance
CPU affinity implications
CPU affinity restricts scheduling freedom. The VM will accrue ready time if
the pinned CPU is not available for scheduling
On a NUMA system, setting CPU affinity disables NUMA scheduling. VM
performance will suffer if memory is allocated on the remote node
On a hyperthreaded system, CPU affinity binds the VM to a logical CPU
ESX tries to balance interrupts. Setting CPU affinity to a physical CPU
where interrupts occur frequently can impact performance
Fact: Setting CPU affinity could impact performance
Virtual SMP Performance
Myth: Virtual SMP improves performance of all CPU-bound applications
SMP performance
A single-threaded application cannot use more than one CPU at a time
A single thread may ping-pong between the virtual CPUs
This incurs virtualization overhead; pinning the thread to a vCPU
helps in this case
Co-scheduling overhead
Multiple idle physical CPUs may not be available when the VM
wants to run (the VM may accumulate ready time)
Fact: Virtual SMP does not improve performance of single-threaded
applications
Ready Time (1 of 2)
Myth: Ready time should be zero when CPU usage is low
VM state
Run: running (%used)
Wait: waiting (%twait)
Ready: ready to run (%ready)
When does a VM go to the “ready to run” state?
Guest wants to run or needs to be woken up (to deliver an interrupt)
CPU unavailable for scheduling the VM
Ready Time (2 of 2)
Factors affecting CPU availability
CPU over-commitment
Even idle VMs have to be scheduled periodically to deliver timer
interrupts
NUMA constraints
NUMA node locality gives better performance
Burstiness: inter-related workloads
Tip: Use host anti-affinity rules to place inter-related workloads on
different hosts
Co-scheduling constraints
CPU affinity restrictions
Fact: Ready time could exist even when CPU usage is low
Transparent Page Sharing
Myth: Disable Transparent page sharing to improve performance
Transparent page sharing
VMkernel scans memory for identical pages and collapses them into a
single page
Copy-on-write is performed if a shared page is modified
OSes have static code pages that rarely change
The number of shared pages becomes significant with more consolidation
Huge win for memory over-commitment
The default scanning rate is low and incurs negligible overhead (<1%)
Fact: Transparent page sharing does not affect performance adversely,
and it improves performance under memory over-commitment
Memory Ballooning
Myth: Disable balloon driver to increase VM performance
How Ballooning works
The balloon driver gives memory to, or takes it away from, the guest under
memory pressure
Rate of memory reclamation is determined by memory shares
In the default configuration, memory shares are proportional to VM
memory size
Memory is reclaimed forcibly by swapping if the balloon driver is not
installed
Tip: To avoid swapping/ballooning, use a memory reservation
Fact: Disabling the balloon driver severely affects VM performance under
memory over-commitment
Hyperthreading
Myth: Hyperthreading hurts ESX performance
Hyperthreading support in ESX
Hyperthreading increases the number of CPUs available for scheduling
SMP VMs use logical CPUs from different physical CPUs whenever
possible
The scheduler temporarily quarantines a VM from a logical CPU if it
misbehaves (cache thrashing)
Frequently misbehaving VMs can be selectively excluded from
hyperthreading with the htsharing=none option
Fact: Hyperthreading improves overall performance by reducing ready
time
Additional References
Best practices Using VMware Virtual SMP
http://www.vmware.com/pdf/vsmp_best_practices.pdf
Performance tuning best practices for ESX server 3
http://www.vmware.com/pdf/vi_performance_tuning.pdf
ESX Resource Management Guide
http://www.vmware.com/pdf/esx_resource_mgmt.pdf