This post applies strictly to vSAN, VMware's software-defined storage solution, version 6.7 and earlier. I call out the version because I have not had a chance to test this on vSphere 7.x.
vSAN is a cluster-level feature from VMware that is tightly integrated with the ESXi kernel to provide a comprehensive storage solution for vSphere virtual environments. It has its own file system, and it stores and manages data in the form of flexible data containers called objects. An object is a logical volume whose data and metadata are distributed across the cluster.
Objects are divided into the following categories:
VM Home Namespace
VM Swap object
VMDK
Snapshot Delta VMDKs
Memory objects
Virtual Machine Compliance Status: Compliant and Noncompliant
A virtual machine is considered noncompliant when one or
more of its objects fail to meet the requirements of its assigned storage
policy. For example, the status might become noncompliant when one of the
mirror copies is inaccessible. If your virtual machines are in compliance with
the requirements defined in the storage policy, the status of your virtual
machines is compliant. From the Physical Disk Placement tab on the Virtual
Disks page, you can verify the virtual machine object compliance status. For
information about troubleshooting a vSAN cluster, see vSAN Monitoring and
Troubleshooting.
The following vSAN terminology relates to objects and their components.
Component State: Degraded and Absent States
vSAN acknowledges the following failure states for
components:
Degraded. A component is Degraded when vSAN detects a
permanent component failure and determines that the failed component cannot
recover to its original working state. As a result, vSAN starts to rebuild the
degraded components immediately. This state might occur when a component is on
a failed device.
Absent. A component is Absent when vSAN detects a
temporary component failure where components, including all its data, might
recover and return vSAN to its original state. This state might occur when you
are restarting hosts or if you unplug a device from a vSAN host. vSAN starts to
rebuild the components in absent status after waiting for 60 minutes.
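The practical difference between the two states is when the rebuild starts. Here is a minimal sketch of that rule as a shell helper. The function name `rebuild_delay_minutes` is hypothetical and not part of any vSAN tooling, and the 60-minute value is simply the wait described above.

```shell
# Hypothetical helper (not a vSAN command): map a component failure state
# to the number of minutes vSAN waits before it starts a rebuild.
rebuild_delay_minutes() {
    case "$1" in
        Degraded|degraded) echo 0 ;;   # permanent failure: rebuild starts immediately
        Absent|absent)     echo 60 ;;  # temporary failure: rebuild after the 60-minute wait
        *) echo "unknown state: $1" >&2; return 1 ;;
    esac
}

echo "Absent   -> $(rebuild_delay_minutes Absent) minutes"
echo "Degraded -> $(rebuild_delay_minutes Degraded) minutes"
```

Any state other than the two described here is rejected with a non-zero exit code instead of being guessed at.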
Object State: Healthy and Unhealthy
Depending on the type and number of failures in the cluster,
an object might be in one of the following states:
Healthy. When at least one full RAID 1 mirror is
available, or the minimum required number of data segments are available, the
object is considered healthy.
Unhealthy. An object is considered unhealthy when no
full mirror is available or the minimum required number of data segments are
unavailable for RAID 5 or RAID 6 objects. If fewer than 50 percent of an
object's votes are available, the object is unhealthy. Multiple failures in the
cluster can cause objects to become unhealthy. When the operational status of
an object is considered unhealthy, it impacts the availability of the
associated VM.
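The vote rule is easy to check by hand. The sketch below covers only the vote part of the health decision, not the full-mirror or data-segment requirement. The helper name `has_vote_quorum` is hypothetical, and treating exactly 50 percent as "no quorum" is my assumption, since the text above only covers the "fewer than 50 percent" case.

```shell
# Hypothetical helper: succeed (exit 0) only when strictly more than half of
# an object's votes are available. Usage: has_vote_quorum AVAILABLE TOTAL
has_vote_quorum() {
    available=$1
    total=$2
    [ "$total" -gt 0 ] && [ $((available * 2)) -gt "$total" ]
}

# Example: an object with three votes in total (two mirrors plus a witness).
if has_vote_quorum 2 3; then echo "2 of 3 votes: quorum"; fi
if ! has_vote_quorum 1 3; then echo "1 of 3 votes: no quorum"; fi
```

With three votes, losing any single vote still leaves quorum, while losing two does not.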
CMMDS compliance config status:
| Config Status | Description |
|---|---|
| 7 | Good |
| 12 | No recovery possible |
| 13 | Object not recoverable but last good mirror available |
| 15 | Object available but policy not compliant |
Object health status:
| Object Health Status | Description |
|---|---|
| 5 | Healthy |
| 6 | Absent |
| 9 | Degraded |
| 10 | Reconfiguring |
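When reading raw output it helps to translate these numeric codes back into words. A minimal sketch of such a lookup follows. The function name `health_status_name` is hypothetical, and it only knows the four codes listed in the table above.

```shell
# Hypothetical lookup: translate an object health status code into its
# description. Codes not listed in the table are reported as unknown.
health_status_name() {
    case "$1" in
        5)  echo "Healthy" ;;
        6)  echo "Absent" ;;
        9)  echo "Degraded" ;;
        10) echo "Reconfiguring" ;;
        *)  echo "Unknown ($1)" ;;
    esac
}

for code in 5 6 9 10; do
    echo "$code: $(health_status_name "$code")"
done
```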
I have created a script that collects the following details when vCenter (version < 6.7) is not accessible:
Maintenance mode status of the ESXi host on which the script is run.
ESXi version and cluster host UUIDs.
CMMDS membership information.
Object health status.
Cluster resync status.
Number of compliant objects (config status 7).
List of objects whose config or compliance status is other than “7”, along with their CMMDS database information and object attribute details.
List of unhealthy objects (for example, those with reduced availability), along with their CMMDS database information and object attribute details.
Script:
#!/bin/sh
echo "////////////////////////////////////////////////////////////////////////////"
echo "///////////////////////////Version 0.1/////////////////////////////////////"
echo "///////////// This script is created by Kapil Soni //////////////////////////"
echo "////////////////////////////////////////////////////////////////////////////"
echo "////////////////////////////////////////////////////////////////////////////"
echo "Running the script...."
sleep 2
echo ""
echo "System information :===========";
esxcli system version get | sed 's/^ *//';
echo ""
echo "Hosts list with UUIDs :"
cmmds-tool find -f json -t HOSTNAME | grep -E "uuid|content" | sed 'N;s/\n/ /' | awk -F \" '{print $10": " $4}'
echo ""
echo "Checking if this host is a part of cmmds :"
cmmds-tool amimember
echo ""
echo "host maintenance status :"
vim-cmd hostsvc/hostsummary | grep -i maintenance | sed 's/^ *//; s/ //g ; s/inMaintenanceMode=//'
echo ""
esxcli vsan debug object health summary get
echo ""
echo "Cluster resync summary :"
esxcli vsan debug resync summary get | sed 's/^ *//g'
echo ""
echo "Checking State of objects in vSAN :"
echo "//////////////////////Legends//////////////////////"
echo "State 7  : Good"
echo "State 12 : No Recovery Possible"
echo "State 13 : Object Not Recoverable but LAST Good Mirror Available"
echo "State 15 : Object Available but Policy not Compliant"
echo "////////////////////////////////////////////////////"
echo ""
# Print the UUIDs of all objects whose CONFIG_STATUS state equals $1.
# Each uuid line precedes its content line in the cmmds-tool output, so
# grep -B 1 (rather than -A 1) returns the UUID that belongs to the match.
objects_in_state() {
    cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -B 1 'state\\\":\ '"$1" | grep uuid | sed 's/"uuid": "//g ; s/",//g ; s/^ *//'
}

# List the objects in state $1, then print CMMDS and attribute details for each.
report_state() {
    echo "State $1 objects:"
    uuids=$(objects_in_state "$1")
    [ -n "$uuids" ] && echo "$uuids"
    echo "Object detail:"
    for i in $uuids; do
        echo "object $i"
        cmmds-tool find -f json -u "$i" | grep -iE "owner|type|state|health"
        echo ""
        /usr/lib/vmware/osfs/bin/objtool getAttr -u "$i" | grep -iE "object type|object size|user|class|object capabil*|path"
        echo "================================"
    done
    echo ""
}

echo "Number of State 7 objects:"
objects_in_state 7 | wc -l; echo ""
report_state 12
report_state 13
report_state 15
echo ""
echo ================================
echo "Objects with reduced availability or unhealthy objects:"
# Quote the pattern so the shell cannot glob-expand it before grep sees it.
unhealthy=$(esxcli vsan health cluster get -t "vSAN object health" | grep -i 'reduced-availability-with' | awk '{print $3}' | sed 's/,/\n/g')
[ -n "$unhealthy" ] && echo "$unhealthy"
echo ""
for obj in $unhealthy; do
    echo "Object $obj"
    echo "Its CMMDS information:"
    cmmds-tool find -f json -u "$obj" | grep -EC 5 "CONFIG_STATUS|DOM_OBJECT"
    echo ""
    echo "Object attributes information:"
    /usr/lib/vmware/osfs/bin/objtool getAttr -u "$obj" | grep -iE "object type|object size|user|class|object capabil*|path"
    echo "================================================="
done
The script reduces the manual effort of finding which objects or components are unhealthy or noncompliant, where they reside, what their type and size are, which component has the issue, and the other details listed above. Once we have this information, we can decide on the next course of action, for example whether an object or component needs to be recreated or repaired (with the help of VMware).
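The State 7/12/13/15 checks in the script all rely on the same idea: pair each object's uuid line with the CONFIG_STATUS content line that follows it, then filter on the state number. Here is a minimal sketch of that pairing, run against a canned sample instead of a live host. The sample entries, their UUIDs, the exact line layout, and the helper names `sample_entries` and `state_by_uuid` are all assumptions made for illustration.

```shell
# Canned stand-in for the uuid/content line pairs that the script filters out
# of `cmmds-tool find -f python`. The UUIDs and the layout are made up.
sample_entries() {
    cat <<'EOF'
      "uuid": "aaaaaaaa-0000-0000-0000-000000000001",
      "content": "{\"state\": 7, \"CSN\": 4}",
      "uuid": "aaaaaaaa-0000-0000-0000-000000000002",
      "content": "{\"state\": 13, \"CSN\": 9}",
      "uuid": "aaaaaaaa-0000-0000-0000-000000000003",
      "content": "{\"state\": 7, \"CSN\": 2}",
EOF
}

# Print "<state> <uuid>" per entry: remember the most recent uuid line and
# emit it when the content line that follows it is seen.
state_by_uuid() {
    awk -F'"' '
        /"uuid"/ { uuid = $4 }
        /"content"/ {
            s = $0
            sub(/.*state[^0-9]*/, "", s)   # drop everything up to the state number
            sub(/[^0-9].*/, "", s)         # drop everything after it
            print s, uuid
        }'
}

# List the UUIDs of objects that are not in state 7.
sample_entries | state_by_uuid | awk '$1 != 7 { print $2 }'
```

On this sample only the second UUID is printed, because it is the only entry whose state is not 7. Carrying the uuid forward in awk avoids depending on grep context flags to keep each UUID next to its state.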
I will be creating a similar but more complex script to identify common issues with a vSAN cluster, along with their solutions.






