BashPyVirtualization: October 2020

This post applies strictly to vSAN (VMware's software-defined storage solution) version 6.7 and earlier. I mention the version because I have not had a chance to test this on vSphere 7.x.

vSAN is a cluster-level feature from VMware that is tightly integrated with the ESXi kernel to provide a comprehensive storage solution for vSphere virtual environments. It has its own file system: vSAN stores and manages data in the form of flexible data containers called objects. An object is a logical volume that has its data and metadata distributed across the cluster.

Objects are divided into the following categories:

VM Home Namespace

VM Swap object

VMDK

Snapshot Delta VMDKs

Memory objects

Virtual Machine Compliance Status: Compliant and Noncompliant

A virtual machine is considered noncompliant when one or more of its objects fail to meet the requirements of its assigned storage policy. For example, the status might become noncompliant when one of the mirror copies is inaccessible. If your virtual machines are in compliance with the requirements defined in the storage policy, the status of your virtual machines is compliant. From the Physical Disk Placement tab on the Virtual Disks page, you can verify the virtual machine object compliance status. For information about troubleshooting a vSAN cluster, see vSAN Monitoring and Troubleshooting.

The following vSAN terminology relates to objects.

Component State: Degraded and Absent States

vSAN acknowledges the following failure states for components:

Degraded. A component is Degraded when vSAN detects a permanent component failure and determines that the failed component cannot recover to its original working state. As a result, vSAN starts to rebuild the degraded components immediately. This state might occur when a component is on a failed device.

Absent. A component is Absent when vSAN detects a temporary component failure where components, including all its data, might recover and return vSAN to its original state. This state might occur when you are restarting hosts or if you unplug a device from a vSAN host. vSAN starts to rebuild the components in absent status after waiting for 60 minutes.
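The practical difference between the two failure states is when the rebuild starts. A minimal sketch that encodes the behaviour described above; rebuild_delay_minutes is a hypothetical helper name for illustration, not a vSAN command, and the 60-minute value is simply the figure quoted in the text:

```shell
#!/bin/sh
# Hypothetical helper (not part of vSAN): prints how many minutes vSAN waits
# before it starts rebuilding a component in the given failure state.
rebuild_delay_minutes() {
    case "$1" in
        Degraded|degraded) echo 0 ;;   # permanent failure: rebuild starts immediately
        Absent|absent)     echo 60 ;;  # temporary failure: rebuild starts after 60 minutes
        *) echo "unknown component state: $1" >&2; return 1 ;;
    esac
}
```

For example, `rebuild_delay_minutes Absent` prints 60, while `rebuild_delay_minutes Degraded` prints 0.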

Object State: Healthy and Unhealthy

Depending on the type and number of failures in the cluster, an object might be in one of the following states:

Healthy. When at least one full RAID 1 mirror is available, or the minimum required number of data segments are available, the object is considered healthy.

Unhealthy. An object is considered unhealthy when no full mirror is available or the minimum required number of data segments are unavailable for RAID 5 or RAID 6 objects. If fewer than 50 percent of an object's votes are available, the object is unhealthy. Multiple failures in the cluster can cause objects to become unhealthy. When the operational status of an object is considered unhealthy, it impacts the availability of the associated VM.
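The vote rule above is simple integer arithmetic. A minimal sketch, assuming a hypothetical helper called has_enough_votes that takes the available and total vote counts as arguments. It only covers the vote condition, not the mirror or data segment conditions, and it treats exactly 50 percent as passing purely because the sentence above says "fewer than"; check VMware's documentation for the precise quorum rule.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) unless fewer than 50 percent
# of an object's votes are available.
#   $1 = votes currently available
#   $2 = total votes for the object
has_enough_votes() {
    # "fewer than 50 percent" is the same as available * 2 < total,
    # which avoids fractions in integer-only shell arithmetic.
    [ $(( $1 * 2 )) -ge "$2" ]
}
```

It could be used as `has_enough_votes 1 3 || echo "object is unhealthy"`, which reports the object as unhealthy because only one of three votes is available.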

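CMMDS compliance config status (states as listed in the legend of the script further down):

Config Status	Description
7	Good
12	No recovery possible
13	Object not recoverable, but last good mirror available
15	Object available, but policy not compliant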

Object health status:

Object Health Status	Description
5	Healthy
6	Absent
9	Degraded
10	Reconfiguring
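When reading raw output, it helps to translate these numeric codes back into names. A minimal sketch using only the four codes from the table above; health_status_name is a hypothetical helper name, and any other code is reported as unknown rather than guessed:

```shell
#!/bin/sh
# Hypothetical helper: maps a numeric health status code from the table above
# to its description. Codes not in the table are reported as unknown.
health_status_name() {
    case "$1" in
        5)  echo "Healthy" ;;
        6)  echo "Absent" ;;
        9)  echo "Degraded" ;;
        10) echo "Reconfiguring" ;;
        *)  echo "Unknown ($1)" ;;
    esac
}
```

For example, `health_status_name 6` prints Absent.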

I have created a script that collects the following details when vCenter (version 6.7 or earlier) is not accessible:

Maintenance mode status of the ESXi host on which the script is run.

ESXi version and cluster host UUIDs.

CMMDS member information.

Object health status.

Cluster resync status.

Number of compliant (config status 7) objects.

List of objects whose config or compliance status is other than "7", with their CMMDS database information and object attribute details.

List of unhealthy objects (e.g. reduced availability), with their CMMDS database information and object attribute details.

Script:

#!/bin/sh

echo "////////////////////////////////////////////////////////////////////////////"

echo "///////////////////////////Version 0.1/////////////////////////////////////"

echo "///////////// This script is created by Kapil Soni //////////////////////////"

echo "////////////////////////////////////////////////////////////////////////////"

echo "Running the script...."

sleep 2

echo ""

echo "System information :===========";

esxcli system version get | sed 's/^ *//';

echo ""

echo "Hosts list with UUIDs :"

cmmds-tool find -f json -t HOSTNAME | grep -E "uuid|content" | sed 'N;s/\n/ /' | awk -F \" '{print $10": " $4}'

echo ""

echo "Checking if this host is a part of cmmds :"

cmmds-tool amimember

echo ""

echo "Host maintenance status :"; vim-cmd hostsvc/hostsummary | grep -i maintenance | sed 's/^ *//; s/ //g; s/inMaintenanceMode=//'; echo ""

esxcli vsan debug object health summary get; echo ""

echo "Cluster resync summary :"; esxcli vsan debug resync summary get | sed 's/^ *//g'; echo ""

echo "Checking State of objects in vSAN :"

echo "//////////////////////Legends//////////////////////"

echo "State 13: Object Not Recoverable but LAST Good Mirror Available"

echo "State 12: No Recovery Possible"

echo "State 7: Good"

echo "State 15: Object Available but Policy not Compliant"

echo "////////////////////////////////////////////////////"

echo ""

echo Number of State 7 objects:

echo State 12 objects:

echo "Object detail:"

echo ""

echo State 13 objects:

echo "Object detail:"

echo ""

echo State 15 objects:

echo "Object detail:"

echo ""

echo ================================

echo Objects with reduced availability or unhealthy objects:

esxcli vsan health cluster get -t "vSAN object health" | grep -i 'reduced-availability-wit' | awk '{print $3}' | sed 's/,/\n/g'

echo ""

# For each affected object, print its CMMDS entries and its object attributes.
for obj in $(esxcli vsan health cluster get -t "vSAN object health" | grep -i 'reduced-availability-wit' | awk '{print $3}' | sed 's/,/\n/g'); do
    echo "Object $obj"
    echo "Its CMMDS information:"
    cmmds-tool find -f json -u "$obj" | grep -EC 5 "CONFIG_STATUS|DOM_OBJECT"
    echo ""
    echo "Object attributes information:"
    /usr/lib/vmware/osfs/bin/objtool getAttr -u "$obj" | grep -iE "object type|object size|user|class|object capabil|path"
    echo "================================================="
done
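The per-state sections of the script print only their headings. One way the counts could be produced is sketched below. This is an assumption and not the original script's logic: it supposes that each CONFIG_STATUS entry in the cmmds-tool JSON output carries a field of the form "state": N inside its content. The helper name count_config_states is hypothetical, and it only reads text from standard input, so it can be tried without a vSAN host.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and prints one "<count> <state>"
# line for every distinct "state": N value found in it.
# Assumption: the input contains fields shaped like  "state": 7
count_config_states() {
    grep -o '"state": *[0-9][0-9]*' |   # keep only the state fields
        sed 's/^[^0-9]*//' |            # strip everything before the number
        sort -n |                       # group identical states together
        uniq -c |                       # count each state
        awk '{print $1, $2}'            # normalise to "<count> <state>"
}
```

On a host it might be fed with `cmmds-tool find -f json -t CONFIG_STATUS | count_config_states`; verify the output shape on your own build first, since the field layout is assumed here.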

It can help reduce the manual effort of finding which objects/components are unhealthy or non-compliant, where they reside, what the object type and size are, which component has an issue, and more, as listed above. Once we have these details, we can decide on the next course of action, such as whether an object/component needs to be recreated or repaired (with the help of VMware support).

I will be creating a similar, more complex script to identify common issues with vSAN clusters, along with their solutions.

Please be social and share it in your circle. Thank you.

Reference: VMware Docs


Monday, October 19, 2020

vSAN Part - I
