Saturday, November 21, 2020

vSphere Network Script


This time I decided to do something related to vSphere networking. While working on cases such as packet drops, network connectivity problems and performance issues, we need to check every aspect of the underlying network associated with the cluster nodes, along with the underlying storage, so that we can tell efficiently whether we have a physical or a logical network issue to troubleshoot.

The attached script is a one-shot solution that gathers all the necessary information in a form that lets us efficiently spot a possible issue, if one exists. The following details can be gathered in a few moments instead of spending hours on manual troubleshooting. The script needs to be run on ESXi.


- ESXi information, e.g. system version and the versions of a few VIBs, and the status of virtual machines running on the node

- Port numbers of the connected clients associated with a particular vSwitch in ESXi.

- Status and stats of ports associated with the vSwitch

- Client name, its MAC & Client port type e.g. 3 (vmkernel), 4 (PNIC) or 5 (virtual NIC)

- Tx and Rx of packets associated to ports related to individual vSwitches.

- Dropped packets associated with them.

- VLAN stats associated to every vmnic e.g. Tx, Rx packets.

- World-ID of running virtual machines.

- Ports associated with the given world-ID of virtual machines.

- vSwitch statistics of packets transmitted and received for ports.

- Received and transmitted packet drops for ports.

- Filter stats and information for ports.

- vSwitch associated with the port and details of the assigned uplink.


Along with the above information, the script also captures the following: the NICs in use, NIC driver information, the advanced driver parameters that can be used to modify or set properties, Rx/Tx ring buffer stats, Rx Mini (for undersized frames), Rx Jumbo (for oversized frames), and packet summary and stats.


#!/bin/sh

echo "//////////////////////////////Version 0.1///////////////////////////////////"

echo "///////////////// This script is created by Kapil Soni /////////////////////"

echo "////////////////////////////////////////////////////////////////////////////"

echo "////////////////////////////////////////////////////////////////////////////"

echo "Running the script...."

sleep 2

echo ""; echo "System information :===========";esxcli system version get | sed 's/^ *//'; echo ""; esxcli hardware platform get;echo "";echo "Some VIBs:===========";esxcli software vib list | grep -iE "marvin|vxrail|nsx|ptagent|ism|vsan";echo "";echo host maintenance status : ;vim-cmd hostsvc/hostsummary | grep -i maintenance | sed 's/ //g';echo "";for vid in $(vim-cmd vmsvc/getallvms | awk '{print $1}' | grep -vi Vmid);do echo vmid : $vid;vim-cmd vmsvc/power.getstate $vid;echo "";done;echo Here is the vm list;esxcli network vm list;echo "";vim-cmd vmsvc/getallvms | awk '{print $1,$2}';echo "";esxcli vsan cluster get;echo "";esxcli vsan network list;echo "";echo "";echo Cluster resync summary : ;esxcli vsan debug resync summary get;esxcli vsan debug object health summary get;echo "";echo vsan health red alerts : ;esxcli vsan health cluster list | grep -i red

echo "";echo =============================================================;echo ""

echo "vmnics detail:";esxcli network nic list; for i in $(esxcli network nic list | grep -i vmnic* | awk '{print $1}');do echo "Gathering the information for:";echo $i;vmkchdev -l | grep $i | awk '{printf "PCIslot "$1 "\n" "VID&DID "$2 "\n" "SVID&SDID "$3 "\n" "Add.Info "$4 " " $5"\n"}';esxcli network nic get -n $i | egrep -i "name|link*|cable|virtual|driver|version";echo Dropped packets and errors are :;esxcli network nic stats get -n $i | grep -iE "vmnic|dropped|errors";echo "";echo "Enabling the VLAN stats on $i for now:=========";esxcli network nic vlan stats set -e true -n $i;echo The VLAN stats for $i:;esxcli network nic vlan stats get -n $i;echo "";echo "Disabling the VLAN stats on $i:===============";esxcli network nic vlan stats set -e false -n $i;echo "";done;echo "";echo IPv4 address associated with VMK ports:;esxcli network ip interface ipv4 get

echo "";echo =============================================================;echo ""

net-stats -l;echo "";for switch in $(net-stats -l | awk '{print $4}' | grep -vi switchname | uniq);do echo "For this switch: $switch:========================";for port in $(net-stats -l | awk '{print $1}' | grep -vi portnum); do echo ""; echo Switch $switch and port $port:;echo "Status :";vsish -e cat /net/portsets/$switch/ports/$port/status 2>/dev/null | grep -i client ;echo "";echo "Stats :";vsish -e cat /net/portsets/$switch/ports/$port/stats 2>/dev/null | grep -iv "packet stats";done;done

echo "";echo =============================================================;echo ""

for nic in $(esxcli network nic list | grep -i vmnic* | awk '{print $1}');do echo "";echo Nic $nic :;echo Max supported ring buffer:;esxcli network nic ring preset get -n $nic;echo "";echo Currently set ring buffer:;esxcli network nic ring current get -n $nic;echo "";echo "Current ring stats of $nic"; vsish -e cat /net/pNics/$nic/stats | grep -iE "rxq[0-9]|txq[0-9]";echo "=========================";done;echo "";echo List of NIC drivers:;for drvr in $(esxcli network nic list | awk '{print $3}' | grep -viE "device|------" | uniq);do echo "====================";echo -n Driver name:;esxcli system module list | grep $drvr | awk '{print $1}';esxcli system module get -m $drvr | sed 's/^ *//';echo "";echo Parameters list:;esxcli system module parameters list -m $drvr;done

echo "";echo =============================================================;echo ""

echo VMs with world id : ; esxcli network vm list;echo "";for vm in $(esxcli network vm list | awk '{print $1}' | grep -viE "world|--------");do echo "For this world id :$vm following are the ports associated :==========" ;esxcli network vm port list -w $vm | grep -iw "port id" | grep -vi "uplink" | sed 's/   //' | awk '{print $3}';for j in $(esxcli network vm port list -w $vm| grep -iw "port id" | grep -vi "uplink" | sed 's/   //' | awk '{print $3}');do echo "";esxcli network port stats get -p $j;echo Port filter stats for $j:;esxcli network port filter stats get -p $j;done;echo "";esxcli network vm port list -w $vm | grep -iEw "port id|vswitch|team uplink|mac address";echo "==========================";done


Snippets:











This post has been about providing a single shell script that collects all the network-related information needed to troubleshoot underlying network issues in a vSphere environment. The script needs to be run in the ESXi shell (or remotely via SSH) to collect the data and identify possible network issues. Because the stdout is lengthy to read through, we can also redirect the script output to a text file.
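As a minimal sketch of the redirect idea, the helper below runs a collection script and keeps a timestamped copy of its output. The function name save_report, the netinfo_ file prefix and the paths in the usage line are my own placeholders, not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper: run a collection script and save everything it prints.
# Usage: save_report <script> [output-dir]
save_report() {
    script="$1"
    outdir="${2:-/tmp}"
    # File name is an assumption: host name plus timestamp keeps runs apart.
    out="$outdir/netinfo_$(uname -n)_$(date +%Y%m%d_%H%M%S).txt"
    # Capture stdout and stderr so command errors end up in the report too.
    sh "$script" > "$out" 2>&1
    # Print the report path so the caller knows where to look.
    echo "$out"
}
```

For example, save_report ./vsphere_network.sh /tmp (both names are placeholders) prints the path of the report, which can then be copied off the host and searched with grep.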

I am open to suggestions, feedback and improvements in order to provide the intended solution. I am working on a few more scripts and also upgrading the existing ones to accommodate a few more actions that ease our day-to-day life. Thank you for reading through. Please be social and share it if you found it useful.










Monday, October 19, 2020

vSAN Part - I

 

This post is strictly related to vSAN (VMware's software-defined storage solution) version 6.7 and earlier. I specify the version because I have had no chance to test this on vSphere 7.x.

vSAN is a cluster-level feature offered by VMware that is tightly integrated with the ESXi kernel to provide a comprehensive storage solution for the vSphere virtual environment. It has its own file system. vSAN stores and manages data in the form of flexible data containers called objects. An object is a logical volume that has its data and metadata distributed across the cluster.

Objects are divided into the following categories:

    VM Home Namespace

    VM Swap object

    VMDK

    Snapshot Delta VMDKs

    Memory objects


Virtual Machine Compliance Status: Compliant and Noncompliant

A virtual machine is considered noncompliant when one or more of its objects fail to meet the requirements of its assigned storage policy. For example, the status might become noncompliant when one of the mirror copies is inaccessible. If your virtual machines are in compliance with the requirements defined in the storage policy, the status of your virtual machines is compliant. From the Physical Disk Placement tab on the Virtual Disks page, you can verify the virtual machine object compliance status. For information about troubleshooting a vSAN cluster, see vSAN Monitoring and Troubleshooting.

The following vSAN terminology relates to objects.

Component State: Degraded and Absent States

vSAN acknowledges the following failure states for components:

Degraded. A component is Degraded when vSAN detects a permanent component failure and determines that the failed component cannot recover to its original working state. As a result, vSAN starts to rebuild the degraded components immediately. This state might occur when a component is on a failed device.

Absent. A component is Absent when vSAN detects a temporary component failure where components, including all its data, might recover and return vSAN to its original state. This state might occur when you are restarting hosts or if you unplug a device from a vSAN host. vSAN starts to rebuild the components in absent status after waiting for 60 minutes.

Object State: Healthy and Unhealthy

Depending on the type and number of failures in the cluster, an object might be in one of the following states:

Healthy. When at least one full RAID 1 mirror is available, or the minimum required number of data segments are available, the object is considered healthy.

Unhealthy. An object is considered unhealthy when no full mirror is available or the minimum required number of data segments are unavailable for RAID 5 or RAID 6 objects. If fewer than 50 percent of an object's votes are available, the object is unhealthy. Multiple failures in the cluster can cause objects to become unhealthy. When the operational status of an object is considered unhealthy, it impacts the availability of the associated VM.

CMMDS compliance config status:






Object health status:

Object Health Status    Description
5                       Healthy
6                       Absent
9                       Degraded
10                      Reconfiguring

 

I have created a script that collects the following details when vCenter (version < 6.7) is not accessible:

    Host maintenance mode status of ESXi being used to run the script.

    ESXi version and cluster hosts UUIDs

    CMMDS member information

    Object health status

    Cluster resync status

    Number of compliant (config status 7) objects.

    List of objects with a config or compliance status other than “7”, along with their CMMDS database information and object attribute details.

    List of unhealthy objects (e.g. reduced availability), along with their CMMDS database information and object attribute details.


Script:


#!/bin/sh

 

echo "////////////////////////////////////////////////////////////////////////////"

echo "///////////////////////////Version 0.1/////////////////////////////////////"

echo "///////////// This script is created by Kapil Soni //////////////////////////"

echo "////////////////////////////////////////////////////////////////////////////"

echo "////////////////////////////////////////////////////////////////////////////"

echo "Running the script...."

sleep 2

echo ""

echo "System information :===========";

esxcli system version get | sed 's/^ *//';

echo ""

echo "Hosts list with UUIDs :"

cmmds-tool find -f json -t HOSTNAME | grep -E "uuid|content" | sed 'N;s/\n/ /' | awk -F \" '{print $10": " $4}'

echo ""

echo "Checking if this host is a part of cmmds :"

cmmds-tool amimember

echo ""

echo host maintenance status : ;vim-cmd hostsvc/hostsummary | grep -i maintenance | sed 's/^ *//; s/ //g ; s/inMaintenanceMode=//';echo ""

esxcli vsan debug object health summary get;echo ""

echo Cluster resync summary : ;esxcli vsan debug resync summary get | sed 's/^ *//g';echo ""

echo "Checking State of objects in vSAN :"

echo "//////////////////////Legends//////////////////////"

echo State 13 :Object Not Recoverable but LAST Good Mirror Available

echo State 12 :No Recovery Possible

echo State 7 :Good

echo State 15:Object Available but Policy not Compliant

echo "////////////////////////////////////////////////////"

echo ""

echo Number of State 7 objects:

cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -A 1 'state\\\":\ 7' | grep uuid | sed 's/"uuid": "//g ; s/",//g ; s/^ *//' | wc -l;echo ""

echo State 12 objects:

cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -A 1 'state\\\":\ 12' | grep uuid | sed 's/"uuid": "//g ; s/",//g ; s/^ *//'

echo "Object detail:"

for i in `cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -A 1 'state\\\":\ 12' | grep uuid | sed 's/"uuid": "//g ; s/",//g ; s/^ *//'`; do echo object $i; cmmds-tool find -f json -u $i | grep -iE "owner|type|state|health" ;echo "";/usr/lib/vmware/osfs/bin/objtool getAttr -u $i | grep -iE "object type|object size|user|class|object capabil*|path";echo ================================; done;echo ""

echo ""

echo State 13 objects:

cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -A 1 'state\\\":\ 13' | grep uuid | sed 's/"uuid": "//g ; s/",//g ; s/^ *//'

echo "Object detail:"

for i in `cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -A 1 'state\\\":\ 13' | grep uuid | sed 's/"uuid": "//g ; s/",//g ; s/^ *//'`; do echo object $i; cmmds-tool find -f json -u $i | grep -iE "owner|type|state|health" ;echo "";/usr/lib/vmware/osfs/bin/objtool getAttr -u $i | grep -iE "object type|object size|user|class|object capabil*|path";echo ================================; done;echo ""

echo ""

echo State 15 objects:

cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -A 1 'state\\\":\ 15' | grep uuid | sed 's/"uuid": "//g ; s/",//g ; s/^ *//'

echo "Object detail:"

for i in `cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -A 1 'state\\\":\ 15' | grep uuid | sed 's/"uuid": "//g ; s/",//g ; s/^ *//'`; do echo object $i; cmmds-tool find -f json -u $i | grep -iE "owner|type|state|health" ;echo "";/usr/lib/vmware/osfs/bin/objtool getAttr -u $i | grep -iE "object type|object size|user|class|object capabil*|path";echo ================================; done

echo ""

echo ================================

echo Objects with reduced availability or unhealthy objects: 

esxcli vsan health cluster get -t "vSAN object health" | grep -i reduced-availability-wit* | awk '{print $3}' | sed 's/,/\n/g'

echo ""

for obj in $(esxcli vsan health cluster get -t "vSAN object health" | grep -i reduced-availability-wit* | awk '{print $3}' | sed 's/,/\n/g');do echo Object $obj; echo Its CMMDS information:;cmmds-tool find -f json -u $obj | grep -EC 5 "CONFIG_STATUS|DOM_OBJECT";echo "";echo Object attributes information:; /usr/lib/vmware/osfs/bin/objtool getAttr -u $obj | grep -iE "object type|object size|user|class|object capabil*|path"; echo =================================================; done
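The script above runs the same cmmds-tool pipeline once per state. As a rough sketch, a single pass can tally how many objects sit in each config state. The function name count_config_states is hypothetical, and it assumes the content lines carry the state as state\": N, exactly as the grep patterns in the script expect, and that grep -o is available in your ESXi shell.

```shell
#!/bin/sh
# Hypothetical helper: tally CONFIG_STATUS states from text on stdin.
# Assumes each content line holds the state as  state\": <number>
count_config_states() {
    grep -o 'state\\":[ ]*[0-9]*' |   # keep only the state fragment
        sed 's/.*[^0-9]//' |          # strip everything but the number
        sort -n | uniq -c             # one "count state" line per state
}

# On an ESXi host (not executed here), reusing the prefilter from the script:
# cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 \
#     | grep 'uuid\|content' | count_config_states
```

Each output line is a count followed by the state number, which can be read against the legend printed by the script (7, 12, 13 and 15).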

 









It can help reduce the manual effort of finding which objects or components are unhealthy or non-compliant, where they reside, what the object type and size are, which component has an issue, and a lot more, as mentioned above. Once we have these details we can decide on the next course of action, such as whether an object or component needs to be recreated or repaired (with the help of VMware engagement).

I will be creating a similar, more complex script to identify the usual issues with a vSAN cluster, along with their solutions.


Please be social and share it in your circle.  Thank you. 



Reference: VMware Docs

Friday, September 4, 2020

vSphere Network Performance Troubleshooting - Part IV

 

We have now come to the end of our vSphere Network Performance Troubleshooting series. Some information about NIC drivers and ring buffers is good to have and is therefore provided in this post. In order to understand ring buffers, it helps to understand DMA as well.

DMA (direct memory access) is a hardware mechanism that allows peripheral components to transfer their I/O data directly to and from main memory without involving the system processor. Use of this mechanism can greatly increase throughput to and from a device, because a great deal of computational overhead is eliminated. Hardware support is required in the form of DMA controllers, and DMA “steals” cycles from the processor. Synchronization mechanisms must be provided to avoid reading stale data from RAM.

The DMA ring allows the NIC to directly access the memory used by the software. The software (the NIC's driver, in the kernel case) allocates memory for the rings and then maps it as DMA memory, so the NIC knows it may access it. TX packets are created in this memory by the software and are read and transmitted by the NIC (usually after the software signals the NIC that it should start transmitting). RX packets are written to this memory by the NIC and are read and processed by the software (usually after an interrupt is issued to signal that there is work).

A ring buffer contains the start and end addresses of buffers in RAM. The TX ring contains the addresses of the buffers in RAM that hold data to be transmitted. The RX ring contains the addresses of the buffers in RAM where the NIC will place received data. NIC ring buffer sizes vary per NIC vendor and NIC grade.

These rings reside in RAM. The TX and RX buffers are also in RAM and are pointed to by the TX/RX rings. A register on the network card holds the location of the ring buffers in RAM. These can be DMA buffers, in which case they are called the DMA TX/RX rings and DMA TX/RX buffers.

Basically, the DMA ring buffers and the TX/RX rings are the same thing. There are two types of DMA ring buffer:

TX ring buffer - used for transmitting data from kernel (NIC driver/software) to device.

RX ring buffer - used for receiving data from device to kernel (NIC driver/software).
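To make the ring idea a little more concrete, here is a toy model in shell arithmetic, not anything ESXi itself runs. It treats the RX ring as a fixed number of slots that the NIC fills and the driver drains, and counts a drop whenever a packet arrives with no free slot. The function name simulate_ring and all the numbers are made up for illustration.

```shell
#!/bin/sh
# Toy model of an RX ring: fixed slots, NIC writes, driver reads.
# Usage: simulate_ring <slots> <arrivals-per-tick> <drained-per-tick> <ticks>
simulate_ring() {
    slots=$1; arrive=$2; drain=$3; ticks=$4
    used=0; dropped=0; t=0
    while [ "$t" -lt "$ticks" ]; do
        # NIC side: every arriving packet needs a free slot, else it is dropped.
        free=$((slots - used))
        if [ "$arrive" -gt "$free" ]; then
            dropped=$((dropped + arrive - free))
            used=$slots
        else
            used=$((used + arrive))
        fi
        # Driver side: process up to <drain> packets and free their slots.
        if [ "$drain" -gt "$used" ]; then
            used=0
        else
            used=$((used - drain))
        fi
        t=$((t + 1))
    done
    echo "dropped=$dropped queued=$used"
}
```

With 4 packets arriving and 2 drained per tick over 4 ticks, simulate_ring 8 4 2 4 reports dropped=2, while simulate_ring 16 4 2 4 reports dropped=0: the larger ring absorbs the same burst, although it would still overflow if the imbalance continued.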

 

The following script captures the NICs in use, NIC driver information, the advanced driver parameters that can be used to modify or set properties, Rx/Tx ring buffer stats, Rx Mini (for undersized frames), Rx Jumbo (for oversized frames), and packet summary and stats.

 

for nic in $(esxcli network nic list | grep -i vmnic* | awk '{print $1}');do echo "";echo Nic $nic :;echo Max supported ring buffer:;esxcli network nic ring preset get -n $nic;echo "";echo Currently set ring buffer:;esxcli network nic ring current get -n $nic;echo "";echo "Current ring stats of $nic"; vsish -e cat /net/pNics/$nic/stats | grep -iE "rxq[0-9]|txq[0-9]";echo "=========================";done;echo "";echo List of NIC drivers:;for drvr in $(esxcli network nic list | awk '{print $3}' | grep -viE "device|------" | uniq);do echo "====================";echo -n Driver name:;esxcli system module list | grep $drvr | awk '{print $1}';esxcli system module get -m $drvr | sed 's/^ *//';echo "";echo Parameters list:;esxcli system module parameters list -m $drvr;echo "";done

 

Output :








You can increase the size of the Ethernet device RX ring buffer if the packet drop rate causes applications to report loss of data, slow performance or timeouts.

The exhaustion of the RX ring buffer causes an increment in counters such as "discard" or "drop" in the output of the NIC stats. The discarded packets indicate that the available buffer is filling up faster than the ESXi kernel or NIC driver can process the packets. Increase the RX ring buffer to reduce a high packet drop rate. Depending on the driver your network interface card uses, changing the ring buffer can briefly interrupt the network connection.
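A cautious way to act on this is to compare the current RX size with the preset maximum and only print the resize command, so nothing changes until you run it yourself. The sketch below assumes the two ring commands used in the script print lines of the form "Max RX: 4096" and "RX: 1024", and that esxcli network nic ring current set accepts -n and -r on your build; please verify both on your host. The helper names ring_value and suggest_rx_resize are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: pull one value out of "Label: value" lines on stdin
# (assumed layout of the esxcli ring preset/current output).
ring_value() {
    awk -F': *' -v key="$1" '{ gsub(/^ +/, "", $1) } $1 == key { print $2 }'
}

# Print (do not run) the command that would raise RX to the preset maximum.
# Usage: suggest_rx_resize <nic> <current-rx> <max-rx>
suggest_rx_resize() {
    nic="$1"; cur="$2"; max="$3"
    if [ "$cur" -lt "$max" ]; then
        echo "esxcli network nic ring current set -n $nic -r $max"
    else
        echo "# $nic RX ring already at maximum ($cur)"
    fi
}

# On an ESXi host (not executed here):
# max=$(esxcli network nic ring preset get -n vmnic0 | ring_value "Max RX")
# cur=$(esxcli network nic ring current get -n vmnic0 | ring_value "RX")
# suggest_rx_resize vmnic0 "$cur" "$max"
```

Printing the command instead of executing it is deliberate, since the change can briefly interrupt the connection and is better done in a maintenance window.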

By this point we should be able to detect the bottleneck or the main issue in the complete vSphere network path, and to identify whether the physical switches need attention for parameters such as MTU settings, or whether there are traffic or bandwidth constraints related to the vSphere Distributed Switch.

There may be more to dig into in order to identify the culprits behind sluggish network performance or latency issues. I am leaving it at this point and will take the topic up again if I find additional actions worth investigating.

This has been a pretty cliché series for me, as all I had to do was transform the manual steps into shell scripts that can then be used to troubleshoot possible network latency, network performance and packet drop issues in a vSphere/VMware SDDC environment.

I know a few of you may wonder about the hefty theory in this last post, unlike my previous ones where I got straight to what you have to do to get something from the ESXi shell, but it is important to understand the basic concepts before you dive into something.

I hope you all had a nice learning experience with this series.

I will be posting on other topics soon, so stay tuned.


Thanks for reading, be social and share it in your circle.

 

 

 





Tuesday, September 1, 2020

vSphere Network Performance Troubleshooting - Part III


As I stated in my last post about using net-stats and vsish (VMkernel Sys Info Shell) to gather useful network-related information, here we will get:

·        Port number

·        Switch name

·        MAC Address

·        Client port status of ports associated with switch

·        Client port stats of ports associated with switch

·        Client name

·        Client port type e.g. 3 (vmkernel), 4 (PNIC) or 5 (virtual NIC)

·        Tx and Rx of packets associated to ports related to individual switches.

·        Dropped packets associated with them.
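The client port type codes in the list above are easy to forget, so a small lookup can turn them into labels when reading the output. This is only a sketch: the function name port_type_name is hypothetical, and it covers just the three codes named in this post.

```shell
#!/bin/sh
# Hypothetical helper: translate a client port type code into a label.
# Only the three codes mentioned in this post are covered.
port_type_name() {
    case "$1" in
        3) echo "vmkernel" ;;
        4) echo "PNIC" ;;
        5) echo "virtual NIC" ;;
        *) echo "unknown ($1)" ;;
    esac
}
```

For example, port_type_name 4 prints PNIC, and any other code is reported as unknown rather than guessed.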


With the following script you can easily get all of this information in one go:

net-stats -l;echo "";for switch in $(net-stats -l | awk '{print $4}' | grep -vi switchname | uniq);do echo "For this switch: $switch:========================";for port in $(net-stats -l | awk '{print $1}' | grep -vi portnum); do echo ""; echo Switch $switch and port $port:;echo "Status :";vsish -e cat /net/portsets/$switch/ports/$port/status 2>/dev/null | grep -i client ;echo "";echo "Stats :";vsish -e cat /net/portsets/$switch/ports/$port/stats 2>/dev/null | grep -iv "packet stats";done;done

Output:








Ports that are not associated with the given switch will not show any information for port stats and port status.

The last part of this series will contain another useful script that digs out NIC driver information, the associated ring buffers, the default ring configuration and the advanced module parameters for the given NIC drivers.
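Those empty entries appear because the inner loop walks every port for every switch. One possible way to avoid them, sketched below, is to read each port together with its own switch from the same net-stats -l line. It assumes, as the script does, that column 1 is the port number and column 4 is the switch name. The function name switch_port_pairs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "switch port" pairs from net-stats -l style
# text on stdin. Assumes column 1 = PortNum and column 4 = SwitchName.
switch_port_pairs() {
    awk 'tolower($1) != "portnum" && NF >= 4 { print $4, $1 }'
}

# On an ESXi host (not executed here):
# net-stats -l | switch_port_pairs | while read -r switch port; do
#     echo "Switch $switch and port $port:"
#     vsish -e cat "/net/portsets/$switch/ports/$port/status" 2>/dev/null | grep -i client
#     vsish -e cat "/net/portsets/$switch/ports/$port/stats" 2>/dev/null | grep -iv "packet stats"
# done
```

Each port is then queried only under the switch it belongs to, which also shortens the output.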

Stay tuned....till then. 

Thanks for reading, be social and share it in your circle if found useful. 




