Adding Microseconds to vRealize Operations Graphs

In my old role as a vSphere administrator for a single company, we upgraded our storage from legacy spinning disk with a small amount of cache. This environment often experienced disk latency greater than 3ms on average and often time much higher than that, getting up to 20ms or higher for 30 seconds at a time, over a thousand times a day. We upgraded to a hybrid array with results of less than a millisecond latency for read and write operations…consistently.

We tracked this with vCenter as well as vCenter Operations Manager. In vCenter, it’s tracked on a per virtual machine instance. We could monitor this by using the vCenter performance tab on a virtual machine and selecting Virtual Disk and monitoring the read and write latency numbers for both Milliseconds and Microseconds, as seen below.

Monitoring in vCenter was great, but it’s only available for live data, or the last hour. That doesn’t help in seeing history. We fixed this by looking into vCenter Ops and in our installation, the numbers were there. Great…history. We can now use this latency number as justification for our new storage selection.

Fast forward to my current role, as a consultant. I recently installed vRealize Operations Manager for a customer. One of the reasons was to monitor disk performance as they evaluated new storage platforms for their virtual desktop environment. As I looked into vCenter, the counters were there, but when I looked into vRealize Ops the counter wasn’t available. What gives?

After contacting my previous coworker, we compared configurations and neither one of us could figure it out. I already had a case open with VMware regarding the vRealize Ops install for an unrelated issue. On a recent call with VMware support, we compared these two vCenter/vRealize Operations manager environments. It really stumped the support engineer as well. After about five minutes, we figured it out. When vCenter Operations Manager was installed, the selection to gather all metrics was selected. vRealize Operations Manager doesn’t give you this options. Instead you need to change the policy.

Here’s How:
Log into vRealize Operations Manager, select a VM, and select the troubleshooting tab. Take a note of the policy that is effect for the VM…if you haven’t tweaked with vROPs, then the policy will be the same for all objects.

Click on this policy, this will link to the administration portion of vROPs focused on the “Policies” section.

Click on the “Policy Library” tab. Drop down the “Base Setting” group, and select the proper policy that was at the top of the virtual machine troubleshooting tab. Then click not the edit pencil.

Once in the “Edit Monitoring Policy” window (shown below), limit the Object type to “Virtual Machine” and filter on “Latency”. This will reduce the object to a single page.

Change the “state” for the Virtual Disk|Read Latency (microseconds) and Virtual Disk|Write Latency (microseconds) to “Local” with a green checkmark.

After about 5 minutes, verify that the data is now displayed in vRealize Ops.

Why is this counter important?
As SSD arrays like Pure Storage and XtremIO or hybrid arrays like Tintri become more prevalent, millisecond latency numbers are near 0 and result in a single flat line and no real data, but microseconds really allows administrators to see any changes in latency that would otherwise be flattened out.

I’d like to thank El Gwhoppo for allowing me to guest author. I hope this tidbit is useful information!

How to Extend XtremIO Volumes

Want to extend an XtremIO volume? Simply right click on it and say modify volume. Pretty darn easy. Was somewhat annoyed to find that this information isn’t exactly readily available from “the Google”, so I decided to write this up after verifying it is a non-destructive process from my EMC home boys. This was performed on an XtremIO running on the 3.0.0 build 44 code. Not sure if it will work for shrinking a volume, but perhaps someone can comment up with that information.

UCS, View, Virtual SAN Reference Architecture: Cool! But did you think about…

I really appreciate the article titled “Myth versus Math: Setting the Record Straight on Virtual SAN” as written by Jim Armstrong. He highlights some of the same points I’m going to make in this post. Virtual SAN is a very scalable solution and can be configured to meet many different types of workloads. Also he notes that reference architectures are a starting point, not complete solutions. I am a big fan of hyper-converged architectures such as Virtual SAN. I’m also a big fan of Cisco based server solutions. However, I’ve also had to correct underperforming designs that were based entirely off of reference architectures, and as such I want to ensure that all the assumptions are accounted for with a specific design. One of the reasons I’m writing this is because I have recently delivered a Horizon View Plan and Design which included a Virtual SAN on Cisco hardware, so I wrestled with these assumptions not too long ago. The secondary reason is because this architecture utilizes “Virtual SAN Ready Nodes”, specifically as designed for virtual desktop workloads. Using this vernacular brings trust and immediate, “Oh neat no thinking required for sizing” to most, but I would argue you still have to put on your thinking cap and make sure you are meeting the big picture solution requirements, especially for a greenfield deployment.

The reference architecture that I recently read is linked below.

http://blogs.vmware.com/tap/2015/01/horizon-6-virtual-san-cisco-ucs-reference-architecture.html

http://www.cisco.com/c/en/us/solutions/collateral/data-center-virtualization/desktop-virtualization-solutions-vmware-horizon-view/whitepaper_C11-733480.pdf

Remember, reference architectures establish what is possible with the technology, not necessarily what will work in your house. Let’s go through the assumptions in the above setup now.

Assumption #1: Virtual Desktop Sizing Minimums are OK

Kind of a “duh”. Nobody is going to use big desktop VMs for a reference architecture because the consolidation ratios are usually worse. If you read this article titled “Server and Storage Sizing Guide for Windows 7 Desktops in a Virtual Desktop Infrastructure” it highlights minimum and recommended configurations for Windows 7, which includes the below table.

Some (not all) of the linked clones were configured in this reference utilize 1.5GB 1 vCPU which is below the recommended value. I’ve moved away from the concept of starting with 1 vCPU for virtual desktops regardless of the underlying clock speed, simply due to the fact that a multi-threaded desktop performs better given that the underlying resources can satisfy the requirements. I have documented this in a few previous blog posts, namely “Speeding Up VMware View Logon Times with Windows 7 and Imprivata OneSign” and “VMware Horizon View: A VCDX-Desktop’s Design Philosophy Update“. In the logon times post, I found that simply bumping from 1 vCPU to 2 reduced 10 seconds in the logon time in a vanilla configuration.

Recommendation: Start with 2 vCPU and at least 2GB from a matter of practice with Windows 7 regardless of virtual desktop type. You can consider bumping some 32 bit use cases down to 1 vCPU if you can reduce the in-guest CPU utilization by doing things like offloading PCoIP with an Apex card, or utilizing a host based vShield enabled AV protection instead of in-guest. Still, you must adhere to the project and organizational software requirements. Size in accordance with your desktop assessment and also POC your applications in VMs sized per your specs. Do not over commit memory.

Assumption #2: Average Disk Latency of 15ms is Acceptable

This reference architecture assumes from a big picture that 15ms of disk latency is acceptable, per the conclusion. My experiences have shown me that between 7ms of average latency is the most I’m willing to risk on behalf of a customer, with the obvious statement “the lower the better”. Let’s examine the 800 linked clone desktop latency. In that example, the average host latency was 14.966ms. Most often times requirement #1 on VDI projects I’m engaged on is acceptable desktop performance as perceived from the end user. In this case, perhaps more disk groups, more magnetic disks or smaller and faster magnetic disks should have been utilized to service the load with reduced latency.

Recommendation: Just because it’s a hyper converged “Virtual SAN Ready Node” doesn’t mean you’re off the hook for the operational design validation. Perform your View Planner or LoginVSI runs and record recompose times on the config before moving to production. When sizing, try to build so that average latency is at or below 7ms (my recommendation), or whatever latency you’re comfortable with. It’s well known that Virtual SAN cache should be at least 10% of the disk group capacity, but hey, nobody will yell at you for doing 20%. I also recommend the use of smaller 15K drives when possible in a Virtual SAN for VDI, since they can handle more IO per spindle, which fits better with linked clones. Need more IO? Time for another disk group per host and/or faster disk. Big picture? You don’t want to dance with the line of “enough versus not enough storage performance” to meet the requirements in a VDI. Size appropriately in accordance with your desktop assessment, beat it up in pre-prod and record performance levels before users are on it. Or else they’ll beat you up.

Assumption #3: You will be handling N+1 with another server not pictured

In the configuration used, N+1 is not accounted for. Per the norm in most reference architectures, it’s taken straight to the limit without thought for failure or maintenance. 800 desktops, 8 hosts. The supported maximum is 100 Desktops/Host on Virtual SAN due to the maximum amount of objects supported per host (3000) and the amount of objects and linked clones generate. So while you can run 800 desktops on 8 hosts, you are officially at capacity. It is noted that several failure techniques were demonstrated in the reference, however the failure of an entire host was not one of them. Simulated failure of SSD Disk, magnetic disk and network were tested but an entire host failure was not simulated.

Recommendation: Size for at least N+1. Ensure you’re taking 10% of the pCores away when running your pCores/vCPU consolidation ratios for desktops to account for the 10% Virtual SAN CPU overhead. Ensure during operational design validation that the failure of a host including a data resync does not increase disk workload in such a fashion that it negatively impacts the experience. In short, power off a host in the middle of a 4 hour View Planner run and watch the vsan.observer for disk latency one hour later during the resync.

Assumption #4: Default policy of FTT=0 for Linked Clones is Acceptable

I disagree with the mindset that FTT=0 is acceptable for linked clones. In my opinion it provides a low quality configuration as the failure of a single disk group can result in lost virtual desktops. The argument in favor of FTT=0 for stateless desktops is “So what? If they’re stateless, we can afford to lose the linked clone disks, all the user data is stored on the network!” Well that’s an assumption and a fairly large shift in thinking especially when trying to pry someone away from the incumbent monolithic array mentality. The difference here is that the availability of the storage on which the desktops reside is directly correlated with the availability of the hosts. If one of the hosts restarts unexpectedly, with FTT=0 the data isn’t really lost, just unavailable while the host is rebooting. However if a disk group suffers data loss, it makes a giant mess of a large number of desktops that you end up having to clean up manually, since the desktops are unable to be removed from the datastore in an automated fashion by View Composer. There’s a reason the KB which describes how to clean up orphaned linked clone desktops is complete with two YouTube videos. This is something I’ve unfortunately done many times, and wouldn’t wish upon anyone. Another point of design concern is that if we’re going to switch the linked clone storage policy from FTT=0 to FTT=1, now we’re inducing a more IO operations which must be accounted for across the cluster during sizing. Another point of design concern is that without FTT=1, the loss of a single host will result in some desktops that get restarted by HA, but also some that may only have their disk yanked. I would rather know for certain that the unexpected reboot of a single host will result only in the state change of desktops located on that host, not a state change of some desktops and the storage availability for others called into question. The big question I have in regards to this default policy is: Why would anyone move to a model that introduces a SPOF on purpose? Most requirements dictate removal of SPOFs wherever possible.

Recommendation: For the highest availability, plan for and utilize FTT=1 for both linked clone and full desktops when using Virtual SAN.

The default linked clone storage policy is pictured below as installed with Horizon View 6.

Assumption #5: You’re going to use CBRC and have accounted for that in your host sizing

Recommendation: Ensure that you take another 2GB of memory off each host when sizing to allow for the CBRC (content based read cache) feature to be enabled.

That’s all I got. No disrespect is meant to the team of guys that put that doc together, but some people read those things as the gold standard when in reality:

AppVolumes: Asking the Tough Questions

I had some of these tough questions about AppVolumes when it first came out. I set out to answer these questions with some of my own testing in my lab. These questions are beyond the marketing and get down to core questions about the technology, ongoing management and integrations. Unfortunately I can’t answer all of them, but these are some of my observations after working in the lab with it and engaging some peers. Much thanks to @Chrisdhalstead for the peer review on this one. Feedback welcome!

Q. Will AppVolumes work running on a VMware Virtual SAN?

A. Yes! My lab is running entirely on the Virtual SAN datastore and AppVolumes has been working great for me on it.

Q. Can I replicate and protect the AppStacks and Writable volumes with vSphere Replication or vSphere Data Protection, or is array based backup/replication the only option?

A. At the time of writing I’m pretty sure the only way to do this is to replicate the AppStacks is with array based replication. Of course, you could create a VM that serves storage to the hosts and replicate that, but that’s not an ideal solution for production. Hopefully more information will come out on this in the near future. This means that currently, there’s no automated supported method that I’m aware of to replicate the volumes when Virtual SAN is used for the AppStacks and writable volumes. Much thanks to the gang on twitter that jumped in on this conversation: @eck79, @jshiplett, @kraytos, @joshcoen, @thombrown

Q. So let’s talk about the DR with AppVolumes. Suppose I replicate the LUNs with array based replication that has my AppStacks and Writable Volumes on it. What then?

A. First of all, this is a big part of the conversation, and from what I hear there are some white papers brewing for this exact topic. I will update this section when they come out. My first thought is for environments that have the highest requirements for uptime which are active/active multi-pod deployments. My gut reaction says that while you can replicate, scan and import AppStacks for the sake of keeping App updating procedures relatively simple, I’m not sure how well that’s going to work with writable volumes when there is a user persona with actively used writable volumes in one datacenter. The same problem exists for DFS, you can only have a user connect from one side at the time, (connecting from different sides simultaneously with profiles is not supported) so we generally need to give users a home base. However, DFS is an automated failover in that replication is constantly synching the user data to the other site and is ready to go. So for that use case, first determine if profile replication is a requirement or if during a failure it’s OK for users to be without profile data. If keeping the data is a requirement, I’d say you should probably utilize a UEM with replicating shares with DFS or something similar. For active/warm standby, the advice is the same. For active/standby models, you’ll have to decide if you’re going to replicate your entire VDI infrastructure and your AppVolumes, or if you’re going to set up infrastructure on both sides.

Q. But after replication how will AppVolumes know what writable volumes belong to what user? Or what apps are in the AppStacks?

A. That information is located in the metadata. It should get picked up when re-scanning the datastore from the Administrative console. I’m sure there’s more to it than that, but I don’t have the means to test a physical LUN replication at the moment. Below is an example of what is inside the metadata files. This is what’s picked up when the datastore is scanned via the admin manager.

Q. What about backups? How can I make sure my AppStacks are backed up?

A. Backing up the SQL database is pretty easy and definitely recommended. To back up the AppStacks, that means that you should consider and evaluate a backup product that will allow for granular backups down to the virtual disk file level on the datastore. This doesn’t include VDP. Otherwise you’re going to have to download the entire directory the old fashioned way, or scp the directory to a backup location on a scheduled basis.

Q. In regards to application application conflicts between AppStacks and Writable Volumes, who wins when different versions of the same app are installed? How can I control it?

A. My understanding is that the last volume that was mounted wins. In the event a user has a writable volume, that is always mounted last. By default, they are mounted in order of assignment according to the activity log in my lab. If there is a conflict between AppStacks, you can change the order in which they are presented. However, you can only change the order for all AppStacks that are assigned to the same group or user. For example, if you have 3 AppStacks and each of them are assigned to 3 different AD groups, there’s no way to control the order of assignment. If 3 AppStacks that are all assigned to the same AD group, you can override the order in which they are presented under the properties of the group.

Q. What happens if I Update an AppStack with an App that a User already installed in their writable volume? For example, if a user installed Firefox 12 in their writable volume, then the base AppStack was updated to include Firefox 23, what happens when the AppStack is updated?

A. My experience in the lab shows that when a writable volume is in play for a user, anything installed there that conflicts will trump what is installed in an AppStack. Also all file associations regardless of applications installed in the underlying AppStacks are trumped by what has been configured in the user profile and persisted in the writable volume. Also, my test showed that if you remove the application from the writable volume and refresh the desktop, the file associations will not change to what is in the base AppStack.

After the update:

Files in the C:\Program Files\ Directory were still version 12.

Q. What determines what is captured via UIA on a user’s writable volume and what is left on the machine?

A. Again, another thing I’m not sure of. I know that there must be some trigger or filter but it is not configurable. I was able to test and verify that all of these files are kept:

All Files in the users’ profile

Wallpapers

Printers

AppData

HKCU Registry Settings

Windows Explorer View Options such as File Extensions, hidden folders

Q. Can updates be performed to the App Stacks without disruption of the users that have the App Stack in use?

A. Yes. While it is true that you must unassign the old version of the AppVolume first, you can choose to wait until next logon/logoff. Then you can assign the new one and also choose to wait until the next logon/logoff. There is a brief moment after you unassign the AppStack that a user might log in and not get their apps, so off hours is probably a good time to do this in prod unless you’re really fast on the mouse.

Q. If one AppStack is associated to one person, what about people that need access to multiple pools or RDSH apps simultaneously?

A. A user can have multiple AppStacks that are used on different machines. You can break it down so they are only attached on computers that start with a certain hostname prefix and OS type.

Q. So if an RDSH computer has an AppStack assigned to the computer, can you use writable volumes for the user profile data?

A. No. It is not supported to have any user based AppStack or writable volumes assigned with a computer based AppStack is assigned to the machine.

Q. So can I still use AppVolumes to present apps with RDSH?

A. Sure you can, it just has to be machine based assignments only; you just can’t use user based assignments or writable volumes at the same time. While we’re on the topic there’s no reason you couldn’t present AppStacks to XenApp farms too, it’s what CloudVolumes originally did. My recommendation is to use a third party UEM product for profiles when utilizing RDSH app delivery, trust me that it’s something you’re going to want granular control over.

Q. What about role based access controls for the AppVolumes management interface? There’s just an administrator group?

A. Correct, there’s no RBAC with the GA release.

Q. Does this integrate with other profile tool solutions?

A. Yes. I’ve used Liquidware Labs Profile Unity to enable automatic privilege elevation for installers located on specified shares or local folders. This way can install their own applications (providing they’re located on a specified share) which will be placed in the writable volume, or they can one-off update apps that have frequent updates without IT intervention right in the writable volume.

Horizon View Composer SQL Login Password Expiration

So the SQL account password expired for your View Composer. Bummer. This must mean you’re here because you’re seeing and googled error messages like this in event Viewer under VMware View Composer:

“[28000][Microsoft][SQL Server Native Client 10.0][SQL Server]Login failed for user ‘<user>’

and this in the %PROGRAMDATA%\VMware View Composer\Logs\vmware-viewcomposer.log

NHibernate.ADOException: cannot open connection —> System.Data.Odbc.OdbcException: ERROR [28000] [Microsoft][SQL Server Native Client 10.0][SQL Server]Login failed for user ‘<user>’. Reason: The password of the account has expired. ERROR [28000] [Microsoft][SQL Server Native Client 10.0][SQL Server]Login failed for user ‘<user>’. Reason: The password of the account has expired.

Good news space heroes, there’s a really well documented VMware KB procedure on how to recover from just such a scenario. Other things to try before doing this procedure:

Verify the SQL server and composer database are online
Verify the View Composer DSN is in place on the composer server
Verify the account is enabled and you can login, testing the DSN with the credentials you used.
Verify the account permissions haven’t changed on the View Composer Database
Verify the database is intact

In the recent case of quickly fixing this that I encountered, we changed the password to the same password and that still didn’t work. Following this procedure of re-creating the SviWebService.exe.config worked. This is because the password is encrypted and stuffed into the config file using the certificate data. The basic steps that are in this article are:

Find the SviWebService.exe.config file in the DRIVE:\Program Files (x86)\VMware View Composer directory
Copy the SviWebService.exe.config file to a temp directory like C:\tmp
Copy the SviWebService.exe.config file to a temp directory like C:\tmp and rename it to SviWebService.exe.config.old
Open a command prompt and cd to DRIVE:\Program Files (x86)\VMware View Composer directory
Open a notepad and paste this command in, make sure to modify the bolded parts to fit your environment:
1. SviConfig -operation=SaveConfiguration -dsnname=DSNNAME -username=USERNAME -password=PASSWORD -passwordchanged=true -NHibernateDialect=NHibernate.Dialect.MsSql2005Dialect -tcpport=18443 -KeyContainerName=SviKeyContainer -keySize=2048 -CertificateThumbprint=”B632769D553604A14361142016189BD5F5722267” -ConfigPath=”C:\tmp\SviWebService.exe.config” -OldConfigPath=”C:\tmp\SviWebService.exe.config.old”
2. You can get the Certificate thumbprint by opening the original SviWebService.exe.config in notepad and searching for CertificateThumbprint.
3. DSNNAME is the name of the DSN as it appears in ODBC administrator. I did a configure on it then a copy paste for ensured accuracy.
Once you run that you should see configuration saved successfully.
1. If you see SviConfig finished with an error. Exit code: 5 you probably did something wrong like copy the wrong config file or didn’t do one of the file copies right. For example I copied the wrong config file the first time, then didn’t rename them exactly right.
2. If you are sure that you copied it right and are still getting an error, sounds like you’ll need to make a backup of the SQL database and do a View Composer uninstall/reinstall.
The final step before starting the service is to copy the modified SviWebService.exe.config file to the DRIVE:\Program Files (x86)\VMware View Composer directory. Works! Huzzah! Now to go talk to that DBA about service account expiration policies…