Entering VSAN Maintenance Mode Hangs at 65%

Ran into a weird situation in my lab: entering maintenance mode from the Web Client with a full data evacuation WHILE there's a failed disk in the disk group. It looked like the host was unable to enter maintenance mode because one VM couldn't get moved off of it. The reason? I'm not exactly sure.

The short, not-fully-analyzed answer for me was either don't do a full evacuation, or vMotion the VMs to a different host BEFORE attempting to evacuate it. My situation was also weird because 1 of the 2 magnetic disks in the single disk group was in a failed state on the host I was trying to move away from.

I canceled the task after it ran for 12 hours, moved the powered-off VM with a normal vMotion, then entered maintenance mode AOK. Sorry for the complete lack of detail, but nothing came up on the interwebz for VSAN maintenance mode hanging at 65%.


Posted in Uncategorized

RSAT Tools on an App Volume?

App Volumes is probably one of the coolest technologies out right now. When VMware bought them, the first thing I thought was that we now finally have the means to control the virtual desktop, soup to nuts. We can make minute changes, have versioning, and quickly react to new challenges presented to us in various use cases.

Administrators have long wanted complete, granular control of applications on a desktop, but that control leads to issues when managing at scale with traditional architecture. App Volumes gives us the ability to get down and dirty: to try various setups and ultimately have the flexibility of layering applications. But what happens if what we want to put in an App Volume isn't an application at all? What if you want to enable operating system features like Microsoft Remote Server Administration Tools?

Recommended practice for your provisioning machine is to keep it as clean as possible, putting only the bare essentials in the OS so that the app layer you are capturing contains only the application information. Unfortunately, with RSAT Tools there is no application to install, but rather features that you enable after installing a Windows Update, which can be found here: http://www.microsoft.com/en-us/download/details.aspx?id=7887

To enable RSAT Tools in an App Stack, install the above Windows Update on both the provisioning PC and the gold image used as the parent machine for your VDI pool. Create a new App Stack or update an existing one and attach it to the provisioning machine. Head over to the Control Panel, open Programs and Features, and click Turn Windows features on or off. Under Windows Features, choose Remote Server Administration Tools, and enable the features you want subscribing users to have. At this point you can finish the provisioning and reboot the provisioning machine. The only thing left to do is attach the App Stack to a user desktop and test the functionality.

~ DJ Gillit, VCP5-DCV, VCP5-DT, VCP-NV

Follow me on Twitter: @djgillit


LAN in a CAN 1.0 – VMware ESXi, Multi-WAN pfSense with QoS, Steam Caching, Game Servers

The goal of this blog post is to highlight one of my "off the clock" creations, which has faithfully serviced 4 LAN parties and counting. I wanted to share the recipe I used; I call it LAN-in-a-CAN. The goal of the project was simply to provide a portable platform that can handle LAN parties of up to 50 guests in an easy-to-use, quick-to-set-up configuration, complete with download caching for CDN networks, pre-validated QoS rules to ensure smooth gaming traffic, and a virtual game server capable of hosting any game servers. Everyone in Ohio and the surrounding area: I invite you to come check our LAN out; you can find details at www.forgelan.com.

BIG PICTURE: Roll in, plug in internet, LAN Party GO! Less headache, more time having fun!


First of all, massive props to the following:

  • All the guys at Multiplay.co.uk and their groundbreaking work on the LAN caching config
  • All the guys investing time in pfSense development
  • @Khrainos from DGSLAN for the pfSense help
  • Sideout from NexusLAN for the Posts and Help


Physical Components:

Case: SKB 4U Rolling Rack ~$100

Network Switch: (2x) Dell PowerConnect 5224 24-Port Gigabit Switch ~$70 each

Server Operating System: VMware ESXi 5.5 U2 Standalone – Free

Server Chassis: iStarUSA D-213-MATX Black Metal/ Aluminum 2U Rackmount microATX Server Chassis – OEM ~$70

Server Motherboard: ASUS P8Z77-M LGA 1155 Intel Z77 HDMI SATA 6Gb/s USB 3.0 Micro ATX Intel Motherboard ~$90 (DISCONTINUED)

Server Memory: G.SKILL Ripjaws Series 32GB (4x 8GB DIMMS) ~$240

Server CPU: Intel Core i3-3250 Ivy Bridge Dual-Core 3.5GHz LGA 1155 55W Desktop Processor ~$120

Server PCI-Express NICs: (3x) Rosewill RNG-407 – PCI-Express Dual Port Gigabit Ethernet Network Adapter – 2 x RJ45 ~$35 each

Server Magnetic Hard Drive: Seagate Barracuda STBD2000101 2TB 7200 RPM 64MB Cache SATA 6.0Gb/s 3.5″ Internal Hard Drive -Retail kit ~$90

Server SSD Hard Drive: (2x) SAMSUNG 850 Pro Series MZ-7KE256BW 2.5″ 256GB SATA III 3-D Vertical Internal Solid State Drive (SSD) ~$340

Wireless Access Point: $35

Total Cost: ~$1350 with Shipping

My comments on the configuration at present:

When you think about it, it's not much more than a PC build. As cool as it looks, this is a garage-sale lame-oh build. I was able to save some cash by repurposing older hardware I had lying around, such as the CPU, memory, motherboard and hard drives. When examining the scalability of the solution, the first thing that would have to change is the CPU clock speed and core count, followed immediately by a decent RAID controller. If it were required to service LANs larger than 50-70 in attendance, we'd probably want to change the layout so that the caching server has its own physical box.

Virtual Components:

pfSense Virtual Machine

Purpose: Serves DNS, DHCP, QoS, ISP connectivity and routing.

Operating System: FreeBSD 8.3

Virtual vCPU: 2

Virtual Memory: 2GB

Virtual NIC: 3 vNICS, 2 for WAN connectivity, 1 for LAN connectivity

Virtual Hard Drive: 60GB located on 2TB slow disk

Windows Game Server (Could Also be Ubuntu if Linux Savvy)

Purpose: Serves up all the local game servers, LAN web content, and TeamSpeak, and acts as a TeamViewer point of remote management into the system

Operating System: Windows Server

Virtual vCPU: 2

Virtual Memory: 4GB

Virtual NIC: 1 for LAN connectivity

Virtual Hard Drive: 400GB located on 2TB slow disk

Nginx Caching Server

Purpose: Caches data for Steam, Blizzard, RIOT and Origin CDN networks. I’ve only gotten Steam to actually cache properly. The others proxy successfully, but do not cache.

Operating System: Ubuntu Server 14.04, nginx 1.7.2

Virtual vCPU: 4

Virtual Memory: 12GB

Virtual NIC: 8 for LAN connectivity

Virtual Hard Drive: 1x 80GB on the slow disk, plus 2x 250GB disks, one on each Samsung 850; a ZFS stripe across both holds the cache
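
For context, the heart of this VM is an nginx reverse proxy with proxy_cache pointed at the ZFS stripe. Here's a minimal sketch of the idea; the cache path, zone name, sizes, and server_name are placeholders, and the real Multiplay-style config adds per-CDN cache keys and a lot more tuning:

```nginx
# Simplified sketch of a Steam caching vhost (not the full Multiplay config)
proxy_cache_path /data/cache levels=2:2 keys_zone=steam:100m
                 inactive=72h max_size=400g;

server {
    listen 80;
    # Clients land here because pfSense's DNS forwarder spoofs the CDN names
    server_name *.steampowered.com;

    resolver 8.8.8.8;   # needed because proxy_pass uses variables

    location / {
        proxy_cache steam;
        proxy_cache_valid 200 72h;
        # Fetch from the real CDN on a miss, serve from cache on a hit
        proxy_pass http://$host$request_uri;
    }
}
```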

Download Box

Purpose: Windows box used to pre-cache updates before the LAN starts. Allows multiple people to log in via TeamViewer, then into Steam, and seed all relevant games into the cache.

Operating System: Windows Desktop OS

Virtual vCPU: 1

Virtual Memory: 1GB

Virtual NIC: 1 for LAN connectivity

Virtual Hard Drive: 400GB located on 2TB slow disk

Big Picture Network and Caching Advice:

  • pfSense and Network
    • Use a Multi-WAN configuration (yes, that means two ISPs) so that all proxied download traffic from the caching box goes out one WAN unthrottled, and all other traffic goes out the other WAN throttled. 2 WANs for the WIN. If you only have one ISP connection, then just limit the bandwidth available to the Steam caching server.
    • Use per-IP limits to ensure that all endpoints are limited to 2Mbps or another reasonable number based on your total available bandwidth. QoSing the individual gaming traffic into queues proved too nuanced for me; it was just plain easier and more consistent to grant individual bandwidth maximums. This doesn't count the proxied and cached downloads, which should be handled separately per the above bullet point.
    • Burst still isn’t working properly with the limiter config, apparently not an easy fix. Bug report here: https://redmine.pfsense.org/issues/3933
    • DNS spoofing is required here, which can be done out of the box using the DNS forwarder on pfSense
    • We made the config work once on a 5/1 DSL link, but it was horrible. Once we had at least a 35/5 cable connection, everything worked great.
    • The PowerConnect 5224 just doesn’t have enough available ports to fully service all the connectivity required. While it does allow you to create up to 6 LAGs with up to 4 links each, it basically means that there aren’t enough ports to have a main table, plus 2 large tables with enough LAGs in-between them. I recently purchased a used Dell 2748 to replace the 5224 at the core in the rack for around $100, and am looking forward to updating it.
  • Caching Box
    • Future state for big LANs: this would be a separate box like the guys at Multiplay.co.uk have laid out, with 8x 1TB SSDs in a ZFS RAID, 192GB of memory, 10Gb networking, and dual 12-core procs. The i3 procs with Samsung Pros seem to be working OK for us, but that's because of the small scale; we'd definitely need a couple of hoss boxes at scale.
    • When using multiple disks with ESXi, don't simply create a VMFS datastore with extents, as the data will mostly be written to one disk and IO will not be staggered evenly. For this reason I opted to present two virtual hard drives to the virtual machine and let ZFS stripe them at the OS level.
    • Created a NIC and IP for each service I was caching, for simplicity's sake.
    • There are some nginx improvements coming that will natively allow caching for Blizzard and Origin, which don't work today due to their random range requests, which can't be partially cached. See Steven's forum post here for updates: http://forum.nginx.org/read.php?29,255315,255324
    • CLI tips:
    • Utilize nload to display NIC in/out traffic
      • sudo apt-get install nload
      • sudo nload -U G -u M -i 102400 -o 102400
    • Log directories to tail for hit/miss logs
      • cd /data/www/logs
      • tail -f /location_of_log/name_of_log.log
    • Storage Disk Sizes & Capacity with Human Readable Values
      • df -h
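
As a companion to the tail tip above, here's a quick way to turn those hit/miss logs into a cache-hit ratio. This is just a sketch: it assumes your nginx log_format writes the literal upstream cache status (HIT/MISS) somewhere on each line, and the sample lines below are fabricated.

```python
import re
from collections import Counter

def cache_hit_ratio(lines):
    """Tally HIT/MISS markers in nginx cache log lines and return the hit ratio.

    Assumes the log_format writes the literal cache status (HIT, MISS, ...)
    on each line; adjust the regex to match your own format.
    """
    counts = Counter()
    for line in lines:
        m = re.search(r"\b(HIT|MISS)\b", line)
        if m:
            counts[m.group(1)] += 1
    total = counts["HIT"] + counts["MISS"]
    return (counts["HIT"] / total) if total else 0.0

# Fabricated example lines:
sample = [
    '10.0.0.5 - [01/Jan/2015] "GET /depot/chunk1" HIT',
    '10.0.0.6 - [01/Jan/2015] "GET /depot/chunk2" MISS',
    '10.0.0.5 - [01/Jan/2015] "GET /depot/chunk3" HIT',
]
print(round(cache_hit_ratio(sample), 2))  # → 0.67
```

Point it at the lines from /data/www/logs while a few clients download and you get a feel for how warm the cache is.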

Achievable Performance:

Single Portal 2 Download to a Mac with SSD drive: 68MBps! Whaaaaat??? Ahamazing.


Was able to push 94MBps of storage throughput at a small mini-LAN with 8 people downloading Portal 2 at the same time. Those 8 people downloaded Portal 2 in about 10 minutes total.
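
For what it's worth, the back-of-napkin math checks out. A tiny sketch; the ~7GB download size is my assumption, and MBps here means megabytes per second:

```python
def download_minutes(size_gb, client_mbps):
    """Minutes to download size_gb gigabytes at client_mbps megabytes/sec.

    Uses 1 GB = 1024 MB.
    """
    return size_gb * 1024 / client_mbps / 60

# 8 clients sharing ~94 MBps of cache throughput is ~11.75 MBps each;
# a ~7 GB download (assumed size) at that rate takes roughly:
per_client = 94 / 8
print(round(download_minutes(7, per_client), 1))  # → 10.2
```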



It’s been a great project, and I’m excited for the future, especially with the upcoming nginx modules that will hopefully provide partial-cache capability.


Installing ESXi 6.0 with NVIDIA Card Gives Fatal Error 10: Out of Resources

Thanks much to the original author of this post in the Cisco community; with it I was able to find the answer quickly. Figured I'd put it here in case anyone else hits it, since the error message is a little off.

https://communities.cisco.com/community/technology/datacenter/ucs_management/blog/2014/02/17/error-loading-toolst00-fatal-error-10-out-of-resources-cisco-ucs-c240-m3-server-with-esxi-55

Decompressed MD5: 00000000000000000000000000000000

Fatal Error: 10 (Out of Resources)



Change the MMCFG base parameter in the BIOS to 2 and you can install AOK.


Horizon Workspace 2.1 – Logon Loop after Joining AD Domain

Turns out that if you join Horizon Workspace to an AD domain that uses a different URL namespace (such as "pyro.local") than what you actually publish (such as "workspace.pyro.com"), you have to modify the local hosts file of the appliance; otherwise you'll end up in an endless loop of frustration, like I was for the past several hours. The worst part is that the config saves just fine, but authentication no longer works, leaving you puzzled and sad. Just goes to show you: read the release notes. Hopefully this helps someone faster than Google did for me.

Screenshots below. The release notes, which apparently I didn't take the time to read, spell this out (see screenshot).

When joining the AD domain, you'll get the screen below, which is fine, except people on the outside can't resolve that address.

Now it's off to vi to add a hosts entry. I had to log in as sshuser and then sudo to root. For those of you not fluent with the vi text editor (myself included), just hit i to insert, type what you need to type, then hit Escape to stop inserting text.
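
For reference, the hosts entry you're adding looks something like this; the IP is hypothetical, so use the appliance's own address and the external FQDN you publish:

```
# /etc/hosts on the Workspace appliance
10.0.0.50    workspace.pyro.com workspace
```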

After you do that, hit ZZ to save and write the changes to the file. Lo and behold! You can resolve your external address and you don't have a sad loop. It's a Christmas miracle. What a rawr.



Adding Microseconds to vRealize Operations Graphs

In my old role as a vSphere administrator for a single company, we upgraded our storage from legacy spinning disk with a small amount of cache. That environment often experienced average disk latency greater than 3ms, and often much higher than that, getting up to 20ms or more for 30 seconds at a time, over a thousand times a day. We upgraded to a hybrid array and got consistent sub-millisecond latency for both read and write operations.

vCenter Performance

We tracked this with vCenter as well as vCenter Operations Manager. In vCenter, it's tracked per virtual machine: we could monitor it from the virtual machine's Performance tab by selecting Virtual Disk and watching the read and write latency counters in both milliseconds and microseconds, as seen below.

Monitoring in vCenter was great, but it's only available for live data, i.e. the last hour, which doesn't help in seeing history. We fixed this by looking in vCenter Ops, and in our installation the numbers were there. Great: history. We could now use this latency number as justification for our new storage selection.

Fast forward to my current role, as a consultant. I recently installed vRealize Operations Manager for a customer. One of the reasons was to monitor disk performance as they evaluated new storage platforms for their virtual desktop environment. As I looked into vCenter, the counters were there, but when I looked into vRealize Ops the counter wasn’t available. What gives?

After contacting my previous coworker, we compared configurations and neither of us could figure it out. I already had a case open with VMware regarding the vRealize Ops install for an unrelated issue, and on a recent call with VMware support we compared these two vCenter/vRealize Operations Manager environments. It really stumped the support engineer as well. After about five minutes, we figured it out: when vCenter Operations Manager was installed, the option to gather all metrics had been selected. vRealize Operations Manager doesn't give you this option; instead, you need to change the policy.

Here’s How:
Log into vRealize Operations Manager, select a VM, and select the Troubleshooting tab. Take note of the policy that is in effect for the VM; if you haven't tweaked vROps, the policy will be the same for all objects.

vRealize Policy

Click on this policy; it will link to the administration portion of vROps, focused on the “Policies” section.

vRealize Policy List

Click on the “Policy Library” tab, expand the “Base Settings” group, and select the policy that was shown at the top of the virtual machine's Troubleshooting tab. Then click on the edit pencil.

vRealize Operations Edit Policy

Once in the “Edit Monitoring Policy” window (shown below), limit the Object Type to “Virtual Machine” and filter on “Latency”. This reduces the attribute list to a single page.

vRealize Override Attributes

Change the “State” for Virtual Disk|Read Latency (microseconds) and Virtual Disk|Write Latency (microseconds) to “Local” with a green checkmark.

vRealize Operations Attribute Changes

After about 5 minutes, verify that the data is now displayed in vRealize Ops.

vRealize Operations Graph

 

Why is this counter important?
As SSD arrays like Pure Storage and XtremIO or hybrid arrays like Tintri become more prevalent, millisecond latency numbers sit near 0 and produce a single flat line with no real data; the microsecond counters let administrators see changes in latency that would otherwise be flattened out.
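
The flattening effect is easy to demonstrate: a counter that reports whole milliseconds truncates everything under 1ms to zero, while the microsecond counter keeps the detail. A quick illustration with made-up latency samples:

```python
samples_us = [850, 620, 910, 430, 1200]  # hypothetical latencies in microseconds

# What an integer-millisecond counter reports for each sample:
as_ms = [us // 1000 for us in samples_us]
print(as_ms)        # → [0, 0, 0, 0, 1] — the sub-millisecond variation collapses

# The microsecond counter preserves the real spread:
print(samples_us)
```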

I’d like to thank El Gwhoppo for allowing me to guest author. I hope this tidbit is useful information!

Posted in vCenter, VMware

How to Extend XtremIO Volumes

Want to extend an XtremIO volume? Simply right-click it and select Modify Volume. Pretty darn easy. I was somewhat annoyed to find that this information isn't exactly readily available from "the Google", so I decided to write it up after verifying with my EMC home boys that it's a non-destructive process. This was performed on an XtremIO running the 3.0.0 build 44 code. Not sure if it will work for shrinking a volume, but perhaps someone can comment with that information.

