VMware Horizon View… great product. View Composer? Thorn in my side.
Two weeks back I completed the upgrade of our View infrastructure from 5.3.2 to 6.0.1. It was a smooth upgrade, seemingly, and I was pretty pleased with how little time it took to complete the job. Victory for our team? Not so much.
Over the next week, I had dozens of complaints from IT staff: recompose operations were failing, searches for events related to these failures were returning no results (or just not completing at all), and there were multiple odd “I am getting this weird error on my desktops!” complaints. The desktop errors all turned out to be unrelated to the upgrade (the template was out of disk space so the user profile could not load, the View Agent installation was broken, etc.), but sorting out the event log and Composer problems was harder…
View 6 Event Log database bug:
Following the upgrade, I was looking into increasing the View Event Log query limit at the request of a client, who was not able to view more than the past few hours of events for his pool owing to the default event query limit of 2000 events. I noticed that these queries, in addition to being short on useful information, were also taking several minutes to complete. After bumping the query limit to 6000 events, we found that the queries were taking over 30 minutes to complete, and hogging all the CPU on the Virtual Center server (where the events database is hosted)! I verified that memory and disk were not bottlenecked on the SQL database (I could not add more CPU because I was already at the SQL Standard Edition max of four cores), and set up SQL tracing to look for deadlock events. After running into a bunch of dead ends, I finally opened a support case with VMware.
Unsurprisingly, the first response was “well, lower your query limit.” I explained that no, I was not going to do that. I also pointed out that selecting 6000 records from a 2.4 GB database really should not take 30 minutes, and that engineering just needed to buckle down and fix whatever index was causing the problem. A few days later, I was given one line of T-SQL to run against the View Events database to add a missing index. Query got executed, index created, and voila! Event queries started running in seconds, not hours. Here is the T-SQL:
CREATE INDEX IX_eventid ON dbo.VE_event_data (eventid)
Your table name might be slightly different, depending on the table prefix you selected when setting up the events database.
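To make the prefix point concrete, here is a minimal Python sketch that builds the same statement for a different table prefix. The helper name build_index_statement is made up for illustration, and it assumes the table name is simply the prefix followed by event_data (which matches my VE_event_data table, but check yours before running anything).

```python
import re


def build_index_statement(table_prefix="VE_"):
    """Build the CREATE INDEX statement for a given events table prefix.

    Assumption: the events data table is named <prefix>event_data,
    as in the VE_event_data table from this post.
    """
    # Only allow plain identifier characters so the prefix cannot
    # smuggle extra SQL into the statement.
    if not re.fullmatch(r"[A-Za-z0-9_]*", table_prefix):
        raise ValueError(f"unexpected characters in prefix: {table_prefix!r}")
    return f"CREATE INDEX IX_eventid ON dbo.{table_prefix}event_data (eventid)"


if __name__ == "__main__":
    # Print the statement so it can be reviewed and pasted into a query window.
    print(build_index_statement("VE_"))
```

This only prints the statement; running it against the events database is still a manual step.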
We have seen this before… someone recomposes a pool, the job half-finishes then stops, no error. The task cannot be canceled, the pool cannot be deleted, and all other Composer operations in the infrastructure grind to a halt. Why? If you call VMware support, the first thing they will tell you is “cache corruption”. The next is “stale desktops”. Huh?
Deleting Stale Desktops:
Clearing the Connection Server Cache:
No KB for this one that I am aware of. Here is what they always tell me to do… ready? You are going to like this…
- Shut down all of the connection servers in your farm.
- Turn the connection servers back on, one at a time.
The worst part is that neither of these solutions worked. However, what I did find was that after powering the connection servers back on, some Composer operations would succeed, but it was only a matter of time before one job failed and brought operations to a halt. Finally I noticed that when rebooting one of the connection servers (the newest one, used for testing security settings), jammed jobs would immediately resume. After digging into the logs in C:\ProgramData\VMware\VDM\logs\, I found that the Connection Server was reporting literally thousands of “could not connect to vCenter server at URL…” errors per day. Why? Because like a noob I did not give this connection server an interface to the vCenter server. Bad on me. However, these critical failures do not show up in the Windows event logs, nor do they get reported up to the View Administrator console. I had a bad connection server in my environment that was killing Composer operations, and View Administrator thought everything was peachy. Boo! I have complained to VMware support, for what it is worth. I also fixed the connection server, and things are back to “normal”, whatever that means.
I also got my manager to approve using Splunk to collect all View log files, so that I at least will have an easier time of discovering errors when they arise in the future.
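If you do not have Splunk handy, something like the following Python sketch can give a rough count of those vCenter connection errors per log file. It is a sketch under assumptions, not a tested tool: the function name count_vcenter_errors is mine, the *.txt file pattern is a guess at how the log files are named, and the search phrase is just the wording quoted above, so adjust both to match what you actually see in the log directory.

```python
from collections import Counter
from pathlib import Path

# Phrase quoted in this post, lowercased for a case-insensitive match.
# The exact wording in your logs may differ.
ERROR_PHRASE = "could not connect to vcenter server"


def count_vcenter_errors(log_dir, pattern="*.txt"):
    """Return a Counter mapping log file name to number of matching lines.

    Assumption: the logs are plain-text files matching `pattern`.
    """
    counts = Counter()
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a file has odd bytes.
        with path.open("r", encoding="utf-8", errors="replace") as handle:
            hits = sum(1 for line in handle if ERROR_PHRASE in line.lower())
        if hits:
            counts[path.name] = hits
    return counts


if __name__ == "__main__":
    # Log directory mentioned in this post; run this on the connection server.
    log_dir = r"C:\ProgramData\VMware\VDM\logs"
    for name, hits in count_vcenter_errors(log_dir).most_common():
        print(f"{name}: {hits}")
```

Running it on each connection server in turn should make a misbehaving one stand out quickly, since a healthy server would be expected to show few or no hits.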