Preventing a hung XD session from breaking Storefront DDC Load balancing

Chris Hahn sent over details of an incident that happened at a client recently.

Using findings from XenDesktop Scout and the Storefront event logs, Chris was able to troubleshoot a sporadic “There are no apps or desktops available” error that users hit when trying to log in to the XenDesktop environment.

The data indicates a domino effect, triggered when a user tries to reconnect to a XenDesktop session that is in a hung state. The sequence of events was:

  • UserA logs in to Storefront server 1 and tries to reconnect to their existing VDA session, but the VDA is hung
  • Desktop Studio shows the VDA session as Active but unregistered
  • The VDA is pingable but not accessible remotely, for example when attempting to browse to \\vdamachinename\c$
  • vCenter reports VMTools service is not running on the virtual machine
  • DDC 1 times out trying to prepare the hung VDA for a connection
  • Storefront times out waiting for DDC 1 to respond, removes it from load balancing, and tries the next broker in the load balancing list
  • DDC 2 times out trying to prepare the hung VDA for a connection
  • Storefront times out waiting for DDC 2 to respond, removes it from load balancing, and tries the next broker in the load balancing list
  • DDC 3 times out trying to prepare the hung VDA for a connection
  • Storefront times out waiting for DDC 3 to respond, removes it from load balancing, and tries the next broker in the load balancing list
  • DDC 4 times out trying to prepare the hung VDA for a connection
  • Storefront times out waiting for DDC 4 to respond, removes it from load balancing

At this point, because all brokers for the target XenDesktop farm have been removed from load balancing, Storefront’s next action is determined by the allFailedBypassDuration setting, which was set to 5 minutes. When all brokers for a farm are marked as down, Storefront waits 5 minutes before it tries any of them again. The result is an outage of up to 5 minutes for any user attempting to enumerate desktops from the affected farm on the affected Storefront server.
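To make the effect of that setting concrete, here is a rough Python sketch of the behavior described above. It is a toy model only, not Storefront code: the Farm class, the handle_request method, and the one-request-per-minute timeline are all made up for illustration, and the model ignores the separate per-broker bypass timer.

```python
from dataclasses import dataclass, field


@dataclass
class Farm:
    """Toy model of one Storefront server's view of a farm's brokers."""
    brokers: list
    all_failed_bypass_minutes: int  # stands in for allFailedBypassDuration
    bypassed: set = field(default_factory=set)
    retry_at: float = 0.0  # minute at which bypassed brokers may be tried again

    def handle_request(self, now, broker_times_out):
        """Return the broker that answered, or None if enumeration failed."""
        if len(self.bypassed) == len(self.brokers):
            if now < self.retry_at:
                return None  # still inside the all-failed bypass window
            self.bypassed.clear()  # window is over, every broker is eligible
        for broker in self.brokers:
            if broker in self.bypassed:
                continue
            if broker_times_out(broker):
                self.bypassed.add(broker)  # removed from load balancing
                continue
            return broker
        # Every remaining broker timed out on this request.
        self.retry_at = now + self.all_failed_bypass_minutes
        return None


def failed_minutes(all_failed_bypass_minutes):
    """Minutes (1 to 9) at which an unrelated user's logon fails."""
    farm = Farm(["DDC1", "DDC2", "DDC3", "DDC4"], all_failed_bypass_minutes)
    # Minute 0: UserA reconnects to the hung VDA, so every broker times out.
    farm.handle_request(0, lambda broker: True)
    # Minutes 1 to 9: other users log on; the brokers answer normally for them.
    return [m for m in range(1, 10)
            if farm.handle_request(m, lambda broker: False) is None]


print(failed_minutes(5))  # [1, 2, 3, 4]
print(failed_minutes(0))  # []
```

In this model a 5 minute setting turns one hung session into four minutes of failed logons for everyone else on that Storefront server, while a setting of 0 lets the very next request go back to the brokers.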

With assistance from the Storefront support team, we are recommending changing allFailedBypassDuration to 0 on all Storefront servers to avoid the artificial outage induced by a single hung virtual machine. As I understand it, 5 minutes is the default setting on Storefront 2.5. On Storefront 2.6 the default is now 0 minutes.
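If you have several Storefront servers to check, something like the following Python sketch could report the current value on each one. It assumes the setting is stored as an XML attribute named allFailedBypassDuration in the store's web.config; the sample fragment below is hypothetical and only there to show the idea, so check the element names against your own file.

```python
import xml.etree.ElementTree as ET


def find_all_failed_bypass(config_xml):
    """Return (tag, value) for every element carrying allFailedBypassDuration."""
    root = ET.fromstring(config_xml)
    return [(el.tag, el.get("allFailedBypassDuration"))
            for el in root.iter()
            if "allFailedBypassDuration" in el.attrib]


# Hypothetical fragment for illustration only; a real web.config is much
# larger and its element layout may differ from this.
SAMPLE = """
<configuration>
  <farms>
    <farm name="ExampleFarm" allFailedBypassDuration="5" />
  </farms>
</configuration>
"""

print(find_all_failed_bypass(SAMPLE))  # [('farm', '5')]
```

The function searches by attribute name rather than by element path, so it does not depend on exactly where the farm definition sits in the file.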

See the related eDocs article, Configure server bypass behavior, for additional information.
