This time it was Citrix, … sort of!

It really WASN'T Citrix's fault though!

It’s Storytime with Jacques Bensimon! Enjoy!

“It’s NOT Citrix!” – Jacques Bensimon

I recently had to troubleshoot a strange problem affecting every XenApp server in a site (don’t even try to get me to call them Citrix Virtual Apps servers!):  within hours of a fresh reboot, they all started experiencing a variety of issues including inability to access home directories and other network shares, AD authentication issues (“No logon server available”), Group Policy application failures, loss of registration with any DDC, etc.  Of course, like any self-respecting Citrix guy (war cry: “It’s not Citrix!”), my go-to move when faced with even one of those symptoms is to immediately blame the network, am I right?!  This was an especially attractive explanation in this case:  it was occurring in a newly brought up data center (therefore without a track record of network reliability), the servers in question were running a (Citrix PVS) provisioned image that differed from the previous production image at the original data center only in that I had updated its VMware Tools (because new hypervisors) and applied a few Windows & Office updates while I had the chance, and the servers all behaved perfectly for some time following a restart.

A nagging counterargument, however, was that other (non-XenApp) hosts on the same hypervisors, subnets, storage, etc. were experiencing no such network issues, and resetting all the affected machines’ account passwords via Provisioning Services (the one outside possibility I could think of initially) did nothing to resolve the matter.  So, faced with the possibility that maybe the updated vmxnet3 NIC driver or one of the new Windows (2016) updates was messing with network connectivity, I had no choice but to look deeper into it (with little hope that this was the explanation given that the servers’ PVS connections remained rock solid throughout – foreshadowing:  those are UDP-based!).  The event logs were of course full of complaints about failed authentications, GP application attempts, DDC registrations, etc. with no explicit cause(s) provided, but a few warning events (I think from source Tcpip, maybe ID 4227 or 4231, maybe not – sorry, I wasn’t thinking “screenshot” in the heat of battle!) mentioned something along the lines of being unable to re-use an ephemeral TCP port soon after release (my words).
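(For the record – and strictly as an after-the-fact illustration, not something from the heat of battle – a quick way to gauge whether a box is anywhere near ephemeral port exhaustion is to tally its TCP endpoints by state and eyeball the total against the dynamic port range, which on any modern Windows build defaults to 49152–65535, i.e. roughly 16K ports.  The little Python sketch below uses the psutil library purely as my own illustrative choice; run it elevated or it won’t see everything.)

    import psutil
    from collections import Counter

    # Tally local TCP endpoints by state (ESTABLISHED, TIME_WAIT, etc.).
    # A server flirting with ephemeral port exhaustion will show a total
    # in the tens of thousands here.
    states = Counter(conn.status for conn in psutil.net_connections(kind="tcp"))
    total = sum(states.values())

    for state, count in states.most_common():
        print(f"{state:<15} {count}")
    print(f"{'TOTAL':<15} {total}")  # compare against the ~16K dynamic port range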

So off to Sysinternals’ TCPView I go, and sure enough, so many thousands of TCP ports were in use that TCPView was unable to enumerate them all in real time: it was too busy trying to keep up with displaying ongoing allocations (green) & releases (red) to ever give a final count in its status bar as it normally does.  And “who” was responsible for all but a few of these port allocations?  A mysterious single java.exe instance that was continuously establishing and dropping connections to & from localhost (obviously many more establishments than teardowns).  Moving on to Sysinternals’ Process Explorer, the java.exe instance turns out to be the child process of a service executable I recognize as belonging to a third-party event log consolidation product installed on the image, one I shan’t name given (a) my aversion to the possibility of legal action, and (b) my well-known sense of fairness – the product had to my knowledge never before caused similar issues, and this one could easily have been due not to the collector service itself but to some problem on its back-end server (though a more graceful failure behavior would have been commendable) – this is still under investigation (not by me, thank you very much).  Stopping the service, which also cleanly ended the Java process, immediately restored some of the previously compromised capabilities (e.g. Group Policy application and DDC registration), but some lingering damage remained (e.g. domain secure channel verification via nltest /sc_query:DOMAINNAME still complained of “No logon server available”).  Stopping and disabling the service at startup (remember, this was happening on a read-only provisioned image) via the servers’ standard Startup script (could also have used Group Policy) entirely resolved the matter.
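(Again purely as a sketch – with psutil once more being my illustrative choice rather than anything from the actual troubleshooting session – the same approach can answer the “who” question without watching TCPView scroll by, by ranking processes by how many TCP endpoints they own.)

    import psutil
    from collections import Counter

    # Count TCP endpoints per owning process ID – roughly the view TCPView
    # gives interactively.  Run elevated so the owning PIDs are reported.
    per_pid = Counter(conn.pid for conn in psutil.net_connections(kind="tcp") if conn.pid)

    for pid, count in per_pid.most_common(10):
        try:
            name = psutil.Process(pid).name()
        except psutil.NoSuchProcess:
            name = "<exited>"
        print(f"{name:<25} (PID {pid:>6}) {count} TCP endpoints")

In this story, a lone java.exe would have floated straight to the top of that list.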

Silver lining:  My legendary mistrust of server monitoring products (if log collection isn’t exactly “monitoring”, it’s certainly “monitoring-adjacent”), a source of much amusement among my colleagues and clients (okay, maybe it’s annoyance, but I like to stay positive), is once again proven to have some merit.  Maybe it’s just me, but I like to leave a little CPU, RAM, and network bandwidth for the computing pleasure of actual users rather than for background pig processes … and I’m now adding TCP ports to that list!

Curmudgeonly yours,

Jacques.


Be sure to follow Jacques Bensimon on Twitter at @JacqBens
