Response Group Calls not Presented to Agents due to SQL Write issues (Event ID 32269 and 32270)

By | June 21, 2017

The issue

So, I ran into a weird issue a while back where Response Group calls would be answered by RGS, the IVR messages and actions would play out, and the call would be dropped into a queue to look for an agent.
The caller would hear the queue's on-hold music and, after the queue timeout period, be correctly routed to voicemail.
The issue was that during the 30-odd seconds the callers were in the queue, calls were not presented to RGS agents.
I checked all the obvious stuff (users signed in to RGS, presence set correctly, not already on a call, etc.) and tried removing and re-adding the users to the RGS group.

Still no good.

TL;DR: Check for Event ID 32269 or 32270 on the Skype4B front-end servers. If you see a bunch of them, shut down the pool entirely and restart the SQL instances. Then reboot the Skype4B front ends and restart the pool.
Log of Publication Delay showing 54 Hours of Status Drift
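
If you have exported the front-end event log to CSV, a quick count of the two event IDs can tell you whether you are looking at "a bunch of them". The sketch below is only an illustration: the `Event ID` column name, the helper names and the threshold of 10 are my assumptions, not anything Skype for Business defines.

```python
import csv
import io
from collections import Counter

# The two event IDs named in this post.
RGS_SQL_EVENT_IDS = {32269, 32270}


def count_sql_write_events(csv_text, id_column="Event ID"):
    """Count rows in a CSV export whose event ID is 32269 or 32270.

    `id_column` is an assumed header name; adjust it to match your export.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            event_id = int(row[id_column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable event ID
        if event_id in RGS_SQL_EVENT_IDS:
            counts[event_id] += 1
    return counts


def looks_affected(counts, threshold=10):
    """Hypothetical rule of thumb: 'a bunch' means at least `threshold` hits."""
    return sum(counts.values()) >= threshold
```

Feed it the text of the export from each front end; if `looks_affected` returns `True` on any of them, the full pool restart described in this post is worth considering.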

Turns out Greig had this issue in his lab too: https://greiginsydney.com/sfb-2015-server-update-cu4-november-2016/

 

Looking into the issue

Further investigation showed that the RGS presence watcher wasn’t correctly updating presence for these users, which caused RGS routing to fail because it incorrectly treated users as busy or not busy.

My investigations found that the Skype front-end servers were unable to inject entries into the QoE database or replicate changes, including presence, to the backend database for users.

Example of QoE Entry failure, indicating an issue with the SQL backend


These issues are typically caused by extreme load on the system or a failure of the SQL backend. Reporting via the Statsman package and other built-in Windows tools showed the servers were well within normal operating range and not under any undue load. Network connectivity tests between the front ends and the SQL backend also passed.

Upon logging into the SQL server, I noted that it reported the server had been shut down unexpectedly. I checked the event logs and saw the server had rebooted due to a BugCheck (blue screen). Later that night, another host in the cluster rebooted due to a BugCheck as well.

Examples of the BugCheck messages on the SQL Cluster nodes


Further checking into the reliability of the SQL backend showed multiple connection issues between Skype for Business and the SQL clusters.

Examples of connection issues

I then found reports that Skype had issues running a stored procedure on the SQL backend early in the morning after the blue screen.

The SQL error message reported that the server was rebooting, but I could find no log of this on the SQL server.

I could see, however, that later in the day a SQL administrator logged into the server and confirmed an unexpected shutdown on both SQL cluster nodes.

Unexpected Shutdown confirmed

Unexpected Shutdown confirmed on node 1

Unexpected Shutdown confirmed on node 2

Soon after the unexpected shutdown was confirmed, load increased on the Skype servers due to the start of business. At 8:54 AM, Publication Sync issues started being logged on the front ends. This in turn started the issues with the queues, as actual presence and published presence slowly drifted apart.

Initial Log of Publication Delay


Log of Publication Delay showing 54 Hours of Status Drift
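
To make "status drift" concrete: it is simply the gap between when a user's presence actually changed and when that change was finally published. A minimal sketch follows, where the function name and the example dates are made up for illustration (only the 8:54 AM start and the 54-hour figure come from the logs above).

```python
from datetime import datetime


def presence_drift_hours(actual_change, published_change):
    """Hours between a real presence change and the moment it was published."""
    return (published_change - actual_change).total_seconds() / 3600.0


# Illustrative dates only: a change at 8:54 AM that is not published
# until 2:54 PM two days later has drifted by 54 hours.
drift = presence_drift_hours(
    datetime(2017, 1, 2, 8, 54),
    datetime(2017, 1, 4, 14, 54),
)
```

With a gap that large, the presence RGS routes on bears little relation to what the agents are actually doing, which is why calls sat in the queue without being offered to anyone.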


To resolve this issue, I needed to stop the services on the entire pool at the same time and restart the SQL nodes. Restarting servers one by one didn’t help, as the RGS matchmaking service would move (in its stuffed state) from server to server.

When the pool was restarted the issue was resolved. We escalated this to the SQL team and asked them to investigate.
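
The important part of the fix is the ordering: every front end has to be stopped before any SQL node is restarted, otherwise the broken matchmaking service just hops to a server that is still up. Here is a rough sketch of that ordering as a checklist generator. The host names and helper names are hypothetical and the steps are plain text; this does not call any Skype for Business or SQL tooling.

```python
def full_pool_restart_plan(front_ends, sql_nodes):
    """Return the ordered steps for a whole-pool restart, as plain strings."""
    steps = [f"stop services on front end {fe}" for fe in front_ends]
    steps += [f"restart SQL node {node}" for node in sql_nodes]
    steps += [f"reboot front end {fe}" for fe in front_ends]
    steps.append("start the pool")
    return steps


def front_ends_stopped_before_sql(plan):
    """True only if every front-end stop comes before the first SQL restart."""
    stops = [i for i, s in enumerate(plan) if s.startswith("stop services")]
    sql = [i for i, s in enumerate(plan) if s.startswith("restart SQL")]
    return bool(stops) and bool(sql) and max(stops) < min(sql)


# Hypothetical host names, for illustration only.
plan = full_pool_restart_plan(["fe1", "fe2"], ["sql1", "sql2"])
```

A rolling, one-server-at-a-time restart fails the `front_ends_stopped_before_sql` check, which matches what I saw in practice.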

 

 
