Gert Lievens - Blog


SQL Server 2005 cluster failover failed

We ran into an annoying problem while installing Windows Updates on a customer's Windows Server 2008 failover cluster. While testing the failover, we saw that the SQL Server 2005 cluster resource did not fail over successfully. The strange thing was that nothing in the configuration had changed since the previous successful tests. The other resources also failed over without any issues.

We found the following errors in the Windows Event Viewer and in the cluster log file under C:\Windows\Cluster\cluster.log:

[sqsrvres] ODBC sqldriverconnect failed

[sqsrvres] checkODBCConnectError: sqlstate = 28000; native error = 4818; message = [Microsoft][SQL Native Client][SQL Server]Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'.

Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'. [CLIENT: IP]184800000E0000000A00000043004C0044004200310050002D00490030000000070000006D00610073007400650072000000

[sqsrvres] OnlineThread: Error connecting to SQL Server.

The client was unable to reuse a session with SPID 110, which had been reset for connection pooling. The failure ID is 88460000140000000A00000043004C0044004200310050002D0049003000000000000000. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.

SQL Server cannot accept new connections, because it is shutting down. The connection has been closed. [CLIENT: IP]

[sqsrvres] CheckServiceAlive: Service is dead

[sqsrvres] OnlineThread: service stopped while waiting for QP.
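
The cluster log tends to be large, so it can help to pull out only the lines written by the SQL Server resource DLL, which are the ones tagged [sqsrvres] above. The following is a minimal sketch of such a filter, not the method we used at the time. The function names are hypothetical, and the file encoding is an assumption (undecodable bytes are replaced rather than raising an error).

```python
from typing import Iterable, List

# Tag that marks the SQL Server resource DLL lines quoted in this post.
MARKER = "[sqsrvres]"


def sqsrvres_lines(lines: Iterable[str]) -> List[str]:
    """Return only the lines that contain the [sqsrvres] tag, without line endings."""
    return [line.rstrip("\r\n") for line in lines if MARKER in line]


def read_cluster_log(path: str) -> List[str]:
    """Filter a cluster log file on disk.

    Assumption: the file can be read as UTF-8; bytes that cannot be
    decoded are replaced so the scan never aborts halfway.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sqsrvres_lines(handle)
```

On a cluster node, calling `read_cluster_log(r"C:\Windows\Cluster\cluster.log")` would return the matching lines in the order they were logged.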

I began troubleshooting with the most basic elements of a SQL Server cluster, starting with the SQL Server Browser service. The Browser service appeared to be running without problems, but after a lot of searching I found an inconsistency on one of the nodes.
The Browser service runs under the Network Service account. Each node of the cluster has a local security group that grants the permissions required to run the SQL Server Browser service:

SQLServer2005SQLBrowserUser$ComputerName

Logically, the Network Service account should be a member of this group so that it has the privileges needed to run the Browser service correctly. On one node, however, the Network Service account was missing from the local security group!

After adding the Network Service account to the local group and restarting the SQL Server Browser service on both nodes, the failover worked again. It was a long search for something that simple. I cannot explain why this setting changed, or rather, which I find more plausible, who changed it.
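
For anyone who wants to script the check and the fix described above, here is a rough sketch. It is not what we ran at the customer. It assumes an English Windows installation (the `net localgroup` output and the account name NT AUTHORITY\NETWORK SERVICE are localized on other languages) and that the Browser service has its default service name, SQLBrowser. All function names are hypothetical. By default it only prints the repair commands instead of running them.

```python
import os
import subprocess
from typing import List

# Assumption: English account name; localized Windows versions use another name.
NETWORK_SERVICE = r"NT AUTHORITY\NETWORK SERVICE"


def browser_group_name(computer_name: str) -> str:
    """Local group name in the form given in the post."""
    return "SQLServer2005SQLBrowserUser$" + computer_name


def parse_members(net_output: str) -> List[str]:
    """Extract member names from English `net localgroup <group>` output."""
    members = []
    in_members = False
    for line in net_output.splitlines():
        text = line.strip()
        if not in_members:
            # The member list starts after a separator line made only of dashes.
            in_members = bool(text) and set(text) == {"-"}
            continue
        if text and not text.startswith("The command completed"):
            members.append(text)
    return members


def is_member(net_output: str, account: str = NETWORK_SERVICE) -> bool:
    """Case-insensitive membership test against the parsed member list."""
    wanted = account.lower()
    return any(member.lower() == wanted for member in parse_members(net_output))


def repair_commands(group: str, account: str = NETWORK_SERVICE) -> List[List[str]]:
    """Commands that add the account to the group and restart the Browser service."""
    return [
        ["net", "localgroup", group, account, "/add"],
        ["net", "stop", "SQLBrowser"],
        ["net", "start", "SQLBrowser"],
    ]


def main() -> None:
    """Check this node and print (not execute) the repair commands if needed."""
    group = browser_group_name(os.environ["COMPUTERNAME"])
    result = subprocess.run(
        ["net", "localgroup", group], capture_output=True, text=True, check=True
    )
    if is_member(result.stdout):
        print("Network Service is already a member of", group)
    else:
        for command in repair_commands(group):
            print("would run:", subprocess.list2cmdline(command))
```

Calling `main()` from an elevated prompt on each node reports whether that node needs the fix. Printing the commands first is deliberate: on a production cluster I would rather review them than have a script change group membership on its own.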