We run Exchange 2013 CU8 in a DAG configuration (3 DAG members in primary site, one in DR site). Two weeks ago we applied Windows patches to all servers in the DAG and ever since we have been having an issue.
The issue is this - the Exchange Replication service is not accepting *management* connections from servers in the other site. For the purposes of explanation, let's call the sites Primary and DR. If I do a 'get-mailboxdatabasecopystatus' from a server in
the Primary site, it shows me the status correctly for database copies on servers in the primary site, but shows a status of ServiceDown for all database copies in the DR site.
Similarly, if I run get-mailboxdatabasecopystatus in the DR site, it correctly shows the status for databases hosted on the server in the DR site, but shows ServiceDown for all database copies on servers in the primary site.
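For reference, the check looks roughly like the sketch below when run from the Exchange Management Shell on a server in the Primary site. The server names are placeholders rather than our real ones, and scoping the query with -Server is just one way to run it:

```powershell
# Placeholder names (hypothetical); substitute a real DAG member from each site.
$localServer  = 'PRI-MBX01'
$remoteServer = 'DR-MBX01'

# Copies hosted on a server in the same site: status is reported correctly.
Get-MailboxDatabaseCopyStatus -Server $localServer |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize

# Copies hosted on a server in the other site: every copy shows ServiceDown.
Get-MailboxDatabaseCopyStatus -Server $remoteServer |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```

Running the same two queries from the DR server gives the mirror image: local copies report correctly, remote copies show ServiceDown.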
Here's the kicker - DAG replication is up to date, meaning that the DAG is replicating fine and all copies in all sites are staying up to date without any issues or delays.
Similarly, all other management functions between sites are working - viewing the queues, accessing message tracking logs, reading/setting virtual directory settings, etc. all work correctly. If I run Get-DatabaseAvailabilityGroup it succeeds, but
if I run "Get-DatabaseAvailabilityGroup -Status" it fails for servers in the other site with the following error:
"A server-side administrative operation has failed. The Microsoft Exchange Replication service may not be running on server (servername). Specific RPC error message: Error 0x71a (The remote procedure call was cancelled) from RpccGetCopyStatusEx4"
Here's what we've tried so far:
- Totally disabled IPv6 by setting the DisabledComponents registry value to 0xffffffff
- Upgraded Exchange 2013 to CU14
- Ensured that Windows Firewall is disabled
- Ensured that WAN firewall is configured for any<==>any
- Rebooted countless times
- Validated that all other management and RPC functions work (mapping drives between sites, Exchange management across sites, etc)
- Disabled all network offloading/RSS/Chimney functions
We're close to rolling back the Windows updates. But if a Windows update caused this, we'd expect other people to be hitting it too, and we haven't been able to find any reports of exactly this symptom (management queries
to the Replication service failing only across the WAN).
Any ideas?