ActiveSync woes – "Cannot get mail" and the case of the endless re-sync

We recently experienced a really bizarre issue with our ActiveSync infrastructure. Users started complaining that their contacts were disappearing and that their inboxes would re-synchronise constantly. All items in the inbox would disappear and then reappear, starting with the oldest item. Some items were even dated at the Unix epoch. Users on iOS would get a “Cannot get mail” error screen, and downloading emails would either time out or take a very long time.

We’re set up with TMG in our DMZ, which then sends traffic to a pair of CAS servers internally. We’ve been running Exchange 2010 SP2 and 2003 in co-existence for some time now, as some of our national offices are still in the process of migrating users across.

Our troubleshooting covered all areas, from looking at ActiveSync logs from IIS, running the Test-ExchangeConnectivity scripts, to testing on the devices themselves – you name it, we tried it. Here’s a quick way to turn up the logging level on ActiveSync using PowerShell:

Get-EventLogLevel | Where-Object {$_.Identity -like "MSExchange ActiveSync*"} | Set-EventLogLevel -Level High
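Once the extra logging has served its purpose, it is worth turning it back down again. The following is a minimal sketch rather than a tested procedure: it assumes the Exchange Management Shell, that Lowest is the level you want to return to, and that the ActiveSync events land in the Application log under a source name starting with "MSExchange ActiveSync".

```powershell
# Read the most recent ActiveSync events from the Application log.
# The source name pattern is an assumption; adjust it to what Event Viewer shows.
Get-EventLog -LogName Application -Source "MSExchange ActiveSync*" -Newest 50 |
    Format-Table TimeGenerated, EntryType, Message -AutoSize -Wrap

# Put the ActiveSync logging categories back to the lowest level when done.
Get-EventLogLevel |
    Where-Object { $_.Identity -like "MSExchange ActiveSync*" } |
    Set-EventLogLevel -Level Lowest
```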

The usual suggestions, such as checking permissions on the user account in AD and various other settings, turned out not to be relevant. We even investigated the possibility that the problem could be caused by users still on iOS 4.0, which was known to cause issues and unusually high server load.

We then noticed that the TMG box would experience timeouts when requesting DNS resolution from our internal DNS servers. The TMG connectivity verifiers for AD were also reporting that the LDAP servers were unreachable. This pointed to some sort of connectivity issue between TMG and the CAS servers. Circumventing the TMG box by VPN’ing in or connecting via our corporate WiFi seemed to resolve the issue.
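A quick way to confirm this kind of DNS timeout from the TMG box itself is to time a few lookups. This is only a sketch: the host names are hypothetical placeholders, and it goes through the Windows resolver via .NET, so cached answers will come back quickly (run ipconfig /flushdns first if you want to hit the DNS servers every time).

```powershell
# Hypothetical internal names; replace with your own CAS or domain controller names.
$names = @('cas01.example.local', 'cas02.example.local')

foreach ($name in $names) {
    try {
        # Time a single lookup through the system resolver.
        $elapsed = Measure-Command { [System.Net.Dns]::GetHostAddresses($name) | Out-Null }
        "{0}: resolved in {1:N0} ms" -f $name, $elapsed.TotalMilliseconds
    }
    catch {
        # A timeout or failed lookup ends up here.
        "{0}: lookup failed ({1})" -f $name, $_.Exception.Message
    }
}
```

Lookups that intermittently take several seconds or fail outright, while the same names resolve instantly from an internal machine, would point at the path between the DMZ and the internal network rather than at the DNS servers.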

Upon inspection of our Netscreen 25 firewall, we noticed a lot of error messages about the source IP session limit being exceeded.

This turned out to be by design. Our DMZ had previously been configured with IP-based session limits, with a threshold of 128 sessions per source IP. That limit was being exceeded by the large number of ActiveSync users we now have. We bumped the number up to 512, and our problems are now resolved.
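To get a rough feel for how close you are to a limit like this, you can count the established HTTPS connections on a CAS server by remote address. This is a sketch under assumptions: it assumes ActiveSync traffic arrives on port 443 and that TMG shows up as a single remote address, and it only sees connections to this one server, so the firewall's own session count (which also includes DNS, LDAP and other traffic from the same source) will be higher.

```powershell
# The old per-source-IP session limit on the firewall, used here only for comparison.
$limit = 128

# Pull the remote address out of every established connection to local port 443.
netstat -n |
    ForEach-Object {
        if ($_ -match '^\s*TCP\s+\S+:443\s+(\S+):\d+\s+ESTABLISHED') { $matches[1] }
    } |
    Group-Object |
    Sort-Object Count -Descending |
    Select-Object Name, Count, @{ Name = 'OverLimit'; Expression = { $_.Count -gt $limit } }
```

If one remote address accounts for most of the connections, that address is the one most likely to be hitting a per-source-IP limit on the firewall.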

