
Monday, March 19, 2012

Multiple Exchange UM servers and Microsoft Lync

A few weeks ago I was troubleshooting an issue with the Exchange 2010 UM auto attendant. When we called the attendant and asked it to call a user within the organization, the call would fail.
  
This customer has two Exchange UM servers, Node01 and Node02. We configured Exchange UM to use Lync as the UM IP gateway and everything worked well, except the Exchange UM auto attendant: when we called the attendant and asked it to call a user, the call failed. We also saw the following event in the application event log:

Event ID: 1400
Source: MSExchange Unified Messaging
The following UM IP gateways did not respond as expected to a SIP OPTIONS request.
Transport = TLS, Address = lyncpool.domain.com, Port = 5061, Response Code = 0, Message = This operation has timed out.
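
As a side note, the detail line of that event is easy to pick apart when you have to compare many of these warnings. The following is only a minimal Python sketch: the function name `parse_gateway_failure` and the field names in the returned dictionary are made up for illustration, and the pattern assumes the line has exactly the layout shown above.

```python
import re

# Detail line copied from the event 1400 text above.
EVENT_TEXT = (
    "Transport = TLS, Address = lyncpool.domain.com, Port = 5061, "
    "Response Code = 0, Message = This operation has timed out."
)

# Assumed layout: five "Key = value" pairs in this fixed order.
_PATTERN = re.compile(
    r"Transport = (?P<transport>\w+), "
    r"Address = (?P<address>[^,]+), "
    r"Port = (?P<port>\d+), "
    r"Response Code = (?P<response_code>\d+), "
    r"Message = (?P<message>.*)"
)


def parse_gateway_failure(line):
    """Return the fields of an event 1400 detail line, or None if it does not match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared as numbers.
    fields["port"] = int(fields["port"])
    fields["response_code"] = int(fields["response_code"])
    return fields


if __name__ == "__main__":
    print(parse_gateway_failure(EVENT_TEXT))
```

A response code of 0 is not a SIP status code, which fits the timeout message: the gateway never answered the OPTIONS request at all.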

Event 1400 (Warning, MSExchange Unified Messaging) had been appearing regularly in the event logs on the Exchange UM servers, but I didn't pay too much attention to it because everything seemed to be working (I had only tested the Exchange UM mailbox, not the auto attendant). Exchange has the Lync Mediation Server pool configured as the UM IP gateway using TLS. The TLS certificate installed on the Exchange servers for UM had the following parameters:
  • Common Name: UM.domain.local 
  • Subject Alternative Names: um.domain.local, Node01.domain.local, Node02.domain.local. 
I would like to stress that users were able to access their UM mailbox, and were able to retrieve or leave a voice message in the UM mailbox using Lync (so TLS communication between Lync and Exchange did work there).

To troubleshoot this, I raised the event logging level for Exchange UM on the Exchange servers to Expert, installed Wireshark to monitor the network traffic, and enabled logging on the Lync servers. I then repeated the test of having the Exchange UM attendant call a Lync user.

As expected, the call failed. The application log on the Exchange server and the Lync logging didn't show any useful information, other than that the communication terminated unexpectedly. However, the Wireshark traces showed that only authentication traffic was passing between the two servers. Although the logs did not explicitly show that authentication was failing, I presumed that TLS authentication was the problem, as that was the only traffic recorded between the two servers.

I inspected the Exchange certificate over and over again, but as far as I could tell nothing was wrong with it. After spending hours searching the Internet I found two similar cases. One had the same event ID but was using OCS with a wildcard certificate, which was not supported. The other had a single UM server; its author opened a call with Microsoft, and the troubleshooting showed that the problem occurred because the subject name of his certificate was set to the external name of Exchange OWA.

At first I didn't pay much attention to that post, because I was still convinced that all PKI requirements were met. Up to that point I had never paid much attention to the common name value; I only made sure that all the names that could be used to communicate with the server array were present in the Subject Alternative Names. I always set the common name to the external name of the server array, which is in line with Microsoft best practice:


[Quote]
As a best practice, you should minimize the number of certificates you use for your Client Access servers, reverse proxy servers, and transport servers (Edge and Hub). We recommend using a single certificate for all of these service endpoints in each datacenter. This approach minimizes the number of certificates that are needed, which reduces both cost and complexity for the solution.
[Unquote]

Source: http://technet.microsoft.com/en-us/library/dd638104.aspx 

Running out of ideas, I decided to replace the Unified Messaging certificate on one server with one whose common name matches the FQDN of that server. I stopped the MSExchangeUM service on the other server to make sure that the server with the new certificate would be used. I resumed testing and, to my surprise, the attendant was now able to call users through Lync.
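
The difference between the two certificates can be illustrated with a small Python sketch. It assumes, based purely on the behaviour observed here, that the other side compares the certificate's common name with the FQDN of the UM server and does not fall back to the Subject Alternative Names. The function name `describe_certificate` and the SAN list of the replacement certificate are made up for illustration; only the names of the original certificate come from this post.

```python
def describe_certificate(common_name, san_names, server_fqdn):
    """Report how a server FQDN relates to a certificate's CN and SAN list.

    DNS names are compared case-insensitively.
    """
    fqdn = server_fqdn.strip().lower()
    return {
        # Assumed check: the CN itself must equal the server FQDN.
        "cn_matches": common_name.strip().lower() == fqdn,
        # For comparison: is the FQDN at least listed as a SAN?
        "in_san": fqdn in [name.strip().lower() for name in san_names],
    }


if __name__ == "__main__":
    # Original certificate: array name as CN, node names only in the SAN list.
    original = describe_certificate(
        "UM.domain.local",
        ["um.domain.local", "Node01.domain.local", "Node02.domain.local"],
        "Node01.domain.local",
    )
    # Replacement certificate: CN set to the node FQDN (SAN list is hypothetical).
    replacement = describe_certificate(
        "Node01.domain.local",
        ["Node01.domain.local"],
        "Node01.domain.local",
    )
    print(original)
    print(replacement)
```

With the original certificate the node name is present as a SAN but the common name does not match, which is exactly the combination that failed; with the replacement the common name matches.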
 

This outcome made me wonder about, and doubt, everything I knew about PKI so far. With every issue I encounter, I try to explain it to myself: why did the issue occur, and what can I do to prevent it?

I have been deploying Exchange for many years now and never ran into any issues regarding PKI, so this encounter shook my world. It seemed that the way I was deploying Exchange certificates had a flaw. But if it has a flaw, how come I never ran into any similar issues before?
I have to admit that I haven't deployed many UM server roles, as many enterprises already have an existing solution. But I surely did a fair share of Exchange deployments with multiple Hub/CAS servers and never ran into an issue concerning certificates.

Maybe there was nothing wrong with the certificate, and the common name of the array could still be used if I changed the UM server name with the Set-UMServer cmdlet. The UM configuration was still pointing to each server individually. But if I change the UM server to represent the name of the array, will we lose high availability? When round robin is used, clients are pointed to servers that may or may not be online...

What about manageability? If the common name has to be the FQDN of the server, you need to run a certificate request on each server, and each server will have its own private key. If you use one common certificate for all servers, you need to replace the certificate on every server whenever it changes, so that they all keep using the same private key.
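
To make the per-server option concrete, here is a rough Python sketch that works out which subject and SAN names each request would carry. It only builds a plan as plain data and does not talk to Exchange or a CA; the function name `per_server_subjects` and the idea of adding the shared array name to every SAN list are assumptions for illustration.

```python
def per_server_subjects(server_fqdns, shared_names=()):
    """Build one certificate plan per server, with the server FQDN as common name."""
    plans = []
    for fqdn in server_fqdns:
        # The server's own FQDN comes first, followed by any shared names
        # (for example the array name), skipping duplicates of the FQDN.
        san_names = [fqdn] + [
            name for name in shared_names if name.lower() != fqdn.lower()
        ]
        plans.append(
            {"server": fqdn, "common_name": fqdn, "san_names": san_names}
        )
    return plans


if __name__ == "__main__":
    for plan in per_server_subjects(
        ["Node01.domain.local", "Node02.domain.local"],
        shared_names=["um.domain.local"],
    ):
        print(plan)
```

Each entry corresponds to one request that has to be generated on, and completed for, its own server, which is the extra management work weighed above.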

Is there an advantage to using a single shared private key among all your servers? Hmmm, not sure. In the case of Exchange UM surely not, as it is real-time, and in case of failover the session would always be lost. But what about other protocols (SMTP, HTTPS, RPC/MAPI)? No, I don't think so. Even if you have hardware load balancers in place, a new session will be created when a failover occurs.

The more I ponder the subject, the more questions arise in my mind.


 

 
