
Introduction to Managed Availability: How to Check, Recover, and Maintain Your Exchange Organization, Part II

Now that you’ve finished Part I of my three-part Managed Availability blog series, I will go a bit deeper and provide some examples of the functionality and operability of Managed Availability. My virtual test lab contains a two-member DAG based on Windows Server 2012 and Exchange 2013 CU6.

  1. Identify Unhealthy Health Sets and their error description

To get the server state, run the following cmdlet within the Exchange Management Shell:

Get-HealthReport -Server <ServerName> | where {$_.AlertValue -ne "Healthy" -and $_.AlertValue -ne "Disabled"}

[Screenshot 1]

This cmdlet shows several HealthSets that are Unhealthy. In this example, let’s take a look at the HealthSet Clustering, which has 5 Monitors.

Note: the value "NotApplicable" relates to whether Monitors have been disabled via Set-ServerComponentState for their component; most Monitors do not depend on a server component and therefore report "NotApplicable."
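If you want to double-check that, you can query the component states directly. A minimal sketch (the server name is a placeholder for one of your DAG members):

# List all server components that are not Active, e.g. because they were switched off with Set-ServerComponentState
Get-ServerComponentState -Identity <ServerName> | Where-Object {$_.State -ne "Active"} | ft Component, State -AutoSize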

Because the Clustering HealthSet has 5 Monitors, we check which Monitors are in an Unhealthy state:

Get-ServerHealth -Identity <ServerName> -HealthSet Clustering

[Screenshot 2]

The Monitor ClusterGroupMonitor is in an Unhealthy state. To get all the information, especially the appropriate Probe, query the Event Viewer in a readable output format with the following cmdlet:

(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*ClusterGroupMonitor*"}

[Screenshot 3]

This output has two important values to identify the “real problem” of the Clustering HealthSet:

  1. SampleMask: defines the substring ("ClusterGroupProbe\MSExchangeRepl") that the Probe results belonging to the Monitor ClusterGroupMonitor carry in their name
  2. ScenarioDescription: shows more information about the issue

From the output above we learn that the scenario is to validate that HA health is not impacted by cluster-related issues, and that is exactly what we want to fix.

You can re-run individual Probe checks with the cmdlet Invoke-MonitoringProbe <HealthSetName>\<ProbeName> -Server <ServerName> | fl

Note: For reference, you can take a look at the Exchange 2013 Management Pack Health Sets: http://technet.microsoft.com/en-us/library/dn195892(v=exchg.150).aspx

Important: this cmdlet is only available if your Exchange servers are configured for time zones UTC and UTC-. The cmdlet doesn’t work with time zones UTC+ (hopefully Microsoft will fix this issue in the near future).
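For our Clustering example, such a re-run could look like this (the server name is a placeholder, and the selected properties ResultType, Error, and Exception are just a suggestion):

# Re-run the ClusterGroupProbe of the Clustering health set on one DAG member
Invoke-MonitoringProbe Clustering\ClusterGroupProbe -Server <ServerName> | fl ResultType, Error, Exception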

Let’s take a further look at the Probe configuration for the ClusterGroupProbe:

(Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*ClusterGroupProbe*"}

[Screenshot 4]

The next step is to identify the complete error message so that every administrator knows what he or she has to do:

$Errors = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='ClusterGroupProbe/MSExchangeRepl'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
$Errors | select -Property *time,result*,error*,*context

[Screenshot 5]

Result: the quorum resource "Cluster Group" is not online on server "xsrvmail2", and the Database Availability Group "E2K13-TAP" may not be reachable or may have lost redundancy.
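To verify this finding outside of Managed Availability, you can ask the cluster itself. A minimal sketch, assuming the FailoverClusters PowerShell module is available on the DAG member (the output shown here is not from the article):

# Check whether the quorum resource "Cluster Group" is online and which node owns it
Import-Module FailoverClusters
Get-ClusterGroup -Name "Cluster Group" | ft Name, OwnerNode, State -AutoSize
Get-ClusterNode | ft Name, State -AutoSize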

  2. Why couldn’t Managed Availability resolve this issue by itself?

Managed Availability is a “self-healing” component of Exchange 2013. As described in the steps above, Responders are responsible for trying to repair the Exchange organization on their own, without any administrator intervention. Let’s take a look at which Responders are relevant for the Unhealthy Clustering HealthSet.

To display all Probes, Monitors, and Responders of the HealthSet Clustering, run the following cmdlet in the Exchange Management Shell:

Get-MonitoringItemIdentity -Identity Clustering -Server <ServerName> | ft name,itemtype,targetresource -AutoSize

[Screenshot 6]

You can see 3 Escalate Responders, based on the “Name” attribute:

  • ClusterEndpointEscalate
  • ClusterServiceCrashEscalate
  • ClusterHangEscalate
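To pick these out directly, you can filter the output of Get-MonitoringItemIdentity down to Responders only (again, the server name is a placeholder):

# Show only the Responders of the Clustering health set
Get-MonitoringItemIdentity -Identity Clustering -Server <ServerName> | ? {$_.ItemType -eq "Responder"} | ft Name, ItemType, TargetResource -AutoSize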

To identify the correct Responder for our Monitor ClusterGroupMonitor, run the following cmdlet in the Exchange Management Shell:

$DefinedResponders = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml

$DefinedResponders | ? {$_.AlertMask -like "*ClusterGroupMonitor*"} | fl Name,AlertMask,EscalationSubject,EscalationMessage,UpdateTime

[Screenshot 7]

As you can see in the screenshot above, the appropriate Responder is called ClusterGroupEscalate, shown here with the parameters Name, AlertMask, EscalationSubject, EscalationMessage, and UpdateTime.

Remember: Escalate Responders write an entry in the Event Viewer to inform an administrator. This means that issues with the HealthSet Clustering cannot be recovered automatically through Managed Availability.
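If you want to read what such an Escalate Responder wrote for the administrator, the alerts land in the Microsoft-Exchange-ManagedAvailability/Monitoring crimson channel. A small sketch (the server name is a placeholder, and the message filter is just an example to narrow the output down to cluster-related alerts):

# Read the alerts raised by Escalate Responders and keep the cluster-related ones
Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ManagedAvailability/Monitoring | ? {$_.Message -like "*Cluster*"} | select TimeCreated, Id, Message | fl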

For completeness, let’s look at an example with the HealthSet OWA.Protocol:

[Screenshot 8]

As you can see in the screenshot above, there are many more Responder types here than for the HealthSet Clustering.

To identify the correct Responder for our Monitor OwaSelfTestMonitor with all necessary information, run the following cmdlet:

$DefinedResponders = (Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml

$DefinedResponders | ? {$_.AlertMask -like "*OwaSelfTestMonitor*"} | fl Name,AlertMask,EscalationSubject,EscalationMessage,UpdateTime

[Screenshot 9]

You can see two Responders:

  • OwaSelfTestEscalate: a ping request failed and administrative intervention is needed (Escalate)
  • OwaSelfTestRestart: this Responder carried out a recovery action (but which one exactly?)

To find out which recovery action the Responder OwaSelfTestRestart performs, let’s grab all the information about its configuration:

(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*OwaSelfTestRestart*"}

[Screenshot 10]

As you can see in the parameter ThrottlePolicyXml, which is customizable, the Responder definition contains several throttling settings (see the sketch after this list):

  • RecycleApplicationPool "MSExchangeOWAAppPool" is self-explanatory
  • ThrottleConfig Enabled: whether the ThrottlePolicyXml is enabled (True) or disabled (False)
  • LocalMinimumMinutesBetweenAttempts: the minimum number of minutes between two recovery attempts on this server
  • LocalMaximumAllowedAttemptsInOneHour: how many recovery actions can be taken on this server within one hour
  • LocalMaximumAllowedAttemptsInADay: how many recovery actions can be taken on this server within one day
  • GroupMinimumMinutesBetweenAttempts: the minimum number of minutes between two recovery attempts within the DAG or array
  • GroupMaximumAllowedAttemptsInADay: how many recovery actions can be taken within the DAG or array within one day
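If you want to read these values without scrolling through the raw XML, you can cast the ThrottlePolicyXml string to XML and select the attributes listed above. A small sketch, run locally on the Exchange server (the element and attribute names are taken from the output above; adjust the XPath if the structure differs on your servers):

# Grab the definition of the OwaSelfTestRestart Responder
$Definition = (Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "*OwaSelfTestRestart*"} | select -First 1

# ThrottlePolicyXml is stored as a string, so cast it to XML and read the ThrottleConfig attributes
[XML]$Throttle = $Definition.ThrottlePolicyXml
$Throttle.SelectNodes("//ThrottleConfig") | select Enabled, LocalMinimumMinutesBetweenAttempts, LocalMaximumAllowedAttemptsInOneHour, LocalMaximumAllowedAttemptsInADay, GroupMinimumMinutesBetweenAttempts, GroupMaximumAllowedAttemptsInADay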

Next, you should take a look at the Microsoft-Exchange-ManagedAvailability/RecoveryActionResults crimson channel for entries. Event 500 indicates that a recovery action has begun, and event 501 indicates that the action taken has completed; failed actions are logged in this channel as well, together with their error.

Note: you get a better overview if you open the log Microsoft-Exchange-ManagedAvailability/RecoveryActionResults directly in the Event Viewer. For specific troubleshooting, I prefer the Exchange Management Shell.

$RecoveryActionResults = Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults

$XML = ($RecoveryActionResults | Foreach-Object -Process {[XML]$_.toXml()}).event.userData.eventXml

$XML | Where-Object {$_.State -eq "Finished" -and $_.RequestorName -eq "OwaSelfTestRestart"}

Note: you can filter the log down to the current day if there are too many items logged. It’s easy to use the EndTime property, as in the following cmdlet: $XML | Where-Object {$_.State -eq "Finished" -and $_.EndTime -like "2014-08-28T18*" -and $_.RequestorName -eq "OwaSelfTestRestart"}. RequestorName is your appropriate Responder, such as "OwaSelfTestRestart".

[Screenshot 11]

The screenshot above demonstrates two different recovery actions:

- The first one, from last year, recycled the MSExchangeOWAAppPool (yes, Exchange works very well).

- The second one additionally created a Watson dump in parallel because the MSExchangeOWAAppPool application pool crashed.

For a general overview of all recovery actions, take a look at the Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs crimson channel:

[Screenshot 12]

I prefer to use the Event Viewer here because it is clearer. But if you like to search for specific recovery actions or a specific point in time, feel free to use the Exchange Management Shell and create your own additional filters if you need them:

$RecoveryActionLogs = Get-WinEvent -ComputerName <ServerName> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs

$XML = ($RecoveryActionLogs | Foreach-Object -Process {[XML]$_.toXml()}).event.userData.eventXml

$XML | Where-Object {$_.State -eq "Finished" -and $_.RequestorName -eq "OwaSelfTestRestart"}
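If you just want a quick overview of which Responders requested recovery actions most often, you can also group the same data instead of filtering it (using only the properties already shown above):

# Count the logged recovery action entries per requesting Responder
$XML | Group-Object -Property RequestorName | Sort-Object Count -Descending | select Count, Name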

 – Happy Monitoring! =)
