Archive for the ‘Monitoring and Management’ Category

Getting cruft objects out of Operations Manager

Recently I had to downgrade a SQL Express instance from the 2008 version back to 2005.  The downgrade solved my DB performance problems, but created a monitoring problem.  Operations Manager continued to believe that this server was running SQL 2008!

So, how do you get rid of a monitored object that is part of a dynamically discovered group?  The answer lies (as with most OpsMgr problems) in overrides:

http://blogs.msdn.com/boris_yanushpolsky/archive/2007/11/20/opsmgr-sp1-removing-instances-for-which-discovery-is-disabled.aspx

Boris of OpsMgr++ fame tells us to use the “Authoring” view in the OpsMgr console to find the “Object Discovery” rule that found your SQL instance (probably “SQL 2008 DB Engine”).  You then create an override that disables discovery for your named computer.  Since SQL discovery runs fairly infrequently, you may also want to override the same rule for all computers, forcing discovery to a more frequent interval (say… 300-600 seconds).

After discovery completes, open the OpsMgr PowerShell console, and run the “Remove-DisabledMonitoringObject” cmdlet (with no arguments).  If you are exceptionally lucky, your undesired object will disappear from the OpsMgr Monitoring view in short order.

OpsMgr Severity/Priority Levels: "What 'IS' is."

When working with OpsMgr overrides, I am always forgetting the mappings between alert severities and their corresponding numeric values in the database.  It is important to keep this straight, because if you set your overrides incorrectly, you risk either suppressing all notification for an alert, or even worse… increasing the number of notifications that you receive!

Marius provides the following mapping info in his fine blog on MSDN:

Mapping:

Alert Severity – Its corresponding integer value

Critical – 2
Warning – 1
Information – 0

Alert Priority – Its corresponding integer value

High – 2
Medium – 1
Low – 0

Read more here:

http://blogs.msdn.com/mariussutara/archive/2007/12/17/alert-severity-and-priority-use-with-override.aspx

So remember, when downgrading an alert from "Critical" to "Warning", change it from "Severity 2" to "Severity 1".  "Severity 3" will just cause more paging… TWTTTTH!
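As a memory aid, the mapping above can be written down as a small lookup table.  This is only a sketch in Python: the names ALERT_SEVERITY, ALERT_PRIORITY, override_value and is_valid_value are made up for illustration and are not part of OpsMgr.  It simply refuses any level or integer that is not in the list quoted from Marius.

```python
# Severity and priority integers exactly as listed in the mapping above.
ALERT_SEVERITY = {"Critical": 2, "Warning": 1, "Information": 0}
ALERT_PRIORITY = {"High": 2, "Medium": 1, "Low": 0}


def override_value(mapping, name):
    """Return the integer to type into an override for a named level."""
    try:
        return mapping[name]
    except KeyError:
        raise ValueError(
            f"unknown level {name!r}; expected one of {sorted(mapping)}"
        ) from None


def is_valid_value(mapping, value):
    """True only if the integer corresponds to a named level in the mapping."""
    return value in mapping.values()


if __name__ == "__main__":
    # Downgrading "Critical" to "Warning" means entering this value.
    print(override_value(ALERT_SEVERITY, "Warning"))
    # 3 is not in the mapping, so it is flagged before it reaches an override.
    print(is_valid_value(ALERT_SEVERITY, 3))
```

Checking a value against the table before typing it into an override is a cheap way to catch the "Severity 3" mistake.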

Missing agents, vulnerable communications channels, secret principal names, and invalid names… a day in the life of an OpsMgr 2007 user.

Have you ever tried to configure an MS Operations Manager 2007 agent on a system and had it report no errors, yet still show a status of “not monitored” in the OpsMgr console?  Have you wasted countless hours over several weeks doing packet captures and advanced system debugging under the incorrect assumption that this was a network communications problem?  Have you ever had your consultant change the primary DNS suffix on a server that you monitor without telling you, thus creating the whole problem in the first place?  No?  Well, read on anyway… if it happens to you later, you will know what to do.

A few months ago we had a consultant on site to set up services on two new Windows Server 2003 hosts.  It was another one of those fun “n-tier J2EE” things.  DNS names were requested for “hyperion10.uvm.edu” and “hyperion11.uvm.edu”.  However, since the hyperion hosts were connected to our  “campus.ad.uvm.edu” domain, their “internal” names were appended with the “campus.ad.uvm.edu” suffix.  Thus, these servers thought of themselves as “hyperion1x.campus.ad.uvm.edu”, even though there was no legitimate DNS entry for these names. 

For most services, this is not a problem.  However, we discovered that some hyperion services advertise themselves using this internal computer name, rather than a name chosen by the application administrator.  To work around the issue, we requested manual DNS entries be generated for “hyperion10.campus.ad.uvm.edu” and “hyperion11.campus.ad.uvm.edu”.  Unfortunately, the consultant decided to change the DNS suffix of the hyperion hosts’ internal computer names instead of waiting for the new DNS entries.  This solved the consultant’s problem and did not create any immediate issues, so he moved on.  Months later, this decision would make the OpsMgr admin very unhappy.

Here is what broke… the Hyperion systems now tried to update the “DNS Suffix” attributes of their computer objects in Active Directory.  By default, Server 2003 AD performs “validation” on DNS suffix registrations, and disallows names that are not in the AD forest.  Thus, the DNS suffix change was denied in AD, and an event was logged in the System event log:

Source: NETLOGON
Event ID: 5789
Description:  Attempt to update DNS Host Name of the computer object in Active Directory failed.  The updated value was ‘HYPERION10.uvm.edu’.  The following error occurred:
The parameter is incorrect.

This is pretty innocuous, and went unnoted.  However, the mismatch of the computer’s perceived FQDN and its registered FQDN in AD completely broke Kerberos authentication on this system.  Because AD did not know of a host called “hyperion10.uvm.edu”, it never generated a Kerberos SPN (Service Principal Name) for this host.  A legitimate SPN is required for Kerberos auth to function.  NTLM authentication still worked, so no one noticed the problem again. 
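The mismatch described above can be sketched as a plain suffix comparison.  This is a simplified model in Python, not how Active Directory actually implements validation: the function names are hypothetical, and the list of allowed suffixes is assumed to hold only the AD domain name used in this story.

```python
def dns_suffix(fqdn):
    """Return everything after the first label, lower-cased, no trailing dot."""
    _host, _dot, suffix = fqdn.rstrip(".").lower().partition(".")
    return suffix


def passes_suffix_validation(fqdn, allowed_suffixes):
    """Simplified check: is the host's DNS suffix one the directory accepts?"""
    allowed = {s.rstrip(".").lower() for s in allowed_suffixes}
    return dns_suffix(fqdn) in allowed


if __name__ == "__main__":
    # Assumption: only the AD domain itself is an accepted suffix.
    allowed = ["campus.ad.uvm.edu"]
    for name in ("hyperion10.campus.ad.uvm.edu", "HYPERION10.uvm.edu"):
        print(name, passes_suffix_validation(name, allowed))
```

Under this model the original internal name is accepted and the renamed host is rejected, which matches the NETLOGON 5789 event shown above.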

Two months ago, we installed an Operations Manager 2007 server.  All of our managed servers took their agents without complaint, except for the blasted Hyperion servers.  Since these systems were on the opposite side of a firewall, we naturally blamed the firewall and spent a lot of time performing “Wireshark” packet captures, looking at “netstat” output, and running “procmon” on the management server.  

The breakthrough finally came yesterday when I had a look at the Operations Manager event log on the hyperion servers (which were running the OpsMgr agents).  The following error was found in the log several times:

Source: OpsMgr Connector
Event ID: 21016
Description:  OpsMgr was unable to set up a secure channel to <fqdn of RMS> and there are no failover hosts…

I did some poking at news.microsoft.com in the operations manager groups.  I searched for threads with “agent” and “monitored” (as in the “not monitored” status of the agents in the console).  There I found the suggestion that Kerberos problems can prevent secure communications between OpsMgr agents and the RMS.  There was a suggestion that Kerberos logging be enabled to rule this out as a problem.  Thus, I added the following reg values to the Hyperion servers:

Key: HKLM\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters
Value: REG_DWORD LogLevel
Data: 1
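If you have several servers to touch, the value above can be scripted rather than typed into regedit.  The following Python sketch only builds the command lines for the standard “reg” tool and prints them; the function names are made up for illustration, and the “disable” variant assumes that deleting the LogLevel value is how you turn logging back off.

```python
# Key and value name exactly as given in the text above.
KERBEROS_PARAMETERS_KEY = (
    r"HKLM\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters"
)
VALUE_NAME = "LogLevel"


def enable_kerberos_logging_command():
    """Argument list for 'reg add' that sets LogLevel (REG_DWORD) to 1."""
    return [
        "reg", "add", KERBEROS_PARAMETERS_KEY,
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", "1",
        "/f",  # overwrite without prompting
    ]


def disable_kerberos_logging_command():
    """Argument list for 'reg delete' that removes the LogLevel value."""
    return ["reg", "delete", KERBEROS_PARAMETERS_KEY, "/v", VALUE_NAME, "/f"]


if __name__ == "__main__":
    # Print the commands for review; pass the list to subprocess.run()
    # on the target server if you want to execute them.
    print(" ".join(enable_kerberos_logging_command()))
    print(" ".join(disable_kerberos_logging_command()))
```

Kerberos logging is chatty, so it is worth keeping the second command handy for when the troubleshooting is done.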

A reboot was necessary to activate logging.  Soon we had the culprit captured in the system log:

Source: Kerberos
Event ID: 3
Description:  A Kerberos Error Message was received:
on logon session

Error Code: 0x7 KDC_ERR_S_PRINCIPAL_UNKNOWN
Server Realm: CAMPUS.AD.UVM.EDU
Server Name: host/hyperion10.uvm.edu

Ah!  No principal existed for hyperion10.uvm.edu!  And thus, the OpsMgr agent could not create a secure channel with the server using Kerberos, which is the only method implemented in OpsMgr without resorting to certificate-based authentication.

Now that I knew what the problem really was, fixing the problem was easier (although not easy).  The following KB contained info on fixing DNS mismatches between the host and Active Directory:
http://support.microsoft.com/kb/258503

There, we are instructed to add the required Service Principal Name directly to Active Directory.  This was pretty easy… we just need the Windows Server 2003 Resource Kit Tools, and then we run:

setspn -a host/hyperion10.uvm.edu hyperion10
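With more than one host to fix (we had hyperion10 and hyperion11), the same command can be generated from the FQDN the server believes it has.  This is a minimal Python sketch that only builds the setspn argument lists; the function names are hypothetical, and it assumes the computer account name is simply the first label of the FQDN, as it was in our case.

```python
def host_spn(fqdn, service="host"):
    """Build an SPN such as host/hyperion10.uvm.edu from a service and FQDN."""
    return f"{service}/{fqdn.rstrip('.').lower()}"


def setspn_add_command(fqdn, account=None):
    """Argument list for 'setspn -a <spn> <account>'."""
    if account is None:
        # Assumption: the computer account is named after the first DNS label.
        account = fqdn.split(".", 1)[0].lower()
    return ["setspn", "-a", host_spn(fqdn), account]


def setspn_list_command(account):
    """Argument list for 'setspn -l <account>', to review the result."""
    return ["setspn", "-l", account]


if __name__ == "__main__":
    for name in ("hyperion10.uvm.edu", "hyperion11.uvm.edu"):
        print(" ".join(setspn_add_command(name)))
    print(" ".join(setspn_list_command("hyperion10")))
```

Listing the SPNs on the account afterwards is a quick way to confirm that the new entry was actually registered.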

We also needed to fix the mismatch in DNS suffixes.  The KB above suggests removing the requirement for client computer DNS suffix validation throughout the entire domain.  This sounded like a bad idea to me, so I did some investigating, and found that you can modify the ACL of a computer object in Active Directory to grant the “SELF” object the “Write DNS Host Name Attributes” right, found under the “Properties” tab in the AD Users and Computers MMC (there is also “Write dNSHostName”… probably the same thing).  I added this right, then rebooted the servers.  The Event IDs discussed above all went away!  Start the party!  Pop the cork!  A quick agent re-install and our bloody Hyperion systems are now being monitored.

I am not sure what the moral of the story is… always grant your consultants rights to your DNS server?  Watch your consultants like a hawk 24×7?  Don’t bother with system monitoring as it is a time sink?  Always take a nap under your desk at lunch time?  Feel free to draw your own conclusions…