Troubleshooting APM in SCOM 2012

Recently I had the opportunity to troubleshoot SCOM at a client. Here’s my experience with APM.

My client had two web servers, not joined to the domain. Both of these servers hosted an application that we wanted to instrument with APM. 01 wasn’t working with APM but was working with other OpsMgr alerts and rules. 02 was working for APM and for other alerts and rules.

  • These servers are not domain joined (verified the runas account, cert thumbprint, communication port, etc…)
  • The two servers are exact clones of one another (same serial number listed in SCOM; not ideal, but I was unsure whether this was causing the issue)
  • The OpsMgr agent installs on both: 02 gets the instrumentation, 01 does not
  • Both servers show the same discovery information from the IIS MP
  • Once I set up the .NET template, I worked through these high-level troubleshooting steps:
    • Created a custom group containing only Web01 and scoped a new .NET monitor to the group
    • On the .NET template, removed the group and flushed the health service
    • Tried putting Web01 in maintenance mode for 10 min
    • Verified discovered inventory on the health service
    • Viewed the failed / applied rules & monitors (no failures)
    • Set up a new .NET template from scratch
    • Nada

The entire time, I was closely watching the event logs on Web01 for any signs. It turns out there was a bug in 2012 SP1: if the server name is discovered in both lowercase and uppercase throughout the OpsMgr console, it can give APM some heartburn. So I upgraded to 2012 SP1 UR4, which contains the fix.

Unfortunately, the issue wasn’t resolved, so I went through the steps once again with no positive result. Next I enabled verbose tracing, reproduced the issue, stopped verbose tracing, and went through the logs. I then decided to reapply the UR4 server update and ran LODCTR /R on Web01, which rebuilds the performance counter registry settings. I waited 5 minutes, and all was fixed.
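For reference, the counter rebuild from that last step can be scripted. This is a minimal sketch in Python, assuming it runs from an elevated session on the affected server. The function name `rebuild_perf_counters` and the `dry_run` flag are illustrative; only the `lodctr /R` command itself comes from the steps above.

```python
import subprocess


def rebuild_perf_counters(dry_run=True):
    """Rebuild the performance counter registry settings with `lodctr /R`.

    Returns the command that was (or would be) run. dry_run defaults to
    True so nothing changes on the server until you opt in.
    """
    command = ["lodctr", "/R"]
    if not dry_run:
        # Needs an elevated session on the affected Windows server.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Dry run: only print the command that would be executed.
    print(" ".join(rebuild_perf_counters()))
```

Calling `rebuild_perf_counters(dry_run=False)` actually runs the command. In my case the fix took a few minutes to show up afterwards.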

Manually Uninstall a Stubborn SCOM Agent

At work, I was recently given the task of upgrading our SCOM environment from 2007 R2 to 2012 SP1. “No problem, boss!” I said boastfully. After all, upgrades are typically seamless and always go smoothly.

After I took my medicine to cure delirium, I began working through the upgrade and reached a point where I needed to uninstall the 2007 R2 agent from a machine. However, every time I tried from the SCOM console, the uninstall failed for unspecified reasons. “No big deal,” I said. “I guess I will just have to manually uninstall. That should be easy!”

The medicine had apparently worn off.

When I attempted my uninstall, I ran into the following:

Alright! I’ve seen that a million times! “Error 25205 Failed to uninstall SDK MOF blah blah stuff and things I’m totally screwed.”

After 30 minutes of pure panic and 15 more of stress-eating yesterday’s donuts, I began the troubleshooting steps that eventually fixed this issue:

1) From a machine with a healthy SCOM agent, I copied mom_tracing.mof from “%ProgramFiles%\System Center Operations Manager 2007” to the same directory on my troubled machine:

2) Then, on the machine that was barking at me, I ran (from an elevated command prompt) mofcomp mom_tracing.mof:
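Steps 1 and 2 can be sketched in Python as follows. This is an illustration, not what I ran at the time: the function names and the `dry_run` flag are made up for the example, and the directories are passed in as parameters, so any host or share name you supply is your own.

```python
import shutil
import subprocess
from pathlib import Path

MOF_NAME = "mom_tracing.mof"


def copy_mof(healthy_dir, troubled_dir):
    """Copy mom_tracing.mof from the healthy agent's folder to the troubled one."""
    target = Path(troubled_dir) / MOF_NAME
    shutil.copy2(Path(healthy_dir) / MOF_NAME, target)
    return target


def compile_mof(mof_path, dry_run=True):
    """Return the mofcomp command for mof_path; run it unless dry_run is set."""
    command = ["mofcomp", str(mof_path)]
    if not dry_run:
        # Requires an elevated prompt on the troubled machine.
        subprocess.run(command, check=True)
    return command
```

Point `healthy_dir` at the agent folder on the healthy machine (for example over an admin share) and `troubled_dir` at the same folder locally, then call `compile_mof(copy_mof(healthy_dir, troubled_dir), dry_run=False)`.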

Naturally, mofcomp failed with an error of its own. Time to freak out, right? Nah…we’re good. I just went to steps 3 – 5:

3) I mapped a drive on my troubled machine to %WinDir%\system32\wbem on a machine with a healthy agent and copied *.mof and *.mfl to the same directory on my local machine:

4) Then, from an elevated command prompt, I ran for /f %s in ('dir /b *.mof *.mfl') do mofcomp %s to compile everything I had copied over (inside a batch file, %s would need to be written %%s):
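Steps 3 and 4 can be sketched the same way in Python. Again, this is an illustration under assumptions: the function names and the `dry_run` flag are hypothetical, and both directories are parameters (the healthy machine's wbem folder, reached via a mapped drive, and the local one). Files are compiled in the same order the `dir /b *.mof *.mfl` listing produces, all .mof files first, then all .mfl files.

```python
import shutil
import subprocess
from pathlib import Path

PATTERNS = ("*.mof", "*.mfl")


def find_wmi_sources(directory):
    """List .mof files, then .mfl files, each group sorted by name."""
    directory = Path(directory)
    return [p for pattern in PATTERNS for p in sorted(directory.glob(pattern))]


def copy_wmi_sources(healthy_wbem, local_wbem):
    """Copy every .mof/.mfl file from the healthy machine's wbem folder."""
    for source in find_wmi_sources(healthy_wbem):
        shutil.copy2(source, Path(local_wbem) / source.name)


def compile_all(local_wbem, dry_run=True):
    """Build one mofcomp command per file; run each unless dry_run is set."""
    commands = [["mofcomp", str(p)] for p in find_wmi_sources(local_wbem)]
    if not dry_run:
        for command in commands:
            # check=False: keep going if one file fails, as the for loop does.
            subprocess.run(command, check=False)
    return commands
```

With the default `dry_run=True`, `compile_all` only returns the commands, which is a handy way to review the list before compiling anything into the WMI repository.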

5) Then, a quick restart of the winmgmt service and I was able to uninstall the agent:
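The restart in step 5 could be scripted along these lines. This is a sketch: I'm assuming the restart is done with `net stop` / `net start`, and the function name and `dry_run` flag are illustrative.

```python
import subprocess

RESTART_COMMANDS = (
    # /y answers the prompt about stopping dependent services.
    ["net", "stop", "winmgmt", "/y"],
    ["net", "start", "winmgmt"],
)


def restart_winmgmt(dry_run=True):
    """Return the stop/start commands for winmgmt; run them unless dry_run."""
    commands = [list(command) for command in RESTART_COMMANDS]
    if not dry_run:
        for command in commands:
            # Requires an elevated session; stops at the first failure.
            subprocess.run(command, check=True)
    return commands
```

Keep in mind that services stopped as dependents of winmgmt may need to be started again by hand afterwards.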

After I cleaned up the piles of donut crumbs and changed out of my soiled drawers, I was then able to move forward knowing that I would not run into anything else that difficult during my upgrade! (Time to take my medicine again)