opalis: monitoring grey (gray) agents in opsmgr

let's talk about gr(a|e)y agents.  (did you like my nerdy regex reference?)  a friend of mine who is fairly new to opsmgr was chatting with me about grey agents one day, which led to a search on how to detect them (since opsmgr has no native way to do this).  that spurred an idea of something to try out.

for those that don't know, grey agents occur when an opsmgr agent goes into a strange state where it's possibly not being monitored (not communicating, healthservice isn't receiving data, etc).  essentially, the agent looks grey.  more detail about grey agents and how to troubleshoot them can be found here.

since grey agents can lead to grey hair, let's look at how to find them.

 

detecting grey agents

andreas zuckerhut posted a script that gets at this information quite easily through powershell.  here are the contents of the script:

$WCC = get-monitoringclass -name "Microsoft.SystemCenter.Agent"
$MO = Get-MonitoringObject -monitoringclass:$WCC | where {$_.IsAvailable -eq $false}
$MO | select DisplayName

simple, right? 

...and just for reference, here's a sql script which produces the same result:

SELECT ManagedEntityGenericView.DisplayName, ManagedEntityGenericView.AvailabilityLastModified
FROM ManagedEntityGenericView
INNER JOIN ManagedTypeView ON ManagedEntityGenericView.MonitoringClassId = ManagedTypeView.Id
WHERE (ManagedTypeView.Name = 'microsoft.systemCenter.agent') AND (ManagedEntityGenericView.IsAvailable = 0)
ORDER BY ManagedEntityGenericView.DisplayName
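
if you'd like the powershell output to line up with the sql columns, something like the sketch below might work.  Select-GreyAgent is a made-up helper name (it is not part of the original script), the sample objects are stand-ins so the snippet runs without an opsmgr connection, and I'm assuming the monitoring objects expose an AvailabilityLastModified property the way the sql view does.

```powershell
# hypothetical helper (not from the original script): passes through only the
# objects whose IsAvailable is $false and keeps the same two columns as the sql.
filter Select-GreyAgent {
    if ($_.IsAvailable -eq $false) {
        $_ | Select-Object DisplayName, AvailabilityLastModified
    }
}

# stand-in objects so the filter can be tried without a management group.
$sample = @(
    (New-Object PSObject -Property @{ DisplayName = 'web02'; IsAvailable = $false; AvailabilityLastModified = [datetime]'2010-03-02' }),
    (New-Object PSObject -Property @{ DisplayName = 'app01'; IsAvailable = $true;  AvailabilityLastModified = [datetime]'2010-03-01' }),
    (New-Object PSObject -Property @{ DisplayName = 'db01';  IsAvailable = $false; AvailabilityLastModified = [datetime]'2010-03-03' })
)

# same ordering as the ORDER BY clause in the sql version.
$grey = @($sample | Select-GreyAgent | Sort-Object DisplayName)
$grey
```

against a live connection, the same filter would sit behind the original call, along the lines of Get-MonitoringObject -monitoringclass $WCC | Select-GreyAgent | Sort-Object DisplayName.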

 

preparing for opalis

with this knowledge, you can do a number of different things to get this information to you in a useful way.  since opalis is the playground I seem to be in most these days, I decided to use it as the engine to make some stuff happen.  depending on how you want to proceed, you could use the sql object or the run .net object from opalis to get the information.  I chose the powershell path.

to get this to work in opalis, a few slight modifications had to be made to the original script.  basically, the script has to load the opsmgr snapin itself (the default profile doesn't have it), switch to the operationsmanagermonitoring:: provider, and connect to the management group.  I suppose you could make the default profile load the snapin?  anyhow, here's the modified script:

add-pssnapin microsoft.enterprisemanagement.operationsmanager.client
cd operationsmanagermonitoring::
new-managementgroupconnection myOpsMgrServer

$WCC = get-monitoringclass -name "Microsoft.SystemCenter.Agent"
$MO = Get-MonitoringObject -monitoringclass $WCC | where {$_.IsAvailable -eq $false}


additionally, I removed the last line of the original script since there's no need to send this through the select cmdlet.
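
a slightly more defensive take on those first three lines is sketched below.  the function names (Test-CommandAvailable and Initialize-OpsMgrShell) are my own inventions for illustration.  the idea is to load the snapin only when the opsmgr cmdlets aren't already there, and to add -ErrorAction Stop so a failed load halts the script at that point rather than possibly showing up later as an unrecognized cmdlet.

```powershell
# hypothetical helper: $true when a command with this name can be resolved.
function Test-CommandAvailable {
    param([string]$Name)
    # with the error suppressed, Get-Command returns nothing for an unknown name.
    return [bool](Get-Command -Name $Name -ErrorAction SilentlyContinue)
}

# hypothetical wrapper around the three setup lines of the modified script.
function Initialize-OpsMgrShell {
    param([string]$ManagementServer)

    # load the snapin only if the opsmgr cmdlets are not already present.
    if (-not (Test-CommandAvailable 'Get-MonitoringClass')) {
        Add-PSSnapin Microsoft.EnterpriseManagement.OperationsManager.Client -ErrorAction Stop
    }

    # same provider switch and connection as in the modified script.
    Set-Location 'OperationsManagerMonitoring::'
    New-ManagementGroupConnection $ManagementServer
}
```

calling Initialize-OpsMgrShell -ManagementServer myOpsMgrServer at the top of the script would then stand in for the snapin, cd, and connection lines.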

 

creating the opalis workflow

in a very simple sense, all you really need is one object, the "run .net script".  however, since we're in opalis, there are other useful things we could do.  anyway, here we go...

image

(I left out a start parameter on purpose.  that way, if you choose to do something like this, you can start it by whatever means you prefer.  I would probably use a scheduler object and have it run every hour or so.  because of the way opalis handles multiple values going through the pipeline, it is necessary to use junctions and text files to hold the data together to pass to the "send email" object.  I'll document this in more detail in the next blog post.  for now, just keep in mind that steps 1, 2, and 3 are of primary concern.)

the first step of this workflow kicks off the powershell script that detects grey agents.  we need to make sure the "run .net script" object is properly configured.  use the modified code snippet above for powershell as illustrated below.

image

to get the information out of this object, the variable in the script needs to be passed as published data. (if you need more information about it, I posted an article titled opalis: properly retrieving published data from powershell scripts that should be able to fill in the gaps.)  I set it up as follows:

image

after detecting grey agents, the workflow attempts to reach each server.  if a server is offline, that would certainly explain its grey condition.  now, the cool thing about the link coming off of "get computer/ip status" is that it defaults to passing along only the objects that return a success value.  the result is a list of computers that responded to a ping, yet are in grey status.

the last thing that occurs is to send that information to a designated recipient with the list of computers in the body of the message using the "send email" object.  that's pretty much it.
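
if you'd rather do the ping check and build the message body inside the script instead of with separate opalis objects, a rough sketch follows.  this is not how the workflow above does it.  Select-ReachableComputer and Format-GreyAgentBody are hypothetical names, the wording of the message is made up, and the default probe assumes Test-Connection is available (powershell v2 or later).  the probe is a parameter so the logic can be tried with a fake one first.

```powershell
# hypothetical helper: returns the names for which the probe reports $true.
function Select-ReachableComputer {
    param(
        [string[]]$ComputerName,
        # default probe sends one ping; -Quiet makes Test-Connection return a boolean.
        [scriptblock]$Probe = { param($name) Test-Connection -ComputerName $name -Count 1 -Quiet }
    )
    foreach ($name in $ComputerName) {
        if (& $Probe $name) { $name }
    }
}

# hypothetical helper: one header line followed by one computer per line.
function Format-GreyAgentBody {
    param([string[]]$ComputerName)
    $header = 'the following agents are grey but still answer a ping:'
    return (@($header) + $ComputerName) -join [Environment]::NewLine
}

# try it with a fake probe that treats 'down01' as offline (no network needed).
$fakeProbe = { param($name) $name -ne 'down01' }
$reachable = @(Select-ReachableComputer -ComputerName 'up01', 'down01', 'up02' -Probe $fakeProbe)
Format-GreyAgentBody -ComputerName $reachable
```

leaving off -Probe uses the real ping, so the names coming out of the grey agent query could be passed straight in.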

 

additional notes

so now you're wondering, what was all that junk up there?  why couldn't it have been as easy as this?

image

well, as I alluded to earlier, the way opalis handles multiple objects coming down the pipeline is rather interesting.  we'll talk about that more in the next post.

Comments

  1. when I run the add-pssnapin.... new-managementgroupconnection servername cmdlet via a PowerShell session on my Opalis Management Server, I'm golden. However, when I run the same cmdlet inside a "Run .Net Script" object - I receive the following error message:

    The term 'new-managementgroupconnection' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify that the path is correct and try again.

    I added Set-ExecutionPolicy Unrestricted to the top of the cmdlet and that didn't do anything. I've also ensured the OpsMgr console and command shell is installed on my Action servers. Any other ideas?

  2. hey seth -

    just remember that anything that runs in opalis is generally executed under the action service account (or some other specified security context). in your case, the interactive powershell works because the cmdlet is most likely in your path somewhere, whereas it does not reside in a path for the action service account.

    i would recommend running a powershell prompt as the action service account and trying to run through the add-pssnapin part.

  3. ok, i think we're getting somewhere now. i added my opalis action account (svc-opalis) to be a scom admin and now i can run powershell as the opalis action account and run all the commands successfully, whereas i was not able to do that before. but still when i run the script in the opalis, no dice. on my action server, the "opalis action service" is set to svc-opalis and the "opalis remoting service" is set to local system.
