I'll try to explain why we think this is happening and what can be done to fix the issue.

Symptoms:
Periodically you receive an error message about the Management Server not being able to access the nworks logs or the nworks context in WMI. We have a KB article about that: http://www.veeam.com/KB1600

Cause:
Because of the nature of our product, we create quite a large number of objects which are not managed or controlled by the virtualized infrastructure: all VMs, ESX hosts, Datastores and vCenter servers exist on top of the infrastructure which is running inside VMs. In SCOM, most objects (at least objects which have workflows) should be managed by a certain Health Service (an agent or a Management Server). For instance, if the SQL MP creates a database or a table object, it should be managed by the Health Service which is running on that SQL server. If something happens to this Health Service but the created object still exists and is not managed by any entity, it will be managed by the RMS. That's standard SCOM behavior.
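To see which Health Service manages a particular object, you can query the "Health Service should manage entity" relationship from the SCOM 2007 Operations Manager Shell. Here is a minimal sketch; the class name and the object name are assumptions which you should replace with the real ones from your environment:

    # The class name is an assumption - look up the actual nworks VM class in your MP.
    $vmClass = Get-MonitoringClass -Name "nworks.VMware.VirtualMachine"
    $vm = Get-MonitoringObject -MonitoringClass $vmClass | Where-Object { $_.DisplayName -eq "MyVM01" }

    # The source of the HealthServiceShouldManageEntity relationship is the
    # Health Service currently responsible for the object.
    $vm.GetMonitoringRelationshipObjectsWhereTarget() |
        Where-Object { $_.GetMonitoringRelationshipClass().Name -eq "Microsoft.SystemCenter.HealthServiceShouldManageEntity" } |
        ForEach-Object { $_.SourceMonitoringObject.DisplayName }

If the name printed at the end is your RMS, the object is one of the orphaned objects described below.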
In the case of our MP, all virtual infrastructure objects are managed by collector servers, i.e. by the respective Health Service which is installed on those machines. But there are a few exceptions:
1. The VM <-> Windows Computer relationship. It's disabled by default, but if you enable it, our MP will "ask" all SCOM agents to create this relationship. Creating the relationship also creates a "ghost" virtual machine object which is not managed by any Health Service (it cannot be managed by that agent, because all VMs are managed by collector computers). This "ghost" object should merge with the "real" VM, thus creating a relationship between the virtualized environment and the virtualization infrastructure.
2. Split vSphere clusters. On our collector you can split a cluster if it has too many objects. To support this, our collectors create several versions of the same cluster object, and only one of them is managed by a collector machine (the other objects are just "ghost" objects and should not be managed).
Usually everything works like a charm: our MP creates the relationship, the "ghost" VM merges with the real VM which is managed by a collector, and everything works as it should. BUT under certain circumstances (the most common one is that the VM name differs from the NETBIOS name of the computer inside it) this "ghost" object may never merge with the real VM. At that point it is orphaned and will be managed by the RMS, and when the RMS tries to manage it, it will try to open the nworks event log, which the RMS usually doesn't have - that's why you see the errors.
The same thing happens with split clusters. Usually only the real cluster is active in SCOM and a collector manages it, while the "ghost" cluster objects from other collectors just "merge" with the real one. BUT if the collector with the real cluster is experiencing issues uploading data to SCOM, a "ghost" version of the same cluster may arrive in SCOM earlier, and the RMS will try to manage it. Which is ... yep, the same errors again.
There is also one more thing: sometimes our customers just kill VMs with agents still running inside. This is not a very healthy practice for SCOM; it's better to uninstall the agent first. Now, if you kill a VM, our MP will remove it from the topology. But if our MP created a "ghost" VM object and this object has been merged with the "real" VM, then when the real VM is removed, SCOM may keep the "ghost" version of it, because the stub object of the SCOM agent is still present in the database. And the RMS could still try to manage this "ghost" version of the killed VM.

Resolution:
- First of all, if you are using the VM <-> agent relationship discovery, make sure the VM name and the NETBIOS name of the guest OS are the same. If this cannot be done, create a group for these VMs and create an override for this group on the "VMGUEST contains OpsMgr Agent" discovery. You will need to enable the "removeRelationship" property of this discovery. This will instruct SCOM to remove the "ghost" VM objects (see the sketch after this list).
- When you are decommissioning a VM, remove the SCOM agent from the machine before killing it. I believe there should be some process in place for decommissioning a VM (i.e. remove the machine from AD and any other corporate systems, including SCOM). If you still have stub objects in SCOM, try running the "Remove-DisabledMonitoringObject" command in the SCOM shell ("Remove-SCOMDisabledClassInstance" in SCOM 2012); it is also shown in the sketch after this list. If you still have stub VMs, try the same thing I suggested above for VMs with different names.
- Perform "Rebuild full topology" task in the nworks Enterprise Manager UI, this should do the final "clean-up" of the nworks topology.
The issue with split clusters should self-heal when the "real" versions of the cluster objects arrive in SCOM. However, there are a lot of other cases - for instance, a SCOM agent crashed, or our collector or the VM with the collector has been lost due to some disaster. All such cases should be investigated; I would recommend opening a support ticket with our support team - we will need to know a lot of things, like all the logs from your nworks deployment.
Also, I've created a script for SCOM 2007 (let me know if you need one for SCOM 2012) which displays the cluster and VM objects which are orphaned. I've attached it to the post. The script can help you determine which objects are causing the issues with the Event Logs or WMI.
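If you'd rather roll your own check, the idea is simply to find nworks objects whose managing Health Service is the RMS. A minimal sketch (not the attached script), again assuming a class name which you should replace with the real one - clusters work the same way:

    # List VM objects managed by the RMS, i.e. the orphaned "ghost" objects.
    $rms     = Get-RootManagementServer
    $vmClass = Get-MonitoringClass -Name "nworks.VMware.VirtualMachine"   # assumed class name

    Get-MonitoringObject -MonitoringClass $vmClass | ForEach-Object {
        $vm = $_
        $managers = $vm.GetMonitoringRelationshipObjectsWhereTarget() |
            Where-Object { $_.GetMonitoringRelationshipClass().Name -eq "Microsoft.SystemCenter.HealthServiceShouldManageEntity" }
        if ($managers | Where-Object { $_.SourceMonitoringObject.DisplayName -eq $rms.PrincipalName }) {
            $vm | Select-Object DisplayName, Id
        }
    }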
Let me know if you have any other questions.
P.S. Of course, we are going to re-factor the MP architecture to avoid situations like this in our future versions.