The situation is this: when using the IP Ping agent to monitor devices, ZSM is erratic and unreliable at reporting whether services are actually up or down. You can generate hundreds of false alerts over a very short timespan, and the problem is easy to reproduce.
The source of the problem is quite laughable. Those who know ManageWise know that it was written and sold long before TCP/IP became prevalent on corporate LANs. The TCP/IP Ping monitoring component, again last updated in 1999, sends pings out in some type of sequence, and - get THIS - expects to receive the replies back in exactly the same sequence. If they come back out of order - something OTHER than first-out-first-in - voilà! False alert!
I wish I were kidding.
It makes sense, sort of, that this problem would have existed. IPX was rarely ever routed across WAN links. Usually, only big IT shops had something like ManageWise in place. They usually only wanted to monitor local services - in fact, in some cases, they only could manage local services. Really big IT shops had network groups that used OpenManage or something along those lines. Using it to monitor a large WAN with slow links was just never considered.
What doesn't make sense is that the feature remained in the product all this time, and that the problem was never again tested for or uncovered.
The solution, as the title of this post implies, requires a "significant architectural change". Which is to say, you evaluate each ICMP reply on its own merits, not on the order in which it arrived back at the server (why would this concept be so foreign?).
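To make the difference concrete, here is a rough sketch in Python. It is not ZSM's code, which I have never seen; the function names and the alert counting are my own assumptions, and it only models the two matching strategies on lists of sequence numbers. The one solid fact it leans on is that an ICMP echo reply carries the same sequence number as the request it answers, so matching on that number is all it takes.

```python
from collections import deque
from typing import Iterable, List


def alerts_strict_order(sent: List[int], received: Iterable[int]) -> int:
    """Hypothetical model of the flawed check: every reply must match the
    OLDEST outstanding request, or an alert is raised."""
    expected = deque(sent)
    alerts = 0
    for seq in received:
        if expected and expected[0] == seq:
            expected.popleft()          # reply arrived in the expected slot
        else:
            alerts += 1                 # out-of-order reply counted as a failure
            if seq in expected:
                expected.remove(seq)    # drop it so it is not counted twice
    return alerts + len(expected)       # requests never answered also alert


def alerts_by_sequence(sent: List[int], received: Iterable[int]) -> int:
    """Judge each reply on its own sequence number; arrival order is ignored."""
    outstanding = set(sent)
    for seq in received:
        outstanding.discard(seq)        # any matching reply clears its request
    return len(outstanding)             # only genuinely unanswered pings alert


if __name__ == "__main__":
    sent = [1, 2, 3, 4]
    reordered = [1, 3, 2, 4]            # every reply arrived, two swapped on a slow link
    print(alerts_strict_order(sent, reordered))
    print(alerts_by_sequence(sent, reordered))
```

With all four replies present but two of them swapped, the strict-order version still raises an alert while the per-sequence version raises none. A real monitor would also need timeouts and per-host bookkeeping, which this sketch leaves out.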
A number of software developers in Bangalore will be working to correct the problem over the next few days - hopefully it'll bear fruit and we will have again facilitated a fix to the ZSM product that nobody else ever found or yelled about loudly enough.
Why hasn't this problem been found in 6 years? Based on what we know, there are but a few rational explanations for the situation as it has evolved:
- Hardly anyone else is using the product - or this portion of it - in production.
- Those that are, might be in single-campus environments where all of the remote links are high-speed.
- Others have tried to use it, unsuccessfully, and gave up - perhaps citing excessive difficulty in its configuration or faulting their own abilities.
- Still others may have found the problem, opened incidents, and got nowhere - again, disheartened, they punted in favor of another solution.
- Nobody tested this in a real-world network (we are also solely responsible for the "Unnumbered Links" fix that is now in ZSM).
- Novell very plainly does not eat its own dog food, despite any claims to the contrary you may hear.
Ask someone at Novell's IS&T if they ever clustered GroupWise on NetWare prior to OES. Ask them if they use ZSM to monitor & manage their network. You'll be surprised how little of what you buy is actually used by the people who make it. That's not the way I would run a business, but hey, that's just me.
It's not unreasonable for us to feel like we're beta-testing the product when we uncover problems like this. Again, if technology companies develop for the worst possible case - which is honestly not that much harder to test around - they are able to accommodate any case. Heaven forbid anyone spends some additional time to get something right.
Apologies in advance to whoever coded this thing way back when, but everyone responsible for this oversight between 1999 and mid-year 2005 should be blackballed from ever developing, testing, or managing a software product ever again. This is so simple and obvious to anyone using the product that overlooking it is tantamount to wanton neglect.