Thursday, February 15, 2007

Every project needs a name

In the first two posts, I have loosely referred to this project as a "comprehensive enterprise monitoring solution". Not a great name. So I've thought of a better one. By tying the acronym for Enterprise Monitoring (EM) to a word that describes the human equivalent of system monitoring, we get "Empathy -- How do your systems feel?" OK, the catch phrase could use a little work. This is more than a systems monitoring solution; it has to include applications, services, and devices as well. But it sounded good.

So effective immediately, when I refer to "this project", I'll be calling it the "Empathy Project", or just plain "Empathy".

Change Management

eWeek just published an article regarding Tripwire and how it targets change control. I am a believer that Change Management is an important element in a comprehensive monitoring solution, so let's talk about it.

ITIL defines Change Management as:
"The process of controlling changes to the infrastructure, or any aspect of services, in a controlled manner thus enabling approved changes to be implemented with minimum disruption."

At first glance, this doesn't sound like it has much to do with monitoring, but let me explain why it does. Frequently, when making changes to live systems, we have to take them offline in order to complete the change. A monitoring system, unaware that a change is in progress, will interpret this outage as an alarmable event and initiate notifications up the chain. Depending on how the notification rules are set, that could mean that a corporate executive is going to be awakened in the middle of the night. Any engineer can tell you, it's not a good idea to wake the boss up in the middle of the night unless it's something truly important. If the monitoring system were integrated into the change management process, no alarm would have been raised, and the boss would not have been awakened. The scenario may seem a little far-fetched, but having been a weekend NOC operator for a company which provided remote systems monitoring, I cannot count how often I have had to wake up managers in the wee hours of the morning because a technician performing maintenance forgot to notify the NOC that systems would be going offline.

There are many ways to integrate the change management process with the monitoring solution. There are, however, only two absolute requirements. First, the organization must actually HAVE a change management process. Second, that process must include a mechanism for temporarily disabling monitoring of systems affected by a change. There are some automated mechanisms which could be implemented, saving the engineers the added step of manually adjusting monitoring before and after every change.

The tools used to enact a change can be modified to inform the monitoring system that a change is about to occur or has been completed. As an example, a script used to stop an application can write an entry to a log file indicating that the application is about to stop. The monitoring system, configured to watch this log file, sees this notice and automatically stops monitoring that particular application. The script used to start the application back up can write an entry to the same log file, causing the monitoring system to re-enable monitoring. While this can be an extremely effective mechanism for avoiding false alarms, it requires a lot of changes to be made to the applications and the tools that control them.
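To make the log-file idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the marker names MAINT_START and MAINT_END, the three-field line layout (timestamp, marker, application name), and the function names are hypothetical, not part of any existing tool. The stop and start scripts would append the marker lines; the monitoring side replays them to decide which applications are currently paused.

```python
def suppressed_apps(log_lines):
    """Replay maintenance markers and return the apps whose monitoring is paused.

    Assumed (hypothetical) line format: "<timestamp> <marker> <app>".
    """
    paused = set()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # not a marker line; ignore it
        _timestamp, marker, app = parts
        if marker == "MAINT_START":
            paused.add(app)       # written by the stop script
        elif marker == "MAINT_END":
            paused.discard(app)   # written by the start script
    return paused


def should_alarm(app, log_lines):
    """An outage is alarmable only if the app is not under maintenance."""
    return app not in suppressed_apps(log_lines)


log = [
    "2007-02-15T02:00:00 MAINT_START billing",
    "2007-02-15T02:00:05 MAINT_START mail",
    "2007-02-15T02:30:00 MAINT_END mail",
]
print(sorted(suppressed_apps(log)))
print(should_alarm("mail", log))
```

In this sketch, billing stays paused because no end marker has been written for it yet, while mail is monitored again. That also shows the weak spot of the approach: a start script that fails before writing its marker leaves monitoring disabled.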

If there is a Change Management application in place, the monitoring system can be made to check that application if an event is detected, to identify whether or not the impacted system is experiencing a change. The granularity of this may be questionable, depending on how much detail is included in the Change ticket. This is also an inefficient mechanism, since each event will have to be cross-referenced against Change Management before an alarm can be created.
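A rough sketch of this lookup approach, in Python, might look like the following. The ChangeTicket fields, the ticket ID, and the function names are all hypothetical; a real Change Management application would expose its own data model and query interface. The point is only to show that every event has to be checked against the open change windows before an alarm is raised.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeTicket:
    # Hypothetical fields; real tickets may carry far less detail.
    ticket_id: str
    host: str
    start: datetime
    end: datetime


def covering_ticket(host, when, tickets):
    """Return the first ticket whose window covers this host and time, else None."""
    for ticket in tickets:
        if ticket.host == host and ticket.start <= when <= ticket.end:
            return ticket
    return None


def handle_event(host, when, tickets):
    """Cross-reference an event against Change Management before alarming."""
    ticket = covering_ticket(host, when, tickets)
    if ticket is not None:
        return f"suppressed by {ticket.ticket_id}"
    return "alarm"


tickets = [
    ChangeTicket("CHG-1", "web01",
                 datetime(2007, 2, 15, 2, 0), datetime(2007, 2, 15, 4, 0)),
]
print(handle_event("web01", datetime(2007, 2, 15, 3, 0), tickets))
print(handle_event("web01", datetime(2007, 2, 15, 5, 0), tickets))
```

Note that the ticket here only names a host. If the change touches one application on that host, this sketch would still suppress every event from the machine, which is the granularity problem described above.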

The Change Management system could be made "Monitoring Aware". When a change is scheduled to occur, the Change Management system can instruct the Monitoring system to disable monitoring of the affected systems. Again, the granularity of this is dependent on the level of detail provided in the Change ticket, but a fully integrated solution could require that the person creating the Change ticket select which monitoring elements will be disabled as part of the ticket creation process. This is a highly efficient mechanism for eliminating false alarms. If the monitoring system is told that a change is going to occur and that monitoring should be disabled, it does not have to rely on its own, potentially fallible, detection systems to decide whether an event is alarmable, or whether to disable or re-enable monitoring.
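Here is a minimal Python sketch of the "Monitoring Aware" arrangement, under stated assumptions: the Monitor and ChangeTicket classes, their method names, and the check names are invented for illustration and do not describe any existing product. The ticket refuses to be created without a list of monitoring elements, and it pushes the disable and re-enable instructions to the monitoring system itself.

```python
class Monitor:
    """Hypothetical monitoring system that accepts explicit instructions."""

    def __init__(self, checks):
        self.checks = set(checks)
        self.disabled = set()

    def disable(self, checks):
        unknown = set(checks) - self.checks
        if unknown:
            raise ValueError(f"unknown monitoring elements: {sorted(unknown)}")
        self.disabled |= set(checks)

    def enable(self, checks):
        self.disabled -= set(checks)

    def should_alarm(self, check):
        return check in self.checks and check not in self.disabled


class ChangeTicket:
    """Hypothetical ticket that must name the monitoring elements it affects."""

    def __init__(self, ticket_id, affected_checks):
        if not affected_checks:
            raise ValueError("select at least one monitoring element")
        self.ticket_id = ticket_id
        self.affected_checks = set(affected_checks)

    def begin(self, monitor):
        monitor.disable(self.affected_checks)   # change window opens

    def complete(self, monitor):
        monitor.enable(self.affected_checks)    # change window closes


monitor = Monitor(["web01:http", "web01:disk", "db01:mysql"])
ticket = ChangeTicket("CHG-2", ["web01:http"])
ticket.begin(monitor)
print(monitor.should_alarm("web01:http"), monitor.should_alarm("web01:disk"))
ticket.complete(monitor)
print(monitor.should_alarm("web01:http"))
```

Because the ticket lists individual checks rather than whole hosts, only the HTTP check on web01 goes quiet during the change; the disk check on the same machine keeps alarming as usual.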

Although it would be nice, it is unreasonable to expect that third-party Change Management products are going to make the decision to be "Monitoring Aware", and they are most certainly not going to make any attempt to integrate with the product we are discussing on this blog -- at least not any time in the near future. I propose that a proper Change Management process is an integral function of a comprehensive monitoring solution, and as such, the development of a Change Management system fully integrated with our Monitoring System is within the scope of this project.

Consequently, I'd like to find a Subject Matter Expert in Change Management and Change Management systems who is willing to join our team. As I discussed in my initial post, the Subject Matter Experts will be involved in the design of this monitoring product, and once there is a product to release, the SMEs will be called upon as consultants when we do deployments and support. Please leave me a note in the comments if you're interested in participating in this role. If you just have some input on the thoughts I've laid out above, please leave that in the comments as well. I want the involvement of the entire community throughout the life of this project.

Wednesday, February 14, 2007

An Open Invitation

Greetings and Salutations,

First, let me introduce myself. My name is Jim Phillips. I'm an independent consultant who has occupied many different roles within the Information Technology world. It seems that over the last several years, my choices in contracts have led me down the path towards becoming an expert in enterprise monitoring. I can't claim that I'm THE expert by any stretch, but I think I can hold my own. My background is in systems and applications management. I have written some impressive tools for monitoring and managing systems, but they have been small, single-function tools written in Perl. I don't have the programming experience necessary to create a feature-complete product.

I've started this blog as an information gathering and recruiting tool. My intention is to design and develop the "ultimate" enterprise monitoring system, and then to unleash it upon the world as both an Open Source platform with commercial support and a hosted solution. To do this, and more importantly, to do it right, I need help from the community. I'd like input from engineers at large and small companies on best practices for monitoring. I myself will be writing a series of articles on this blog discussing my experiences in monitoring and what I feel should be included in a comprehensive monitoring solution. I openly invite anyone to either expand upon my ideas or completely pick them apart in the user comments -- I'm not afraid of a little public humiliation if it makes for a higher quality product. I hope to assemble a core team of subject matter experts in areas such as Servers, Networking, Unix, Windows, Applications, Facilities, Security, eCommerce, Web Farms, and a host of others. These experts will be invited to write for this blog and publish articles on monitoring with respect to their areas of specialty.

In addition to engineers helping me to define the requirements for the solution, I'm going to need to enlist the support of developers to bring this dream to fruition. The requirements for the solution haven't yet been defined, so I can't yet say what special skills will be required of these developers, but I hope to be able to put together a core team of skilled professional developers, and also enlist the aid of the open source community to help test and create ancillary tools and plug-ins for the system. I can say there will be a need for user interface designers and web application developers, as I absolutely want the system to employ a standards compliant web interface for the user.

This may seem like a lofty goal for the open source community. As with many open source projects, this one will begin with no funding. I don't want it to stay that way forever. I've already enabled Google AdSense on the blog, and I'll be establishing other channels through which the community can make financial, hardware, and software donations necessary for the development of the project. Once the product reaches a releasable state, we'll start bundling support contracts, and putting our Subject Matter Experts into consulting roles. If everything goes well, what is now a volunteer effort may become a profitable operation with full time employment opportunities for developers, subject matter experts, and support personnel. But the product will always remain Open.

So welcome one and all to the beginning of a dream. Let's work together to make it a reality.


Jim Phillips

PS: Until I've established a better mechanism, please use the comments to contact me if you're interested in participating.