Thursday, February 15, 2007

Change Management

eWeek just published an article regarding Tripwire and how it targets change control. I am a believer that Change Management is an important element in a comprehensive monitoring solution, so let's talk about it.

ITIL defines Change Management as:
"The process of controlling changes to the infrastructure, or any aspect of services, in a controlled manner thus enabling approved changes to be implemented with minimum disruption."

At first glance, this doesn't sound like it has much to do with monitoring, but let me explain why it does. Frequently, when making changes to live systems, we have to take them offline in order to complete the change. A monitoring system, unaware that a change is in progress, will interpret this outage as an alarmable event and initiate notifications up the chain. Depending on how the notification rules are set, that could mean that a corporate executive is going to be awakened in the middle of the night. As any engineer can tell you, it's not a good idea to wake the boss up in the middle of the night unless it's something truly important. If the monitoring system were integrated into the change management process, no alarm would have been raised, and the boss would not have been awakened. The scenario may seem a little far-fetched, but having been a weekend NOC operator for a company which provided remote systems monitoring, I cannot count how often I have had to wake up managers in the wee hours of the morning because a technician performing maintenance forgot to notify the NOC that systems would be going offline.

There are many ways to integrate the change management process with the monitoring solution. There are, however, only two absolute requirements. First, the organization must actually HAVE a change management process. Second, that process must include a mechanism for temporarily disabling monitoring of systems affected by a change. There are some automated mechanisms which could be implemented, saving the engineers the added step of manually adjusting monitoring before and after every change.

The tools used to enact a change can be modified to inform the monitoring system that a change is about to occur or has been completed. As an example, a script used to stop an application can write an entry to a log file indicating that the application is about to stop. The monitoring system, configured to watch this log file, sees this notice and automatically stops monitoring that particular application. The script used to start the application back up can write an entry to the same log file, causing the monitoring system to re-enable monitoring. While this can be an extremely effective mechanism for avoiding false alarms, it requires a lot of changes to be made to the applications and the tools that control them.
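To make the log file idea concrete, here is a minimal sketch in Python. The marker format ("MAINT STOP app" and "MAINT START app") and the function names are my own assumptions for illustration, not part of any existing tool. The stop and start scripts would append these lines, and the monitoring side would read them back to decide whether an outage is alarmable.

```python
import re

# Hypothetical marker format written by the stop/start scripts, e.g.
# "MAINT STOP billing-app" and "MAINT START billing-app".
MARKER = re.compile(r"^MAINT (STOP|START) (\S+)$")


def suppressed_apps(log_lines):
    """Return the set of applications whose monitoring is currently disabled."""
    suppressed = set()
    for line in log_lines:
        match = MARKER.match(line.strip())
        if match is None:
            continue  # ordinary log line, not a maintenance marker
        action, app = match.groups()
        if action == "STOP":
            suppressed.add(app)      # stop script ran: disable monitoring
        else:
            suppressed.discard(app)  # start script ran: re-enable monitoring
    return suppressed


def should_alarm(app, log_lines):
    """An outage is alarmable only if no maintenance marker is open for the app."""
    return app not in suppressed_apps(log_lines)
```

A real monitor would tail the file rather than re-read it on every check, but the bookkeeping is the same: the most recent marker for each application wins.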

If there is a Change Management application in place, the monitoring system can be made to check that application if an event is detected, to identify whether or not the impacted system is experiencing a change. The granularity of this may be questionable, depending on how much detail is included in the Change ticket. This is also an inefficient mechanism, since each event will have to be cross-referenced against Change Management before an alarm can be created.
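As a rough sketch of this lookup approach, assume the Change Management application can hand us its open tickets as simple records with a host and a scheduled window. The ChangeTicket type and the function names below are hypothetical; a real product would expose its own API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ChangeTicket:
    """Hypothetical shape of a change ticket: one host, one scheduled window."""
    host: str
    start: datetime
    end: datetime


def is_under_change(host, event_time, tickets):
    """True if any ticket covers this host at the time the event occurred."""
    return any(
        t.host == host and t.start <= event_time <= t.end
        for t in tickets
    )


def handle_event(host, event_time, tickets):
    """Cross-reference every event against the tickets before alarming."""
    if is_under_change(host, event_time, tickets):
        return "suppressed"
    return "alarm"
```

Note that the ticket here only names a host. If the ticket carried less detail than that, the check would either suppress too much or too little, which is the granularity problem described above. The scan over all tickets on every event is also where the inefficiency comes from.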

The Change Management system could be made "Monitoring Aware". When a change is scheduled to occur, the Change Management system can instruct the Monitoring system to disable monitoring of the affected systems. Again, the granularity of this is dependent on the level of detail provided in the Change ticket, but a fully integrated solution could require that the person creating the Change ticket select which monitoring elements will be disabled as part of the ticket creation process. This is also a highly efficient mechanism for eliminating false alarms. If the monitoring system is told that a change is going to occur and that monitoring should be disabled, it does not have to rely on its own, potentially fallible, detection systems to decide whether an event is alarmable, or when to disable and re-enable monitoring.
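Here is a minimal sketch of the "Monitoring Aware" direction, where ticket creation pushes suppression windows into the monitoring system. The Monitor class, its method names, and create_change_ticket are all assumptions made up for illustration; nothing here reflects an existing product.

```python
from datetime import datetime


class Monitor:
    """Hypothetical monitoring system that accepts suppression windows."""

    def __init__(self):
        self._windows = {}  # element name -> list of (start, end) tuples

    def schedule_suppression(self, element, start, end):
        """Called by the Change Management system when a ticket is created."""
        if end <= start:
            raise ValueError("suppression window must end after it starts")
        self._windows.setdefault(element, []).append((start, end))

    def is_monitored(self, element, now):
        """Monitoring is active unless 'now' falls inside a scheduled window."""
        return not any(
            start <= now < end
            for start, end in self._windows.get(element, [])
        )


def create_change_ticket(monitor, elements, start, end):
    """Ticket creation requires selecting the monitoring elements to disable."""
    for element in elements:
        monitor.schedule_suppression(element, start, end)
    return {"elements": list(elements), "start": start, "end": end}
```

Because the window has an end time, monitoring comes back on by itself when the change is scheduled to finish, so nobody has to remember to re-enable it.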

Although it would be nice, it is unreasonable to expect that third-party Change Management products are going to make the decision to be "Monitoring Aware", and they are most certainly not going to make any attempt to integrate with the product we are discussing on this blog -- at least not any time in the near future. I propose that a proper Change Management process is an integral function of a comprehensive monitoring solution, and as such, the development of a Change Management system fully integrated with our Monitoring System is within the scope of this project.

Consequently, I'd like to find a Subject Matter Expert in Change Management and Change Management systems who is willing to join our team. As I discussed in my initial post, the Subject Matter Experts will be involved in the design of this monitoring product, and once there is a product to release, the SMEs will be called upon as consultants when we do deployments and support. Please leave me a note in the comments if you're interested in participating in this role. If you just have some input on the thoughts I've laid out above, please leave that in the comments as well. I want the involvement of the entire community throughout the life of this project.
