Business processes monitoring system
The system monitors various metrics, starting from low-level metrics, such as CPU usage or network traffic and ending on high-level business metrics.
- Monitoring devices, application, business processes
- Notifying the staff
- Web-interface for configuration
- Web-interface for operators
- Graphing current values and trends
- CentOS 5.5
- Zenoss 2.5
Interview with the developer
Anatol Filin: Hello, Denis! I wanted to ask you a few questions about one of the projects we finished a while back. A lot of time has passed, and for a long time I’ve been wanting to get an idea of what exactly we did, what this project was, and what we weren’t able to accomplish. On our site, we’ve written half a page on it, but I don’t exactly understand what was done. I don’t want this to just be clear to me, but to everyone who visits our site. Maybe when people see what we’re capable of, they will be interested in ordering similar projects.
This is about the monitoring system for Taxcom, a fairly well-known company. This company is known for receiving tax returns and working with a large number of taxpayers. This is one of the biggest companies that perform electronic tax reporting, and we made their monitoring system. To get things started, in two words: what did the project consist of, how long did it last, what was the general outline of the project, and could you describe the project as a whole?
Denis Eldandi: The project consisted of the creation of a monitoring system. It lasted around one and a half to two months.
Anatol: What’s the infrastructure of the company? What does monitoring generally require?
Denis: The infrastructure of the company consists of several dozen servers and several units of networking equipment, all stored in three data-centers. We had to monitor the servers’ status, the network equipment’s status, all of their availability, and some of the details of their applications.
Anatol: A few dozen, let’s say around 50? Is that about right?
Anatol: These servers are geographically distributed, right?
Anatol: Was the infrastructure unix-based, windows-based, both or a hybrid?
Denis: A hybrid.
Anatol: So there was a unix and windows-based infrastructure?
Anatol: Okay, and if we were to look at this from the other side, who uses your final product?
Denis: Taxcom’s technical management uses our system. At that time, they had just started forming a team of operators. By management, I mean the team managers: the manager of the server administration team and the manager of the development team.
Anatol: So they are the users in some sense? They log in to the system and get some kind of messages? What does that mean?
Denis: They receive messages via e-mail, log in to the system and watch the web interface. That was then, but I think they handed these responsibilities over to their employees later on.
Anatol: I more or less understand the term “monitoring system” when we talk about infrastructures; for example, in situations like disk failure, when something burns out, there’s some event with the equipment and you need to quickly do something, delete some files, or there’s network equipment failure. But here, the system is called a business-process monitoring system, which means it doesn’t monitor equipment, but higher-level things?
Denis: First of all, the system monitors business-performance indicators, but equipment is also monitored. Strictly speaking, it would be incorrect to say that the monitoring system keeps track of an event such as a hard disk failing. There are no events initially; the monitoring system simply polls the equipment at regular intervals and stores the information. Polling takes place once every one to five minutes. After each poll, the system compares the values against a criterion and decides whether or not there was an error. If there was, it creates an event, and a response is carried out for that event. It’s the same with business processes: there is a set of parameters that can be collected, a set of criteria, and a set of rules by which events are created from those criteria.
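The poll, compare, and create-an-event cycle Denis describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Check`, `Event` and `poll_once` are hypothetical and are not part of Zenoss, and the criterion is reduced to a simple upper limit.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """One monitored parameter (hypothetical structure, not a Zenoss class)."""
    name: str
    collect: Callable[[], float]  # returns the current numeric value
    max_value: float              # criterion: the value must not exceed this


@dataclass
class Event:
    """Created only when a collected value violates its criterion."""
    check: str
    value: float
    message: str


def poll_once(checks: List[Check]) -> List[Event]:
    """Collect every value once and return events for the violated criteria."""
    events = []
    for check in checks:
        value = check.collect()
        if value > check.max_value:
            events.append(Event(
                check.name,
                value,
                f"{check.name}={value} exceeds limit {check.max_value}",
            ))
    return events


# Example poll: one value over its limit, one within it.
checks = [
    Check("disk_used_percent", lambda: 97.0, 90.0),
    Check("cpu_load", lambda: 0.4, 4.0),
]
events = poll_once(checks)
```

In a real deployment `poll_once` would run on a schedule (every one to five minutes, as mentioned above), and each event would then be matched against notification rules.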
Anatol: Can you give me an example?
Denis: For example, the processing time of a document through the workflow cycle. This application has an http interface (a webpage, actually), which shows the average processing time for a document for the past five minutes, for example. The system initially applies a timestamp at the beginning of the process and looks at that timestamp at the end of the process. As a result, the processing time is known. A small counter stores the average time.
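A counter like the one described, which keeps the average processing time over the past five minutes, might look roughly like the sketch below. The class name `RollingAverage` and the use of plain seconds for timestamps are assumptions made for the example; the source does not say how Taxcom's counter was implemented. The sketch also assumes that documents are recorded in the order in which they finish.

```python
from collections import deque


class RollingAverage:
    """Average document processing time over a sliding time window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._samples = deque()  # (finished_at, duration) pairs, oldest first

    def record(self, started_at: float, finished_at: float) -> None:
        """Store one document: timestamp at the start, timestamp at the end."""
        self._samples.append((finished_at, finished_at - started_at))

    def average(self, now: float) -> float:
        """Drop samples older than the window and average the rest."""
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        if not self._samples:
            return 0.0
        return sum(d for _, d in self._samples) / len(self._samples)


counter = RollingAverage(window_seconds=300.0)
counter.record(started_at=0.0, finished_at=60.0)     # took 60 seconds
counter.record(started_at=100.0, finished_at=220.0)  # took 120 seconds
current = counter.average(now=250.0)                 # both samples in window
```

The value returned by `average` is the single number that the application would expose on its monitoring page.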
Anatol: So, the standard average processing time is, let’s say, 5 minutes; what if it’s 20 minutes? Does that mean something in the system broke down and we have to figure it out, find which link in the chain has slowed down, broken, etc?
Denis: In general, this is usually looked at from the other side. There is a time limit, no more than 20 minutes for example, and if the limit is exceeded, then some measures are taken and there is a “debriefing”. A debriefing is facilitated by the collection of additional parameters, which are not part of the regulations; for example, the document processing time in the third stage. If it becomes clear during the debriefing that the document processing time in the third stage has increased, then you most likely need to “dig” towards the third stage.
Anatol: Interesting. However, probably in several instances, an integral parameter like document processing time through the whole system includes an enormous amount of smaller figures. If the document is processed slower, then it becomes obvious that something broke down or will soon break down somewhere in the system. So, if you previously had to follow hundreds of minor factors, such as drives, memory and the network, now there is a single simple index and more intuitive management. Am I understanding this correctly?
Denis: It seems so. But in practice, all high-level indicators are fairly noisy, and if we have a 20-minute time limit, the value fluctuates anywhere from zero to nineteen minutes. The system is simply made so that no parameter is cut out, and furthermore, nobody cares. You want to ask me whether or not we can watch the “main” parameter’s trends? In theory yes, but in practice there is little point to it. There are other parameters that are less noisy, smoother and better suited to watching trends, and those are what everyone watches.
Anatol: Could you give me an example?
Denis: The number of database requests for example, that’s a good criterion. If you have a lot of traffic, then it will be smooth simply according to the law of large numbers.
Anatol: Now I don’t understand. How can this parameter indicate some kind of problem?
Denis: It doesn’t indicate a problem; it just makes you wonder, “What did we do yesterday that made it grow today?”
Anatol: So there’s a figure that can only jump if there are changes, like a new release for example?
Denis: Yes, or if there’s a breakdown. For example, we have a video project. If the number of requests to the project’s database increases, that means something happened with our system cache.
Anatol: I see. That’s what I’m interested in. There’s a big enough business related to technology and, odds are, the Internet. It earns money, payments are made and statistics are made of the data. Let’s say this business is an on-line store, a banner service or video service. It’s clear that there are employees who monitor the financial and business factors for the good of the system and company as a whole. The monitoring for this kind of company would involve higher-level processes. For example, we know that the company makes some thousands of dollars every day which haven’t been registered as payments, but are still somehow earned. This is done either through commission, which is only displayed at the end of the month through invoices or clicks and only if the site earns money from the number of advertisements displayed or clicks. Moreover, in many cases, management finds out about the financial results once a month. Can the monitoring system we’re talking about be integrated with business processes so as to allow managers to receive data on the status of the company on a daily basis (or even in real time)?
Denis: Obviously it can, but again, there is no intelligent logic in the monitoring system that can draw daily financial figures “from nothing”. Serious support is required from the company’s business system in order to give the monitoring system some kind of indicators; for example, “How much did we earn today?” or “If we worked the rest of the month like we worked today, how much would we earn in a month?” Nevertheless, it is possible to make short-term business projections based on current data. Why not?
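The second indicator, "if we worked the rest of the month like we worked today", is simple arithmetic once the business system reports the earnings. A minimal sketch, assuming the business system can supply the amount earned before today and the amount earned today (the function name and its parameters are hypothetical):

```python
import calendar
from datetime import date


def project_month_revenue(earned_before_today: float,
                          earned_today: float,
                          today: date) -> float:
    """Project the month's total, assuming every remaining day equals today."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    remaining_days = days_in_month - today.day
    return earned_before_today + earned_today + earned_today * remaining_days


# 9,000 earned on April 1-9, 1,000 earned on April 10, 20 days still to go.
projection = project_month_revenue(9000.0, 1000.0, date(2011, 4, 10))
```

The monitoring system would then store and graph this projection like any other numeric metric.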
Anatol: I want to ask you something else. Let’s say the system is monitoring some basic stuff, the network or something, and say there’s a failure somewhere – once, twice, three times… Actually, it is all current stuff, and management is hardly interested in it. If I were a manager, and today I see a 30% drop in the number of clicks compared to the monthly average, then I would definitely react to this. That is to say, as a leader, I am more interested in a deviation from the “norm” than how much we earned today. Can the technology we’re talking about keep track of these things?
Denis: Yes, of course.
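A check for the kind of deviation Anatol describes, a 30% drop against the monthly average, could be sketched as follows. The function names and the idea of passing the last month's daily values as a list are assumptions for illustration; `history` must not be empty.

```python
from statistics import mean
from typing import Sequence


def relative_deviation(today_value: float, history: Sequence[float]) -> float:
    """Deviation of today's value from the historical average, as a fraction."""
    baseline = mean(history)
    if baseline == 0:
        return 0.0  # no meaningful baseline to compare against
    return (today_value - baseline) / baseline


def is_abnormal_drop(today_value: float,
                     history: Sequence[float],
                     max_drop: float = 0.30) -> bool:
    """True when today's value fell by at least max_drop below the average."""
    return relative_deviation(today_value, history) <= -max_drop


daily_clicks = [1000.0] * 30                      # last 30 days, 1000 a day
alarm = is_abnormal_drop(650.0, daily_clicks)     # 35% below average
no_alarm = is_abnormal_drop(900.0, daily_clicks)  # 10% below average
```

Comparing against the average rather than a fixed limit means the criterion follows the business as it grows or shrinks.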
Anatol: Okay, I would still like to understand how this is done. That is to say, was the system written from scratch, were any platforms used when writing it, and in general, what work was done in the Taxcom project?
Denis: We chose and configured the platform, wrote data collection modules for it, and modified and refined the existing application and business system so that they could return these parameters.
Anatol: Tell me about the platform, please, which is used in writing the system. What does it consist of?
Denis: The Taxcom system was built on Zenoss. Zenoss is an open source platform that could do whatever Taxcom needed at the time. Modules were added on to this platform.
Anatol: Were there any licensing costs for the platform?
Denis: No, there were no licensing costs.
Anatol: Are there any restrictions on which operating systems the platform can run on?
Denis: As far as I know, the system only works on Linux, but I’m not sure. It may also work on Windows, but frankly, I doubt it. It probably works on FreeBSD.
Anatol: Could you please explain what this platform allows you to do?
Denis: It allows you to collect data and store their history in its database to draw graphs. It allows you to configure data criteria as well as the rules for notifying employees and operators by e-mail. Of course there are some drawbacks, but at that time, they were not show-stoppers for Taxcom. The company came to the conclusion that the platform struck a good balance between its shortcomings and the costs of implementation and support.
Anatol: You used the word “graph”. Despite the fact that users receive notifications, alarms, text messages, e-mails, etc., people mainly do still look at the graphs?
Denis: Yes, people still look at the graphs. All of this system’s indicators are numerical; even those which indicate “working” and “not working” are encoded as a “0” and “1”. The history of these indicators can be viewed on graphs.
Anatol: Okay, so do I understand correctly that the monitoring system runs on the Zenoss platform, together with scripts and application plug-ins? So there are scripts that monitor and interact with equipment, and applications that need components installed in them which understand the Zenoss protocol?
Denis: Yes, this platform has a very simple API for data collection scripts; you can write them in any language. When needed, these scripts turn to the applications for data (depending on how they are written). In the simplest case, they request an HTML page from a web application. Each address returns one number: http://application.server.com/monitoring/http_hits/, for example. Request it over HTTP and you get a plain number, 18 for example.
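A data collection script of the simplest kind described here could look like the sketch below: it fetches a page that contains a single number and prints it on one line. The output line follows the "status | name=value" style used by Nagios plugins; whether a particular Zenoss setup expects exactly this format is an assumption, as are the function names. The URL is the example from the interview, not a real endpoint.

```python
from urllib.request import urlopen


def parse_metric(body: str) -> float:
    """The monitoring page is expected to contain just one number."""
    return float(body.strip())


def format_output(name: str, value: float) -> str:
    """One line of output: a status word, then the named value."""
    return f"OK | {name}={value}"


def collect(url: str, name: str, timeout: float = 10.0) -> str:
    """Fetch the page over HTTP and turn its content into an output line."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return format_output(name, parse_metric(body))


# Hypothetical usage, given the example address from the interview:
#   print(collect("http://application.server.com/monitoring/http_hits/",
#                 "http_hits"))
```

Keeping parsing and formatting separate from the network call makes the script easy to check without a running application.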
Anatol: I understand. I heard a familiar word—http. The interaction between the platform and data collecting scripts is carried out on the http?
Denis: Not necessarily. I gave an example of an interaction that can be performed simply.
Anatol: How else can they be performed?
Denis: There are many different options. For example, they can be performed via SNMP or SQL. Whatever is convenient.
Anatol: As far as I know, we put a monitoring system on all the systems that we create and maintain. The question is: which platforms do we use on our systems, is Zenoss everywhere?
Denis: No, Zenoss isn’t everywhere. We use Zenoss for video hosting, but use Nagios for the Shopium project, for example. Each platform has its advantages and disadvantages, which is why one is convenient in one place and the other is convenient somewhere else.
Anatol: Could you tell us anything else interesting either about the project or in general about monitoring systems?
Denis: I’ll say this: on the market now, there are no monitoring systems that combine all the advantages of Zenoss and Nagios. If there had been a bigger order at our disposal, then we could have made a really good system that we could have sold.
Anatol: Interesting, but what exactly could be sold?
Denis: The installation.
Anatol: Do you mean to make the platform and sell it?
Denis: The platform is, of course, an overstatement. Major customers who are interested in having a monitoring system in their office don’t need a platform. It’s all the same to them whether there’s a platform or not. What’s important to them is being able to communicate with the system, and how much it costs.
Anatol: Okay. And how significant were the shortcomings of open and free systems? Enough so that customers wanted something different?
Denis: All of these systems are extremely complicated. On one hand, they are designed for excessive flexibility, on the other, they force a specific pattern of use. Using the Zenoss platform, you need to put up with the fact that the result of any metric, if it’s in graph form, is numerical. Of course you can put up with this, but it’s still uncomfortable to view these graphs.
Anatol: When for example? Could you give an example of a more natural figure display?
Denis: The efficiency criteria for example.
Anatol: But the figures in any case, whether numerical or binary (which, as a matter of fact, is also numerical), are “0” and “1”.
Denis: It is just not very comfortable to examine “0”s and “1”s on a graph that depicts numbers.
Anatol: Tell me, are these systems extensible? Could something be added to them, like plug-ins or something else?
Denis: It’s possible to add something, but it can be problematic since Zenoss, for example, was written on an extremely poor Python framework.
Anatol: This is a general question: how useful are monitoring systems?
Denis: They’re quite useful. Besides the obvious, which I’ve already mentioned, reacting quickly when something breaks, and the less obvious, reacting before something breaks, they’re useful for debriefing. These systems allow you to quickly find out where something broke and how to fix it. A common and very good practice is to constantly add new monitoring parameters after recovering from a crash. As a rule, once an error has been fixed, there is new knowledge such as “if we had been following this, we would have been able to predict this kind of problem.” And that is added to the system.
Anatol: And the system stabilizes…
Denis: Stabilizes is an overstatement. The system becomes a bit more predictable. This, naturally, is an iterative process.
Anatol: The monitoring system lets you use accumulated knowledge and, more so, doesn’t repeat errors… It doesn’t make the same mistake twice?
Anatol: ValueCommerce also used a monitoring system. After it was installed and was up and running, we had operators on duty around the clock; they followed events, watched the monitors, etc. Is this the typical use of monitoring systems, or does it basically not need operators present? What does that depend on?
Denis: Operators come into a very nice symbiotic relationship with monitoring systems. They actually do much more than just watch the monitor and call the engineers responsible. They can analyze events that have occurred and make certain decisions. They can continue to add or change notification regulations or existing metric parameters, for example.
What’s interesting is that they come to that knowledge independently. For example, let’s say they see a system element break every day. If they call the engineer responsible, he’ll say to them, “No, everything’s fine.” As a rule, if you wake an engineer up in the middle of the night (and a system error may occur at any time), he won’t especially try to change the system that day so that he isn’t woken up the next night. It’s some kind of strange human factor. Operators, though, can enter a bug in the runbook, and then apply for a change in the criteria so that at three o’clock, the error no longer appears. Typically, these time-based criteria are detrimental and force the engineer to think, “Why is the parameter rejected at three o’clock in the morning? Something of ours probably isn’t working right”.
Anatol: So, the presence of an operator is useful in some cases, but not always necessary, and it depends, to some extent, on the size of the infrastructure, the size of the business, etc? Does it seem that operator presence also depends on the risks associated with a particular system’s operations?
Denis: Yes. For example, if, from the point of view of a particular business, the fact that the system breaks down at night is not a big deal, then you don’t need around the clock operators.
Anatol: Nevertheless, will notifications like alarms and text messages go out to the engineer responsible if something breaks?
Denis: Again, I’ll say this. A lot of people suggest: “Let’s send out a text message to the engineers.” At the same time, the engineers are people, too. They can be drunk or their phones might be turned off. As a result, they won’t get the text message. So then they suggest: “Let’s send a text message to one engineer first, and if he doesn’t get it, we’ll send it to the next, and then their manager.” It sounds more or less like a working scheme, but in practice, it just leads to people turning off their phones at night.
Anatol: Does Taxcom employ around the clock operators?
Denis: Yes. Actually, Taxcom doesn’t use operators; it has on-duty engineers that are in the office at night.
Anatol: Thank you for the interview, Denis!
Denis: Yes, any time, I’m happy to help.