The new stage of system monitoring is better integrated

New tools are raining down on system administrators these days, attacking the “monitoring sucks” theme that was pervasive just a year ago. The new tools–both open source and commercial–may be more flexible and lightweight than earlier ones, as well as more suited for the kaleidoscopic churn of servers in the cloud, making it easier to log events and visualize them.
 Read full article by Andy Oram on O'Reilly.com

A case for presenting timeseries data at the right granularity

In timeseries monitoring, temporal granularity refers to the frequency with which data points are emitted. In other words, individual data inputs are gathered for a certain period of time, then summarized statistically and presented on the time series. This "certain period of time" describes the temporal granularity of the time series. For example, Amazon AWS CloudWatch metrics have their finest temporal granularity at 1 minute.
A time series chart with an Amazon CloudWatch metric displayed with 1 minute granularity
In monitoring computer systems, granularities between 1 minute and 1 hour are prevalent. The more data a system has to process, the finer the granularity needed for precise monitoring. On many occasions, 1-second granularity is necessary to reliably establish causal relationships between system components (for example, to determine what started a series of events).
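To make the idea of granularity concrete, here is a minimal sketch of gathering raw inputs into fixed-width periods and summarizing each one. The function name `summarize`, the choice of summary statistics and the sample values are assumptions made for illustration; they are not taken from CloudWatch or any other monitoring system.

```python
from collections import defaultdict
from statistics import mean


def summarize(points, granularity_s):
    """Group (unix_timestamp, value) pairs into fixed-width buckets
    and compute summary statistics for each bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of the period it falls into.
        buckets[ts - ts % granularity_s].append(value)
    return {
        start: {
            "min": min(vals),
            "avg": mean(vals),
            "max": max(vals),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }


# Hypothetical raw samples taken every 20 seconds,
# summarized at 1-minute granularity.
raw = [(0, 10), (20, 30), (40, 20), (60, 50), (80, 70), (100, 60)]
per_minute = summarize(raw, 60)
```

Calling `summarize(raw, 3600)` on the same inputs would collapse all six samples into a single hourly data point, which is the trade-off between detail and overview discussed here.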

At other times, important trends can be missed precisely because the data are viewed at too fine a temporal granularity.

Suppose you're monitoring a computationally expensive batch job that runs every hour. The success of the computation is non-critical as long as it succeeds eventually, at least once every eight hours. On every successful run, a data point equal to 1 is reported; a failed job does not publish any data.

Viewing data points of the previous week at an hourly granularity looks roughly like this:
It's not immediately obvious, but on closer inspection the computational job was failing for too long twice, between Friday and Saturday. Looking at a time interval of three months makes it even less obvious.
However, aggregating the same 90 days' worth of data at a much coarser 8-hour granularity reveals a clear regressive trend that started around the 24th of June:
In this case the very gradual regression is indicative of hitting capacity and scale limits, and the trend could have been reliably detected a month before it developed into a serious problem.
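The coarser aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the tooling used for the charts: the helper names and the sample data are hypothetical, and fixed 8-hour windows only approximate the "at least once every eight hours" requirement, since a gap can straddle two windows.

```python
from collections import Counter

HOUR = 3600


def successes_per_window(success_timestamps, window_s=8 * HOUR):
    """Count success data points (each equal to 1) per fixed window."""
    return Counter(ts - ts % window_s for ts in success_timestamps)


def windows_without_success(success_timestamps, start, end, window_s=8 * HOUR):
    """Return the start times of windows in [start, end) with no success."""
    counts = successes_per_window(success_timestamps, window_s)
    first = start - start % window_s
    # A Counter returns 0 for windows that received no data points.
    return [w for w in range(first, end, window_s) if counts[w] == 0]


# Hypothetical day: the job succeeds every hour at first, then
# less and less often, and not at all in the last eight hours.
successes = [h * HOUR for h in (0, 1, 2, 3, 4, 5, 6, 7, 9, 12)]
per_window = successes_per_window(successes)
missed = windows_without_success(successes, 0, 24 * HOUR)
```

Plotting the per-window counts over 90 days gives one point per eight hours, so a slow decline in the success rate shows up as a falling line rather than as scattered gaps in an hourly chart.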

It's out!

Okay, so the book is out. Finally!

Effective Monitoring and Alerting is one ops engineer's perspective on how to deal with everyday uncertainty.

Work in ops is interrupt-driven: it requires the uncanny ability to context-switch at extremely high rates. There is never enough time, yet there always seems to be enough work, regardless of the size of your team. Ops can be very frustrating, but if done right it gives a great sense of accomplishment, and it's a good, probably the best, place for career growth.
Monitoring of distributed computer systems is not a trivial task; learning it takes time, and many of us make the same mistakes along the way. I believe it needn't be so, and with this book I'm attempting to save my readers some precious time by sharing some universal truths that should apply to their environment regardless of its architecture or the type of monitoring system in use.

What's inside?
  • Learn how to draw the right conclusions from system metrics,
  • Develop a robust alerting configuration that can identify problematic anomalies—without raising false alarms,
  • Address system failures by their impact,
  • Plan an alerting configuration that scales with your expanding system,
  • Choose appropriate maintenance times automatically,
  • Develop a work environment that fosters flexibility and adaptability.

Where can I get it?

The book is available as paperback and e-book on O'Reilly.com, Amazon, Safari Books and others. For Kindle Edition click here.

Thanks to all tweeters and likers for showing support and spreading the word!