Effective Monitoring and Alerting is one ops engineer's perspective on how to deal with everyday uncertainty.
Work in ops is interrupt-driven: it requires the uncanny ability to context-switch at extremely high rates. There is never enough time, yet there seems to be enough work regardless of the size of your team. Ops can be very frustrating, but if done right it gives a great sense of accomplishment, and it's a good, probably the best, place for career growth.
Monitoring distributed computer systems is not a trivial task; learning it takes time, and many of us make the same mistakes along the way. I believe it needn't be so, and with this book I'm attempting to save my readers some precious time by sharing some universal truths that should apply to their environment regardless of its architecture or the type of monitoring system in use.
- Learn how to draw the right conclusions from system metrics,
- Develop a robust alerting configuration that can identify problematic anomalies—without raising false alarms,
- Address system failures by their impact,
- Plan an alerting configuration that scales with your expanding system,
- Choose appropriate maintenance times automatically,
- Develop a work environment that fosters flexibility and adaptability.
The book is available as a paperback and e-book on O'Reilly.com, Amazon, Safari Books and others. A Kindle Edition is also available.
Thanks to all tweeters and likers for showing support and spreading the word!