Monitor For The Absence Of Success, Not Just the Presence of Failures
Let’s start with a brief story: last year one of my teams built a new system that was in a limited beta. The system depended on receiving messages from an upstream system owned by a different team in an entirely different part of the company. It processed those messages, using them to display something to customers, and that display had a meaningful positive impact on purchase conversion for the small number of merchants using the new feature.
After an intense and successful Black Friday / Cyber Monday period, most of the team was looking forward to taking a well-earned rest around the end of the year. We were confident that our on-call rotations and monitoring would let people mostly unplug and recharge, while knowing that if anything went wrong, we’d know about it.
Unfortunately, when we came back from a week off, we realized that the buyer-facing feature had essentially been disabled for a week and we didn’t know.
The upstream team had gotten nervous about their ability to support the new system and turned it off before they went on break. That could have been a valid business decision, but unfortunately they forgot to tell us.
How did we miss it?
We did have monitoring in place.
We had alerts configured to page us if the job that processed the messages failed. And we had other alerts configured that would page us if there was a backup of messages.
What we didn’t have was an alert that paged us if no messages were sent.
When setting up comprehensive monitoring, you must watch for the absence of success as well as the presence of failure.
Monitoring for the absence of success means monitoring for when something you expected to happen didn’t happen, as opposed to only monitoring for when an unexpected failure did happen.
I’ve found that teams naturally monitor for failures. The first alerts that get set up are generally for spikes in 500s, errors, or failed background jobs. And those are good! But they aren’t sufficient.
The goal of monitoring is to have confidence that your system is working as expected. To do that you need to monitor whether the key outcomes of your system are succeeding.
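To make the distinction concrete, here is a minimal sketch in Python. The function names and thresholds are my own illustration, not from any particular monitoring tool; in a real system the counts would come from your metrics store:

```python
def should_alert_on_failures(failure_count: int, threshold: int = 5) -> bool:
    """Classic failure alert: fires only when errors pile up."""
    return failure_count > threshold

def should_alert_on_absence(success_count: int, expected_minimum: int = 1) -> bool:
    """Absence-of-success alert: fires when too few successes occurred
    in the window -- including when nothing happened at all."""
    return success_count < expected_minimum

# When the upstream system is simply switched off, there are zero
# failures and zero successes:
print(should_alert_on_failures(0))  # stays silent
print(should_alert_on_absence(0))   # pages the on-call
```

Notice that the failure alert can never catch the "switched off" scenario: zero attempts means zero failures, so only the absence-of-success check fires.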
Monitoring At The Right Level of Abstraction
I find that when monitoring for the absence of success, it is important to define success at the right level of abstraction. In the instance above, we could have directly monitored the number of messages successfully processed.
But we might have been better off monitoring at an even higher level of abstraction: the number of times we displayed the buyer-facing feature.
Monitoring at higher levels allows you to catch more issues with a single monitor. For instance, monitoring that the expected number of messages had been processed would have caught the issue where the upstream system stopped sending messages, but it wouldn’t have caught an issue where we failed to display the buyer-facing feature. Monitoring the number of times we displayed the buyer-facing feature is better because it would catch an error in any part of the system, and it monitors the action that directly creates business value.
Many systems have a natural seasonality of usage within a day. If you run a service with an overwhelmingly North American user base, there won’t be a lot of usage at 4am ET / 1am PT.
There are a couple of ways to handle this. The simplest is to configure your monitoring to alert only during times of day when there is generally enough usage to confidently detect aberrations.
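That time-window gating can be sketched as follows. The window boundaries here are made up for illustration; you would tune them to your own traffic curve:

```python
from datetime import datetime, time

# Assumed high-traffic window for a North American audience;
# tune these to your own usage pattern.
ALERT_WINDOW_START = time(8, 0)   # 8am ET
ALERT_WINDOW_END = time(23, 0)    # 11pm ET

def within_alert_window(now: datetime) -> bool:
    return ALERT_WINDOW_START <= now.time() <= ALERT_WINDOW_END

def should_page(success_count: int, now: datetime) -> bool:
    # Overnight, zero successes is expected, so suppress the check.
    return within_alert_window(now) and success_count == 0
```

With this gate in place, zero successes at noon pages someone, while zero successes at 3am does not.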
High Cardinality of Data
High cardinality refers to having a large number of groupings of values in a dataset. In the context of monitoring, it can make detecting specific anomalies more complex due to the sheer variety of potential failures.
For example, at LivingSocial, we negotiated different rates for each of our business lines in various countries, resulting in a unique merchant account for every country and business line. This high cardinality made it challenging to monitor success rates for specific combinations like travel purchases in New Zealand, which were rare and therefore too noisy to detect changes in success rate the same way we did for higher-volume merchant accounts.
One way to manage high cardinality is to set up less granular monitoring that checks for the presence of success over a longer period of time. For instance, after an issue with the configuration at our payment gateway led to a days-long outage of the New Zealand/Travel combination, I set up a monitor to check that there had been at least one successful transaction in the previous 12 hours for that combination. This might be slower to alert than the finer-grained alerting we had in higher-volume countries, but it was effective at detecting anomalies within highly granular data without producing false positives.
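That coarse check amounts to a simple predicate over recent success timestamps. The 12-hour window matches the monitor described above; everything else in this sketch is illustrative:

```python
from datetime import datetime, timedelta

def seen_recent_success(success_timestamps, now, window=timedelta(hours=12)):
    """Healthy if at least one success landed inside the window.

    For low-volume, high-cardinality slices (like a New Zealand/Travel
    merchant account), this trades alerting speed for a near-zero
    false-positive rate."""
    cutoff = now - window
    return any(ts >= cutoff for ts in success_timestamps)

now = datetime(2024, 1, 5, 12, 0)
print(seen_recent_success([datetime(2024, 1, 5, 6, 0)], now))   # success 6h ago: healthy
print(seen_recent_success([datetime(2024, 1, 4, 22, 0)], now))  # 14h ago: alert
```

An empty timestamp list also fails the check, so a combination that has gone completely silent still pages someone.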
Effective monitoring goes beyond merely watching for failures. Remember, it’s not just about what could go wrong; it’s also about ensuring what should go right, does. Your users, your team, and your bottom line will thank you.
More Articles on Software & Product Development
- Agile With a Lowercase “a”
- “Agile” is an adjective. It is not a noun. It isn’t something you do, it is something you are.
- How Do You End Up With A Great Product A Year From Now?
- Nail the next two weeks. 26 times in a row.
- Build it Twice
- Resist the urge to abstract until you've learned what is general to a class of problems and what is specific to each problem.