Why you need observability more than tests

Here's a short and sweet story about a Friday deploy. I love Friday deploys.

Here's how it went:

We deployed an update
2 minutes later we saw SQL error messages in our "something's wrong" Slack channel
It was a distributed transaction constraint violation
We couldn't roll back because software only moves forward
5 minutes later we shipped a reverted PR
The errors stopped
An hour later we had the full fix ready to go

We didn't ship that one though because a Friday 4:53pm deploy feels too aggressive even to me. Especially when the systems are working and it's a problem that can wait.

Why tests didn't catch this

It was a distributed systems problem. The code worked locally and in tests. You do operation A, then B, and everything is fine.

But in production sometimes B happens before A and the database goes "lol mate hold on what is this object you're referencing??"
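Here's a minimal sketch of that failure mode. It's not our actual schema: the parent and child tables, the two statements standing in for A and B, and the use of an in-memory SQLite database are all assumptions made for illustration. The point is only that the same two writes succeed in one order and trip a referential constraint in the other.

```python
import sqlite3

# Hypothetical stand-ins for "operation A" and "operation B".
OP_A = "INSERT INTO parent (id) VALUES (1)"
OP_B = "INSERT INTO child (id, parent_id) VALUES (1, 1)"


def apply_in_order(operations):
    """Run the statements against a fresh in-memory database, in the order given."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE child ("
        "id INTEGER PRIMARY KEY, "
        "parent_id INTEGER NOT NULL REFERENCES parent(id))"
    )
    try:
        for sql in operations:
            conn.execute(sql)
        conn.commit()
        return "ok"
    except sqlite3.IntegrityError as exc:
        # The database has no idea what object the child row is referencing.
        return f"IntegrityError: {exc}"
    finally:
        conn.close()


in_order = apply_in_order([OP_A, OP_B])      # what local runs and tests see
out_of_order = apply_in_order([OP_B, OP_A])  # what production sometimes does

print(in_order)
print(out_of_order)
```

A test that always runs A then B only ever exercises the first call, which is why it stays green.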

You could write a test for this, but you might end up with one of those flaky tests that everybody hates. You know the kind – fails every 98th time, nobody knows why, and you all just ignore it. "Oh that test? Yeah that one sucks. Hit rerun and it'll be fine".

In production that 98th time happens to a user every day 😉

And even if you did write the test, you'd never know whether it passes because you accurately captured all the nuance of a live production environment or just because your code behaves more deterministically in a test environment.

How observability did catch it, fast

It's easy. We send all error logs to a central location where they are observed by robots. When errors talk about SQL, we send them to Slack as a warning. If there are lots, we trigger a proper alert that wakes people up.
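The routing rule is roughly this shape. In practice it lives in the log platform's alert configuration, not in application code, so treat this as a sketch of the logic only. The keyword list, the threshold of 5, and the three outcome names are made up for illustration.

```python
# Hypothetical keywords that mark an error message as SQL-related.
SQL_MARKERS = ("sql", "constraint violation", "integrityerror")

# Hypothetical cutoff for "lots" of SQL errors within one time window.
PAGE_THRESHOLD = 5


def route(error_messages, page_threshold=PAGE_THRESHOLD):
    """Decide what to do with the error messages seen in one time window."""
    sql_errors = [
        message
        for message in error_messages
        if any(marker in message.lower() for marker in SQL_MARKERS)
    ]
    if not sql_errors:
        return "ignore"
    if len(sql_errors) >= page_threshold:
        return "page"  # a proper alert that wakes people up
    return "slack_warning"


quiet_window = ["TypeError: x is undefined"]
one_sql_error = ["SQL error: constraint violation on orders"]
noisy_window = one_sql_error * 7

print(route(quiet_window))
print(route(one_sql_error))
print(route(noisy_window))
```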

We're using OpenTelemetry (OTel) integrated into our Python logger. Anyone can hook into this infrastructure with a current_app.logger.debug/info/warning/error call. Default error handling is already instrumented so you don't need to think about it.
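To show why this is low friction, here's a sketch using only Python's standard logging module. The CentralCollector handler is a hypothetical stand-in for whatever exporter ships records to the central store, and the "checkout" logger name is invented. It is not the real OpenTelemetry integration. What it shows is that once a handler is attached, application code only ever calls the normal logger methods.

```python
import logging


class CentralCollector(logging.Handler):
    """Hypothetical stand-in for an exporter that ships log records centrally."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        # Keep a small structured event per record so it is easy to search later.
        self.events.append(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


collector = CentralCollector()

logger = logging.getLogger("checkout")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(collector)
logger.propagate = False  # keep the example's output out of the root logger

# Application code: plain logger calls, no knowledge of where the events go.
logger.info("order %s created", 42)
logger.error("SQL constraint violation on order %s", 42)

for event in collector.events:
    print(event)
```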

Same ability exists on the client side in JavaScript.

The keys to making this useful are:

default instrumentation for defaults
low friction to add new logs, traces, or spans
easy search through all this data (we use Sumo Logic)
anyone can make a self-serve alert to observe their code

Crucially, you don't need to deploy code to make a new alert or dashboard. As long as the events are there, you can start observing anything that you think is causing problems.

And then you can fix 'em :)

Cheers,
~Swizec