Why you need observability more than tests
Here's a short and sweet story about a Friday deploy. I love Friday deploys.
Here's how it went:
- We deployed an update
- 2 minutes later we saw SQL error messages in our "something's wrong" Slack channel
- It was a distributed transaction constraint violation
- We couldn't roll back because software only moves forward
- 5 minutes later we shipped a revert PR
- The errors stopped
- An hour later we had the full fix ready to go
We didn't ship that one though because a Friday 4:53pm deploy feels too aggressive even to me. Especially when the systems are working and it's a problem that can wait.
Why tests didn't catch this
It was a distributed systems problem. The code worked locally and in tests. You do operation A, then B, and everything is fine.
But in production sometimes B happens before A and the database goes "lol mate hold on what is this object you're referencing??"
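To make the ordering problem concrete, here is a minimal sketch using Python's built-in sqlite3. It assumes the violation was a foreign key constraint, which is my reading of "what is this object you're referencing" and not something confirmed above. The `parent` and `child` tables are hypothetical, and a single process that swaps the order of two inserts stands in for the real distributed race.

```python
import sqlite3


def run(order):
    """Apply operations A and B in the given order against a fresh database."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    db.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
    db.execute(
        "CREATE TABLE child (id INTEGER PRIMARY KEY, "
        "parent_id INTEGER NOT NULL REFERENCES parent(id))"
    )
    ops = {
        "A": "INSERT INTO parent (id) VALUES (1)",  # create the object
        "B": "INSERT INTO child (id, parent_id) VALUES (1, 1)",  # reference it
    }
    try:
        for name in order:
            db.execute(ops[name])
        return "ok"
    except sqlite3.IntegrityError as err:
        return f"constraint violation: {err}"
    finally:
        db.close()


in_order = run("AB")      # what local runs and tests see
out_of_order = run("BA")  # what production sometimes sees
print(in_order)
print(out_of_order)
```

The same two statements succeed or fail purely based on arrival order, which is exactly what a deterministic test run tends to hide.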
You could write a test for this, but you might end up with one of those flaky tests that everybody hates. You know the kind – fails every 98th time, nobody knows why, and you all just ignore it. "Oh that test? Yeah that one sucks. Hit rerun and it'll be fine".
In production that 98th time happens to a user every day 😉
And even if you did write the test, you'd never know whether it passes because your code behaves more deterministically in a test environment or because you accurately captured all the nuance of a live production environment.
How observability did catch it, fast
It's easy. We send all error logs to a central location where they are observed by robots. When errors talk about SQL, we send them to Slack as a warning. If there are lots, we trigger a proper alert that wakes people up.
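The routing rule is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not our actual setup: the `ErrorRouter` class, the "sql" pattern, the threshold of 3, and the 60 second window are all made-up values, and in practice this logic lives in the log platform, not in application code.

```python
import re
from collections import deque

# Hypothetical matcher: any error message that mentions SQL.
SQL_PATTERN = re.compile(r"sql", re.IGNORECASE)


class ErrorRouter:
    """Decide what to do with each error log line (illustrative only)."""

    def __init__(self, page_threshold=3, window_seconds=60):
        self.page_threshold = page_threshold
        self.window_seconds = window_seconds
        self._recent = deque()  # timestamps of recent SQL errors

    def route(self, message, timestamp):
        """Return 'log_only', 'slack_warning', or 'page' for one error."""
        if not SQL_PATTERN.search(message):
            return "log_only"
        self._recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while timestamp - self._recent[0] > self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.page_threshold:
            return "page"  # lots of SQL errors: wake someone up
        return "slack_warning"


router = ErrorRouter(page_threshold=3, window_seconds=60)
burst = [router.route("SQL constraint violation", t) for t in (0, 10, 20)]
other = router.route("TypeError: x is undefined", 25)
later = router.route("SQL timeout", 200)
print(burst, other, later)
```

A single SQL error is a nudge in Slack. A burst inside the window escalates to a page. Once things quiet down, the next error drops back to a warning.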
We're using OTEL integrated into our Python logger. Anyone can hook into this infrastructure with a current_app.logger.debug/info/warning/error call. Default error handling is already instrumented so you don't need to think about it.
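Here is a rough sketch of why hooking in is low friction, using only Python's standard logging module. The `InMemoryExporter` handler is a hypothetical stand-in for whatever ships records to the OTEL pipeline, and the "checkout" logger and `order_id` field are invented for the example. Flask's `current_app.logger` is a standard `logging.Logger`, so the calls look the same there.

```python
import logging


class InMemoryExporter(logging.Handler):
    """Hypothetical stand-in for an OTEL log exporter: keeps events in a list."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        # Turn each log record into a structured, searchable event.
        self.events.append({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "order_id": getattr(record, "order_id", None),  # set via extra=
        })


exporter = InMemoryExporter()
logger = logging.getLogger("checkout")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(exporter)
logger.propagate = False  # keep the example output quiet

# Application code only ever makes ordinary logger calls.
logger.info("order created", extra={"order_id": 42})
logger.error("insert failed: %s", "constraint violation", extra={"order_id": 42})

for event in exporter.events:
    print(event)
```

The handler is wired up once. After that, every plain logger call becomes an event you can search and alert on.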
Same ability exists on the client side in JavaScript.
Key to making this useful is:
- default instrumentation for default behavior, like error handling
- low friction to add new logs, traces, or spans
- easy search through all this data (we use Sumo Logic)
- anyone can make a self-serve alert to observe their code
Crucially, you don't need to deploy code to make a new alert or dashboard. As long as the events are there, you can start observing anything that you think is causing problems.
And then you can fix 'em :)
Cheers,
~Swizec