TL;DR On Nov 20, 2023, from 1pm to 3pm CET, our ENS resolver returned a static name.
Summary
Customers discovered the issue, after which an operator initiated a swift rollback.
Root Cause A bug was deployed to production.
5 Whys
Why was a bug deployed to production?
We occasionally make mistakes. We know this and hence try to catch them in lower stages.
Why was the bug not discovered in lower stages?
The name service feature was not covered in integration tests.
Why was the feature not covered in integration tests?
We never prioritized it.
Why was this never prioritized?
The trade-off between the effort and the fun of writing the tests didn’t immediately check out.
Why did it not check out?
The integration test suite is difficult to maintain.
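To illustrate the kind of check that was missing, here is a minimal sketch of a test that would flag a resolver returning one static name. It assumes the affected path is reverse resolution (address to name). The resolver stand-ins, `check_resolver`, and the fixture addresses and names are all hypothetical placeholders, not our actual service or test suite.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical fixtures: placeholder addresses mapped to the names we expect.
FIXTURES: Dict[str, str] = {"0x01": "alice.eth", "0x02": "bob.eth"}


def make_lookup_resolver(records: Dict[str, str]) -> Callable[[str], str]:
    """Stand-in for a healthy resolver: looks up each address individually."""
    def resolve(address: str) -> str:
        return records[address]
    return resolve


def make_static_resolver(name: str) -> Callable[[str], str]:
    """Stand-in for the faulty behaviour: the same name for every address."""
    def resolve(address: str) -> str:
        return name
    return resolve


def check_resolver(
    resolve: Callable[[str], str], expected: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (address, expected, actual) for every lookup that does not match."""
    mismatches = []
    for address, want in expected.items():
        got = resolve(address)
        if got != want:
            mismatches.append((address, want, got))
    return mismatches


if __name__ == "__main__":
    # A healthy resolver produces no mismatches.
    assert check_resolver(make_lookup_resolver(FIXTURES), FIXTURES) == []
    # A static resolver is caught as soon as two distinct names are expected.
    assert check_resolver(make_static_resolver("alice.eth"), FIXTURES) != []
```

Using at least two fixtures with different expected names is the point of the sketch: a single fixture could pass by coincidence if the static name happened to match it.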
What could we have done better?
Action items