TL;DR On Nov 21 12pm CET to Nov 22 10am CET the blockchain API was partially down. Later that day, when remediating the incident, it was down for another hour.
Summary
Customers found the issue in both cases and we were internally alerted.
Root Cause On Nov 20 we had a partial outage where an RPC provider didn’t handle errors on the HTTP but JSON RPC level. One Postmortem follow up was to add logging to determine which other RPC providers do this so we prevent this issue in the future.
The logic for this ended up being flawed and resulted in an internal WARN
but failed requests even with out responding HTTP.
The root cause for the issue was in the response parsing. After rolling back we rolled out the fix again, this time missing that the wrong content-type
was set on the response breaking clients from properly reading the response.
What could we have done better?
Action items
content-type
of the response