Partial outage of Blockchain API (RPC)

Incident Report for Reown

Postmortem

TL;DR From Nov 21 12pm CET to Nov 22 10am CET the Blockchain API was partially down. Later that day, while we were remediating the incident, it was down for another hour.

Summary

In both cases, customers found the issue before we were alerted internally.

Root Cause On Nov 20 we had a partial outage where an RPC provider reported errors at the JSON-RPC level rather than at the HTTP level. One postmortem follow-up was to add logging to determine which other RPC providers do this, so that we can prevent this issue in the future.

The logic for this ended up being flawed: it logged an internal WARN but also failed the requests, without even responding over HTTP.

The root cause of the issue was in the response parsing. After rolling back, we rolled out the fix again, this time missing that the wrong content-type was set on the response, which prevented clients from reading it properly.
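
As an illustration only, the check that the follow-up logging was meant to perform could look roughly like the sketch below. The function name `classify_provider_response` and the labels it returns are hypothetical and not taken from our codebase. The point it shows is that a body that cannot be parsed only produces a label for logging and never raises, so the inspection cannot fail the request it is inspecting.

```python
import json


def classify_provider_response(status: int, body: bytes) -> str:
    """Label an upstream RPC provider response for logging purposes.

    Hypothetical helper: the name and labels are assumptions made for this
    sketch. It never raises, so logging cannot break the request path.
    """
    if status >= 400:
        # The provider signalled the failure at the HTTP level.
        return "http_error"
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers invalid JSON and undecodable bytes; report it, do not fail.
        return "unparseable"
    # JSON-RPC batch responses are arrays, single responses are objects.
    items = payload if isinstance(payload, list) else [payload]
    for item in items:
        if isinstance(item, dict) and item.get("error") is not None:
            # HTTP status looked fine, but the JSON-RPC body carries an error.
            return "jsonrpc_error"
    return "ok"
```

A caller would log a WARN for `"jsonrpc_error"` or `"unparseable"` and then forward the provider's original response unchanged.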

What could we have done better?

This change shouldn’t have made it to production
We should have discovered this issue faster (our alerting is on the HTTP level, but we weren’t responding over HTTP here)
We should have been alerted to both issues before customers found out

Action items

Make sure other issues of this kind still produce an HTTP response @Chris Smith
~~Extend integration tests to cover this type of request~~
Find a bug fix for the parsing and install the logs again ✅
Use RPC in Canary so we are alerted to such issues before customers find out, e.g. an e2e UI canary for web3modal
Ensure integration tests check the content-type of the response
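
A minimal sketch of the kind of assertion an integration test or canary could run against a response from the API is shown below. The helper name `check_rpc_response` is hypothetical, and the expected media type `application/json` is an assumption: this report only says that the wrong content-type was set, not which one is correct. The function takes the status, headers, and body as plain values so it can be used with any HTTP client.

```python
import json
from typing import Mapping


def check_rpc_response(status: int, headers: Mapping[str, str], body: bytes) -> list[str]:
    """Return a list of problems found in an RPC response (empty means healthy).

    Hypothetical helper for tests or a canary; the expected content-type
    "application/json" is an assumption, not confirmed by the report.
    """
    problems = []
    if status != 200:
        problems.append(f"unexpected HTTP status {status}")

    # Header names are case-insensitive, so compare them lowercased.
    content_type = ""
    for name, value in headers.items():
        if name.lower() == "content-type":
            content_type = value
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/json":
        problems.append(f"unexpected content-type {content_type!r}")

    try:
        payload = json.loads(body)
    except ValueError:
        problems.append("body is not valid JSON")
        return problems

    if not isinstance(payload, dict) or payload.get("jsonrpc") != "2.0":
        problems.append("body is not a JSON-RPC 2.0 response")
    elif "result" not in payload and "error" not in payload:
        problems.append("response has neither result nor error")
    return problems
```

A test would assert that the returned list is empty; a canary would raise an alert with the listed problems when it is not.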

Posted Nov 23, 2023 - 04:49 UTC

Resolved

The fix was deployed. We will publish the postmortem shortly.

Posted Nov 22, 2023 - 10:34 UTC

Identified

We _think_ we found the culprit and are rolling back.

Posted Nov 22, 2023 - 10:22 UTC

Investigating

We are currently investigating this issue.

Posted Nov 22, 2023 - 10:04 UTC

This incident affected: Blockchain API.