Profile Authentication Full and Later Partial Outage
Incident Report for Reown
Postmortem

Summary

Exhibit 1: AppKit API errors

Exhibit 2: Postgres/Supavisor Client Connections

We experienced a full outage of the airdrop claim page from 12:20 AM Singapore time to ~4:30 AM. The claim page was opened and rolled back three times, and most users trying to claim within the open periods were unable to do so. This delayed the announcement of the airdrop claim from ~1 AM to ~5 AM Singapore time. The root cause was an underprovisioned database combined with a bug in the database provider's connection pooler, which was mitigated by migrating to a different connection pooling provider. The issue was first uncovered by internal alarms and also surfaced via complaints in Telegram and other channels.

Exhibit 3: Hyperdrive errors

Thereafter we experienced a partial outage of the claim page in which more than half of users got sporadic/intermittent internal server errors when trying to log in. The partial outage lasted from ~5 AM to ~10 AM Singapore time, with up to ~65% of queries affected. The root cause was a misconfiguration of the connection pool for the provider we migrated to while mitigating the full outage. The issue was first uncovered by internal metrics; a business decision was made to proceed with taking the claim live despite these errors, and it was then reported by multiple customers via different channels.

Out of scope

  • Congrats claim 0 WCT - There was a bug that showed eligible users who connected with an address different from the recipient address of their airdrop that they had received 0 WCT. This was live until ~3 PM Singapore time and affected a total of ~1.5k users. It is out of scope of this COE, which focuses on the database issues and the learnings from them.
  • Twitter scam list - how we ended up on the Twitter list and how to make sure it won’t happen again is out of scope
  • Disperse - we ran into some issues dispersing the gas, which delayed the launch. That is out of scope here too.

Root Cause

Exhibit 4: registration traffic was lower than claim page traffic

Background The AppKit API powering the claim page login/profile feature is hosted on a serverless Cloudflare Worker. A best practice for database connections in such environments is to use server-side connection pooling. We used Supavisor, the connection pooler provided by Supabase, for this.
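For context, the shape of this setup in a Worker, sketched below with the postgres.js driver and an illustrative DATABASE_URL secret pointing at the pooler (names, table, and values are placeholders, not our exact configuration):

    // Minimal sketch: a Cloudflare Worker talking to Postgres through a
    // server-side pooler such as Supavisor. Bindings/names are illustrative.
    import postgres from "postgres";

    export interface Env {
      // Points at the pooler, not directly at Postgres, e.g.
      // postgres://user:pass@<project>.pooler.supabase.com:6543/postgres
      DATABASE_URL: string;
    }

    export default {
      async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
        // Workers are short-lived, so keep the client-side pool tiny and let
        // the server-side pooler do the real connection reuse.
        const sql = postgres(env.DATABASE_URL, { max: 1, prepare: false });
        try {
          const [row] = await sql`select count(*)::int as profiles from profiles`;
          return Response.json(row);
        } finally {
          // Hand the connection back promptly once the response is done.
          ctx.waitUntil(sql.end({ timeout: 1 }));
        }
      },
    };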

Full outage Traffic to the claim page exceeded the provisioned max connections for Supavisor - see Exhibit 2. We did not expect this, as we did not anticipate more traffic than during the registration phase. The connection spike also caused the database CPU to spike. Vertical scaling normally increases the provisioned max connections, so we vertically scaled twice and brought the site back online, only for the database to exceed its connections again. The reason was a bug in Supavisor, which we have reported: despite vertical scaling, the max connections did not increase. For the third attempt at bringing the site online we changed the connection pooling method; this, unsurprisingly, exceeded the pool size quickly. Building the hypothesis was difficult because we did not expect the Supavisor software itself to be broken. The issue was mitigated by moving to Cloudflare Hyperdrive, an alternative connection pooler. This required code changes and provisioning Hyperdrive, which took considerable time.
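For reference, the shape of the change to move a Worker onto Hyperdrive is roughly the following; the binding name and Hyperdrive id are placeholders, and this is a sketch rather than our exact code:

    // wrangler.toml (placeholder id):
    //   [[hyperdrive]]
    //   binding = "HYPERDRIVE"
    //   id = "<hyperdrive-config-id>"
    //
    // The Worker then connects via the binding's connection string instead of
    // a direct database URL; Hyperdrive maintains the actual pool to Postgres.
    import postgres from "postgres";

    export interface Env {
      HYPERDRIVE: Hyperdrive; // type provided by @cloudflare/workers-types
    }

    export function getDb(env: Env) {
      return postgres(env.HYPERDRIVE.connectionString, { max: 1, prepare: false });
    }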

Exhibit 5: 58000 Postgres errors

Partial outage After confirming that connection pooling was no longer an issue on Hyperdrive, we still saw elevated HTTP 5XX error rates. CloudWatch showed a large number of Postgres 58000 errors, which we first interpreted as database-internal errors. Later we found that the pool size on the database was too low: it was set to 25, while Hyperdrive needs at least 100 at peak traffic levels. The partial outage was mitigated by increasing the pool size to 128 and later to 192.
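While debugging, a query against the standard Postgres catalog views helps to see how close the database is to its connection limits. A rough sketch (nothing here is specific to our schema, and the pool size itself is a setting on the Supabase side):

    // Rough helper to compare connections in use against the configured limit.
    import postgres from "postgres";

    export async function connectionHeadroom(databaseUrl: string) {
      const sql = postgres(databaseUrl, { max: 1 });
      const [usage] = await sql`
        select
          (select count(*) from pg_stat_activity)  as connections_in_use,
          current_setting('max_connections')::int  as max_connections
      `;
      await sql.end({ timeout: 1 });
      // When connections_in_use approaches the pooler's pool size (25 for us,
      // later raised to 128 and then 192), new queries start to fail.
      return usage;
    }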

We announced the claim despite knowing there was a partial outage and accepted it as a business risk.

5 Whys

Why was the claim page down?

Because we exceeded the max provisioned connections.

Why did you exceed the max provisioned connections?

We did not expect more traffic than during the registration phase, where this setup had been resilient.

Why did provisioning more connections not immediately fix the issue?

A bug in Supavisor caused the actual provisioned connections not to increase.

Why did you not switch to Hyperdrive sooner?

It was difficult and risky to build and validate the hypothesis that Supavisor's core functionality was broken.

Why did you not configure Hyperdrive correctly from the start?

We did not research the impact of the pool size enough in the moment, and the error messages on Cloudflare hinted at other issues when we did research.

What went well?

  • Many people helped and joined the incident call
  • The internal metrics we invested in heavily were great and better than what Supabase itself provides
  • Our systems fired alarms immediately

What could we have done better?

  • We could have configured a bigger instance from the start
  • We could have done more research on the pool size ahead of time
  • The third iteration of taking the site online was redundant and had little chance of success
  • We could have prophylactically increased the pool size as an experiment, just to see what happened
  • We could have analyzed more data in the related COE from the registration phase
  • The airdrop was late and the launch fell at midnight for the main operator of the claim site, who was also suffering from food poisoning that day - we could have targeted a launch earlier in the day

How can we prevent this from happening again?

  • We now understand how many concurrent users the database can support under which settings
  • We will improve the query/business logic affected
  • We can alarm on Hyperdrive errors

Action items

Short term

  • Analyze improvements to the Cloudflare Worker setup - audit everything against best practices, e.g. setting prepare=false in Drizzle (see the sketch after this list) - owners @Cali Armut @Celine Sarafa ✅
  • Add Hyperdrive metrics to the Cloudflare dashboard if possible and add an alarm on the errors - owner @Ben Kremer
  • Downscale Supabase instance to 4XL - owner @Cali Armut
  • Set an alarm on Supabase connections - owner @Ben Kremer
  • Cover Serverless Database configuration and pool size in operational readiness checklist - @Ben Kremer
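Regarding the prepare=false item in the first short-term action, the shape of the change with drizzle-orm on top of the postgres.js driver is roughly this (the helper name is illustrative):

    // Drizzle configured for use behind a transaction-mode pooler: prepared
    // statements are disabled because consecutive statements may be routed to
    // different backend connections.
    import { drizzle } from "drizzle-orm/postgres-js";
    import postgres from "postgres";

    export function createDb(connectionString: string) {
      const client = postgres(connectionString, {
        max: 1,         // keep the client-side pool minimal in Workers
        prepare: false, // required for transaction-mode pooling
      });
      return drizzle(client);
    }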

Mid-term

  • Implement query improvements from the short-term analysis - owners for prioritizing @Cali Armut
  • Properly productionize/provision the Hyperdrive setup (currently not infra as code) - owners @Cali Armut @Ben Kremer
  • Make sure pool size is well configured for replicas - owner @Ben Kremer

Long-term

  • Split Cloud database from AppKit to prevent spillover

FAQ

Did the outage cause any spillover to other systems?

Exhibit 6: Cloudflare worker availability

Yes, but AppKit API availability overall remained above 99.99% - usually it is closer to 100%.

Would increasing the pool size immediately not have resolved the Supavisor issue?

Exhibit 7: Lengthy and hard-to-find FAQ docs leave it vague whether Supavisor relies on or circumvents the pool size.

Yes, most likely, after scaling up the database. But given the vague docs from Supabase we were under the impression that Supavisor circumvents the pool size setting. We assumed this because it is a software-as-a-service product, and one would expect Supabase to handle this on our behalf - that is also how the product is marketed.

Exhibit 8: Customer support response

A lengthy customer support response was the only place that made it clear to us that for Supavisor in session mode we still need to manage the pool size.

So yes: increasing the pool size to ~192 from the default of 25, after scaling the instance to 4XL, would almost certainly (99%) have removed the need to migrate to Hyperdrive.

The bug in Supavisor described above also misled us.
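For future readers, the distinction that tripped us up, sketched as connection strings; the host and ports below are the Supabase defaults as far as we understand them, and the user/project values are placeholders:

    // Session mode: each client connection holds one Postgres backend for its
    // whole lifetime, so the database-side pool size directly caps concurrency.
    const SESSION_MODE_URL =
      "postgres://user:pass@aws-0-<region>.pooler.supabase.com:5432/postgres";

    // Transaction mode: backends are multiplexed per transaction, so many more
    // clients can share the same pool (prepared statements must be disabled).
    const TRANSACTION_MODE_URL =
      "postgres://user:pass@aws-0-<region>.pooler.supabase.com:6543/postgres";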

There was a similar outage during registrations - what was that again?

COE airdrop app September 25, 2024

We were on Hyperdrive, had not upgraded the instance, and ran out of pool size. We took an action and root-caused the issue correctly, but we should have taken another action as well - scaling vertically - and then we would probably have discovered the pool size issue earlier.

Posted Nov 29, 2024 - 15:21 UTC
