Exhibit 1: AppKit API errors
Exhibit 2: Postgres/Supavisor Client Connections
We experienced a full outage of the airdrop claim page from ~12:20am Singapore time to ~4:30am Singapore time. The claim page was opened and rolled back three times, and most users trying to claim during the open periods were unable to do so. This delayed the announcement of the airdrop claim from ~1am to ~5am Singapore time. The root cause was an underprovisioned database combined with a bug in the connection pooler of the database provider; it was mitigated by migrating to a different connection pooling provider. The issue was surfaced first by internal alarms and then via complaints in Telegram and other channels.
Exhibit 3: Hyperdrive errors
Thereafter we experienced a partial outage of the claim page in which more than half of users got sporadic/intermittent internal server errors when trying to log in. The partial outage lasted from ~5am to ~10am Singapore time. Up to ~65% of queries were affected. The root cause was a misconfiguration of the connection pooling of the provider we migrated to while mitigating the full outage. The issue was first uncovered by internal metrics. A business decision was made to proceed with taking the claim live despite these errors. It was then reported by multiple customers via different channels.
Out of scope
Separately, some users were shown for the recipient address of their airdrop that they received 0 WCT. This was live until ~3pm Singapore time and affected a total of ~1.5k users. It is out of scope of this COE, which focuses on the database issues and the learnings from them.
Exhibit 3: registration traffic was lower than claim page traffic
Background
The AppKit API powering the claim page login/profile feature is hosted on a serverless Cloudflare Worker. A best practice for database connections in such environments is to use server-side connection pooling. We used Supavisor, a connection pooler provided by Supabase, for this.
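As a rough illustration of this setup, the sketch below shows a Worker routing queries through a pooled connection string. It is a minimal sketch under assumptions, not the actual AppKit code: the postgres.js driver, the DATABASE_URL secret name, and the query are all illustrative.

```ts
// Minimal sketch (assumptions, not the actual AppKit code): a Cloudflare Worker
// talking to Postgres through a server-side pooler such as Supavisor.
import postgres from "postgres";

export interface Env {
  // Pooled connection string (the pooler endpoint), stored as a Worker secret.
  DATABASE_URL: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // One short-lived client per invocation; the actual pooling happens
    // server-side in the pooler, not inside the Worker.
    const sql = postgres(env.DATABASE_URL, {
      max: 1,          // a single connection per Worker invocation
      prepare: false,  // prepared statements don't survive transaction-mode pooling
    });
    try {
      const [row] = await sql`select now() as now`;
      return new Response(JSON.stringify(row), {
        headers: { "content-type": "application/json" },
      });
    } finally {
      await sql.end(); // hand the connection back promptly
    }
  },
};
```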
Full outage
The traffic to the claim page exceeded the provisioned max connections for Supavisor - see Exhibit 2. We did not expect this, as we did not expect more traffic than during the registration phase. The connection spike also caused the CPU to spike. Vertical scaling increases the provisioned max connections, so we vertically scaled twice and took the site online again, only for the database to exceed its connections once more. The reason is that we ran into an issue with Supavisor where the max connections did not increase despite vertical scaling; this is a bug in Supavisor which we have reported. When taking the site online a third time we changed the connection pooling method, which, as expected, quickly exceeded the pool size. Building the hypothesis was difficult because we did not expect the Supavisor software itself to be broken. The issue was mitigated by moving to Cloudflare Hyperdrive, an alternative connection pooler. This required updating code and provisioning Hyperdrive, which took a long time.
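For context on what the migration involved, the sketch below shows roughly how a Worker consumes a Hyperdrive binding. The binding name, config ID, and query are placeholders, not our actual configuration.

```ts
// Rough sketch of the Hyperdrive setup (placeholder names, not our actual config).
//
// wrangler.toml:
//   [[hyperdrive]]
//   binding = "HYPERDRIVE"
//   id = "<hyperdrive-config-id>"
import postgres from "postgres";

export interface Env {
  HYPERDRIVE: Hyperdrive; // type from @cloudflare/workers-types
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Hyperdrive gives the Worker a local connection string and handles
    // pooling between Cloudflare and the origin database.
    const sql = postgres(env.HYPERDRIVE.connectionString, {
      max: 1,
      prepare: false,
    });
    try {
      const [row] = await sql`select 1 as ok`;
      return new Response(JSON.stringify(row), {
        headers: { "content-type": "application/json" },
      });
    } finally {
      await sql.end();
    }
  },
};
```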
Exhibit 4: 58000 Postgres errors
Partial outage
After confirming that connection pooling was no longer an issue on Hyperdrive, we still saw elevated amounts of HTTP 5XX errors. CloudWatch showed a large number of 58000 Postgres errors, which we first analyzed as database-internal errors. Later we found that the pool size on the database was too low: it was set to 25, but we need at least 100 for Hyperdrive at peak traffic levels. The partial outage was mitigated by increasing the pool size to 128 and later to 192.
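One generic way to sanity-check this kind of pool-size hypothesis is to compare connection usage against the database's own limits. The snippet below is an illustrative Postgres check, not taken from our runbooks, and DATABASE_URL is a placeholder.

```ts
// Illustrative check of connection usage vs. limits on Postgres; useful for
// validating a "pool is too small" hypothesis. Not from our runbooks.
import postgres from "postgres";

async function connectionReport(databaseUrl: string): Promise<void> {
  const sql = postgres(databaseUrl, { max: 1, prepare: false });
  try {
    const [limits] = await sql`
      select
        current_setting('max_connections')::int      as max_connections,
        (select count(*)::int from pg_stat_activity) as current_connections
    `;
    // Breakdown by state helps distinguish idle pool slots from active load.
    const byState = await sql`
      select coalesce(state, 'unknown') as state, count(*)::int as n
      from pg_stat_activity
      group by 1
      order by n desc
    `;
    console.log({ limits, byState });
  } finally {
    await sql.end();
  }
}

connectionReport(process.env.DATABASE_URL!);
```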
We announced the claim despite knowing there was a partial outage and took it as a business risk.
Why was the claim page down?
Because we exceeded the max provisioned connections.
Why did you exceed the max provisioned connections?
We did not expect more traffic than during the registration phase, where this setup was resilient.
Why did provisioning more connections not immediately fix the issue?
A bug in Supavisor caused the actual provisioned connections not to increase.
Why did you not switch to Hyperdrive sooner?
It was difficult and risky to build and validate the hypothesis that Supavisor’s core functionality was broken.
Why did you not configure Hyperdrive correctly from the start?
We did not research the impact of the pool size deeply enough in the moment, and when we did research, the error messages on Cloudflare hinted at other issues.
Short term
- prepare=false in Drizzle (see the sketch below) - owners @Cali Armut @Celine Sarafa ✅
- Scale the database instance to 4XL - owner @Cali Armut
Mid-term
- Ensure the pool size is well configured for replicas - owner @Ben Kremer
Long-term
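For the short-term prepare=false item, a minimal sketch of what the setting looks like with Drizzle over postgres.js follows; the connection string variable and the query are illustrative, not our actual code.

```ts
// Sketch of the prepare=false action item: disable server-side prepared
// statements at the driver level so queries work behind transaction-mode
// pooling. Variable names and the query are illustrative.
import postgres from "postgres";
import { drizzle } from "drizzle-orm/postgres-js";
import { sql } from "drizzle-orm";

const client = postgres(process.env.DATABASE_URL!, {
  prepare: false, // the key setting
  max: 1,
});
const db = drizzle(client);

// Queries issued through db now avoid named prepared statements, which a
// pooler that multiplexes connections cannot reliably serve.
const result = await db.execute(sql`select 1 as ok`);
console.log(result);
await client.end();
```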
Exhibit 5: Cloudflare worker availability
Yes. But the AppKit API remained overall above 99.99% - usually it’s closer to 100%.
Exhibit 6: Lengthy and hard-to-find FAQ docs still make it vague whether Supavisor relies on or circumvents the Pool size.
Yes, most likely after scaling up the DB. But given the vague docs from Supabase we were under the impression that Supavisor circumvents the Pool size setting. We assumed this because it is a software-as-a-service product and one would expect Supabase to handle this for you; that is also how the product is marketed.
Exhibit 7: Customer support response
A lengthy customer support response was the only place that made clear to us that for Supavisor in session mode we still need to manage the pool size ourselves.
So yes, increasing the pool size to ~192 from the default of 25 after increasing the instance to 4XL would almost certainly (99%) have removed the need to migrate to Hyperdrive.
The bug in Supavisor described above also misled us.
We were on Hyperdrive, had not upgraded the instance, and ran out of pool size. We took an action and root-caused it correctly, but we should have taken another action, namely scaling vertically; then we would probably have discovered the pool size issue earlier.