TL;DR
On the evening of December 4, 2024 (~7 PM CET), we received an AWS Canary alarm for Blockchain-RPC reporting that the Arbitrum chain was unavailable.
We started investigating, since we have several providers for each non-testnet chain and our load-balancing implementation should redirect traffic to whichever provider is available.
We have two “paid” providers for Arbitrum, Pokt/Grove and Infura. Both were unavailable: Infura was rate-limiting us and Pokt/Grove was returning HTTP 503 errors.
We redirected traffic to the one available free provider, Publicnode, but given the traffic volume we were quickly rate-limited there as well.
We have another “paid” provider, Quicknode, with a large number of RPC calls included, but its plan limits the number of endpoints and we were already using them for other chains.
As a hotfix, we decided to swap a less-used chain for Arbitrum on the Quicknode provider. This resolved the chain RPC outage by redirecting Arbitrum traffic to Quicknode.
Summary
Root Cause
Both of our paid providers for Arbitrum were down, and the traffic volume was too high for free RPCs:
- Infura: We had a pay-as-you-go plan above 10 million calls per day. Infura switched the plan back to a fixed limit without letting us know. We usually don’t hit the limit because we balance load across providers; we were affected only when Pokt went down and all load went to Infura.
- Pokt/Grove: Pokt was down for about 24 hours. We are in touch with their team but have not received a proper root cause. They acknowledged that 50% of all calls to Pokt were failing.
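The failure mode described above, where one provider's outage pushes all traffic onto another provider's fixed quota, can be checked ahead of time. Below is a minimal sketch in Python. The provider names, daily limits, and traffic figure are hypothetical placeholders for illustration and are not our real quotas or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    daily_limit: int  # max RPC calls per day (illustrative value only)


def survives_single_outage(
    providers: list[Provider], daily_traffic: int
) -> dict[str, bool]:
    """For each provider, report whether the remaining providers could
    absorb the full daily traffic if that one provider went down."""
    total = sum(p.daily_limit for p in providers)
    return {p.name: total - p.daily_limit >= daily_traffic for p in providers}


# Hypothetical numbers for illustration, not our real quotas or traffic.
providers = [
    Provider("provider_a", daily_limit=10_000_000),
    Provider("provider_b", daily_limit=30_000_000),
]
result = survives_single_outage(providers, daily_traffic=15_000_000)
print(result)
```

In this made-up setup, losing `provider_a` is survivable, but losing `provider_b` is not, because the remaining quota is below the daily traffic. A check like this only covers quota headroom; it says nothing about provider reliability.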
5 Whys
Why was the service down for the Arbitrum chain?
- We experienced an increase in Arbitrum traffic, and both of our “paid” providers that can handle that volume of requests were down.
Why are we using paid providers for the Arbitrum chain?
- We have a volume of traffic that only paid plans can handle.
Why are we using Pokt and Infura for this?
- Infura has a pay-as-you-go plan, so we shouldn’t be rate-limited.
- Pokt has a good RPC calls per $ ratio and can handle this amount of traffic at a lower cost.
Why did these providers become unavailable?
- Pokt rebranded earlier this year, and since then we have had issues with their RPC nodes almost every month, such as unavailable nodes or flaky responses. We still use it for other chains because it’s cheap and includes plenty of RPC calls.
- Infura forgot to reinstate our rate-limiting exception, which had expired.
Why don’t we have more providers?
- We have another paid provider, GetBlock, but its limits are too low for us: we exhaust the quota after a few days of usage, and then it drops out because of rate-limiting.
- We assumed that two paid providers with high limits would be enough.
What could we have done better?
We should remove unstable providers (Pokt) and providers with low limits (GetBlock), so that we can be sure we are using at least two stable providers.
How can we prevent this from happening again?
- We should better monitor which providers are consistently flaky and remove them, along with providers whose limits are too low for us.
- We should find and add more stable RPC providers with high limits.
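One way to make "consistently flaky" measurable is to track a rolling error rate per provider and flag any provider that exceeds a threshold. The sketch below is only an illustration: the class and function names, the window size, and the 20% threshold are assumptions, not part of our current load-balancing implementation.

```python
from collections import deque


class ProviderHealth:
    """Keeps the most recent call outcomes for one provider."""

    def __init__(self, window: int = 100) -> None:
        # Oldest outcomes are discarded once the window is full.
        self.outcomes: deque[bool] = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)


def flaky_providers(
    health: dict[str, ProviderHealth], threshold: float = 0.2
) -> list[str]:
    """Return the names of providers whose rolling error rate exceeds threshold."""
    return sorted(name for name, h in health.items() if h.error_rate() > threshold)


# Hypothetical data: provider_a always succeeds, provider_b fails half its calls.
health = {
    "provider_a": ProviderHealth(window=10),
    "provider_b": ProviderHealth(window=10),
}
for _ in range(10):
    health["provider_a"].record(True)
for ok in [True, False] * 5:
    health["provider_b"].record(ok)

flaky = flaky_providers(health)
print(flaky)
```

A rolling window is used here so that a provider that recovers stops being flagged once its recent calls succeed. A real setup would likely also export these rates as metrics and alert on them over longer periods.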
Action items
- [ ] Find at least two new RPC providers with a pay-as-you-go plan and a good RPC calls/$ ratio to use not just for Arbitrum, but for all top-traffic chains.
- [ ] Remove the Pokt and GetBlock providers in favor of the new ones.
- [ ] Add more free native RPC endpoints where available to take some load off the paid providers.