On May 3rd, 2023, from 12:46 pm to 3:15 pm UAE time, end-users experienced full or partial issues connecting to the Relay. This was caused by a database running out of memory. Resolving the issue took too long because the team was misled into believing it was a CPU-bound problem rather than a memory problem. There were no monitors on the memory of the database.
Database ran out of memory due to regular traffic spikes.
On May 3rd, 2023, from 12:46 pm to 3:15 pm UAE time, end-users experienced full or partial issues connecting to the Relay.
Alarms went off. While the team was already investigating, the on-call engineer, who was in the US where it was 4:46 am, also informed the team and started investigating.
Team quickly concluded that it was likely CPU pressure, as CPU was at 100% and there had been earlier concerns about that particular database being CPU-bound.
Team prioritized quick code fixes to reduce DB reads and scaled the database horizontally.
Rolling out these changes took more than an hour. Meanwhile, team kept trying to find the root cause, without success.
Team later found that memory wasn’t monitored and that memory exhaustion, together with the throttling it triggered, caused the issue.
Team isn’t sure why memory throttling causes CPU spikes in particular, but since the team is actively deprecating said DB, it will not dive into the details.
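One way to make this kind of misdiagnosis less likely is to evaluate free memory alongside CPU, so that a pegged CPU does not hide a memory problem. The sketch below is illustrative only: the DbSample shape, the evaluate function, and both thresholds are hypothetical and are not taken from the team's actual monitoring setup.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DbSample:
    """One metrics sample for a database node (hypothetical shape)."""
    cpu_percent: float        # 0-100
    free_memory_bytes: int
    total_memory_bytes: int


# Hypothetical thresholds, chosen only for illustration.
CPU_ALERT_PERCENT = 90.0
FREE_MEMORY_ALERT_RATIO = 0.10  # alert when under 10% of memory is free


def evaluate(sample: DbSample) -> list[str]:
    """Return the names of the alerts that fire for this sample."""
    alerts = []
    free_ratio = sample.free_memory_bytes / sample.total_memory_bytes
    # Memory is checked first so it is listed ahead of the CPU alert.
    if free_ratio < FREE_MEMORY_ALERT_RATIO:
        alerts.append("low_free_memory")
    if sample.cpu_percent >= CPU_ALERT_PERCENT:
        alerts.append("high_cpu")
    return alerts


if __name__ == "__main__":
    # CPU at 100% while only about 3% of memory is free: both alerts fire,
    # so responders see the memory signal next to the CPU one.
    sample = DbSample(
        cpu_percent=100.0,
        free_memory_bytes=256 * 1024**2,
        total_memory_bytes=8 * 1024**3,
    )
    print(evaluate(sample))
```

With a check like this in place, a sample showing 100% CPU and very little free memory would raise both signals at once, rather than pointing at CPU alone.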
Immediate
Mid-term
Long-term