Service degradation
Incident Report for Reown
Postmortem

Relay Outage Due to Database Running Out of Memory

On May 3rd, 2023, between 12:46 and 15:15 UAE time, end-users experienced full or partial issues connecting to the Relay. This was caused by a database running out of memory. Resolving the issue took too long because the team was misled into believing it was a CPU-bound problem rather than a memory problem. There were no monitors on the memory of the database.

Root Cause

Database ran out of memory due to regular traffic spikes.

What Happened

On May 3rd, 2023, between 12:46 and 15:15 UAE time, end-users experienced full or partial issues connecting to the Relay.

Alarms went off. While the team was already investigating, the on-call engineer, who was in the US where it was 4:46 am, also informed the team and started investigating.

Team quickly concluded that the likely cause was CPU pressure, as CPU was at 100% and there had been earlier concerns about that particular database being CPU-bound.

Team prioritized quick code fixes to cause fewer DB reads and scaled the database horizontally.

Rolling out these changes took more than an hour. Meanwhile, team tried to narrow down the root cause further, without success.

Team later found that memory wasn’t monitored and that memory exhaustion, together with the throttling triggered by low memory, caused the issue.

Team isn’t sure why memory throttling causes CPU spikes in particular, but since team is actively deprecating said DB, it will not dive into the details.

What went well

  • Some alarms fired and on-call plus many Rust team members hopped on a call
  • Team was aware of issue before customers
  • Immediate actions were taken to scale DB cluster horizontally
  • Immediate coding actions were taken
  • Customers were informed via Status page

What didn’t go well

  • Team missed basic DB memory monitors which would have helped identify the issue faster
  • Team was misled by data supporting preconceived hypothesis
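
To illustrate the missing memory monitor, here is a minimal sketch of an alert check that evaluates CPU and memory side by side, so that a 100% CPU reading is not looked at in isolation. The `DbSample` shape, the `evaluate_alerts` function, and the threshold values are all assumptions made for this example; they do not describe the team's actual monitoring setup.

```python
from dataclasses import dataclass


@dataclass
class DbSample:
    """One metrics sample for a database node (hypothetical shape)."""
    cpu_percent: float
    memory_used_bytes: int
    memory_total_bytes: int


def evaluate_alerts(sample, cpu_threshold=90.0, memory_threshold=85.0):
    """Return the names of all alerts that the sample triggers.

    Thresholds are illustrative percentages, not recommended values.
    """
    alerts = []
    memory_percent = 100.0 * sample.memory_used_bytes / sample.memory_total_bytes
    if sample.cpu_percent >= cpu_threshold:
        alerts.append("cpu_high")
    if memory_percent >= memory_threshold:
        alerts.append("memory_high")
    return alerts
```

With a check like this, a sample showing both high CPU and high memory would raise both alerts at once, which might have pointed the investigation at memory earlier.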

Action items

  • Immediate

    • Scale DB cluster vertically - done
    • Add memory monitoring - done
  • Mid-term

    • Add DB profiling
    • Investigate DB autoscaling
  • Long-term

    • Remove DB polling feature for less CPU pressure
    • Batch queries for less DB pressure
    • Introduce stricter internal constraints
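
The "batch queries" item above can be sketched roughly as follows: instead of one DB read per key, keys are grouped into fixed-size batches and each batch is fetched in a single call. This is only an illustration under assumptions. `fetch_many` is a hypothetical callable standing in for a multi-key DB read, and `chunked`, `fetch_all`, and the default batch size are names and values invented for this example.

```python
from itertools import islice


def chunked(keys, batch_size):
    """Yield successive lists of at most batch_size keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(keys)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fetch_all(keys, fetch_many, batch_size=100):
    """Fetch values for keys with one call to fetch_many per batch.

    fetch_many is a hypothetical callable that takes a list of keys and
    returns a dict of key -> value, standing in for a multi-key DB read.
    """
    results = {}
    for batch in chunked(keys, batch_size):
        results.update(fetch_many(batch))
    return results
```

For N keys this issues roughly N / batch_size calls instead of N, which is the kind of reduction in DB pressure the action item aims for.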
Posted May 03, 2023 - 16:05 UTC

Resolved
This incident has been resolved.
Posted May 03, 2023 - 11:40 UTC
Update
We are continuing to monitor for any further issues.
Posted May 03, 2023 - 11:31 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 03, 2023 - 11:30 UTC
Update
We are continuing to investigate this issue.
Posted May 03, 2023 - 09:16 UTC
Investigating
We are currently investigating this issue.
Posted May 03, 2023 - 09:16 UTC
This incident affected: Relay.