Service degradation

Incident Report for Reown

Postmortem

Relay Outage Due to DB out of Memory

On May 3rd, 2023, from 12:46 to 15:15 UAE time, end-users experienced full or partial issues connecting to the Relay. This was caused by a database running out of memory. Resolving this issue took too long because the team was misled into believing it was a CPU-bound issue rather than a memory issue. There were no monitors on the memory of the database.

Root Cause

Database ran out of memory due to regular traffic spikes.

What Happened

On May 3rd, 2023, from 12:46 to 15:15 UAE time, end-users experienced full or partial issues connecting to the Relay.

Alarms went off. While the team was already investigating, the on-call engineer, who was in the US where it was 4:46 am, also informed the team and started investigating.

The team quickly concluded that the likely cause was CPU pressure, as CPU was at 100% and there had been concerns about that particular database being CPU bound before.

The team prioritized quick code fixes to reduce DB reads and scaled the database horizontally.

Rolling out these changes took more than an hour. Meanwhile, the team tried unsuccessfully to narrow down the root cause further.

The team later found that memory wasn’t monitored and that memory exhaustion, together with throttling due to low memory, caused the issue.

The team isn’t sure why memory throttling causes CPU spikes in particular, but since the team is actively deprecating this DB, it will not dive into the details.

What went well

Some alarms fired and on-call plus many Rust team members hopped on a call
Team was aware of issue before customers
Immediate actions were taken to scale DB cluster horizontally
Immediate coding actions were taken
Customers were informed via Status page

What didn’t go well

Team missed basic DB memory monitors which would have helped identify the issue faster
Team was misled by data supporting a preconceived hypothesis
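
To illustrate the monitoring gap described above, here is a minimal sketch of a check that alerts both when a metric crosses a threshold and when a metric has no data at all. The metric names, threshold values, and the `evaluate` function are hypothetical and are not taken from the team's actual monitoring setup.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Threshold:
    metric: str     # hypothetical metric name, e.g. "memory_used_pct"
    warn_at: float  # utilization percentage at which to alert


# Assumed thresholds for illustration; the real values were not published.
THRESHOLDS = [
    Threshold("cpu_used_pct", 90.0),
    Threshold("memory_used_pct", 85.0),
]


def evaluate(samples: Dict[str, Optional[float]]) -> List[str]:
    """Return alert messages for breached metrics and for metrics with no data."""
    alerts = []
    for t in THRESHOLDS:
        value = samples.get(t.metric)
        if value is None:
            # A missing series is itself worth alerting on: it means the
            # resource is not being watched at all.
            alerts.append(f"{t.metric}: no data (metric is not monitored)")
        elif value >= t.warn_at:
            alerts.append(f"{t.metric}: {value:.0f}% >= {t.warn_at:.0f}%")
    return alerts
```

With only a CPU sample available, such a check would report the CPU breach and also flag that memory has no data, which points at the blind spot instead of reinforcing a CPU-only hypothesis.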

Action items

Immediate
- Scale DB cluster vertically - done
- Add memory monitoring - done
Mid-term
- Add DB profiling
- Investigate DB autoscaling
Long-term
- Remove DB polling feature for less CPU pressure
- Batch queries for less DB pressure
- Introduce stricter internal constraints
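
As a rough illustration of the "batch queries" item above, the sketch below groups individual key lookups into fixed-size batches so that the database sees fewer round trips. The `fetch_many` callable, the key type, and the default batch size are assumptions made for the example; the Relay's actual data access layer is not described in this report.

```python
from typing import Callable, Dict, Iterator, Sequence


def chunked(keys: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `keys`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def fetch_batched(
    keys: Sequence[str],
    fetch_many: Callable[[Sequence[str]], Dict[str, str]],
    batch_size: int = 100,
) -> Dict[str, str]:
    """Look up all keys using one `fetch_many` call per batch.

    `fetch_many` is a hypothetical stand-in for a multi-key database read.
    """
    results: Dict[str, str] = {}
    for batch in chunked(keys, batch_size):
        results.update(fetch_many(batch))
    return results
```

For five keys and a batch size of two, this issues three multi-key reads instead of five single-key reads.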

Posted May 03, 2023 - 16:05 UTC

Resolved

This incident has been resolved.

Posted May 03, 2023 - 11:40 UTC

Update

We are continuing to monitor for any further issues.

Posted May 03, 2023 - 11:31 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 03, 2023 - 11:30 UTC

Update

We are continuing to investigate this issue.

Posted May 03, 2023 - 09:16 UTC

Investigating

We are currently investigating this issue.

Posted May 03, 2023 - 09:16 UTC

This incident affected: Relay.