Critical server issues on UK1
Incident Report for amazee.io
Postmortem

Date: 2019-05-08, 05.15 to 09.52 (4 hours 37 minutes). All times are in CEST.

Summary

At 05.00 on 8 May 2019, a monitoring alert was triggered for high disk usage on uk1.compact.amazee.io. This alert fires frequently because one site uses a very large amount of disk space and reaches the disk limits of the cloud provider. An engineer acknowledged the alert and began cleaning up the disk. During the clean-up, an incorrect removal command was run, which deleted important system files. The SSH session locked up and the engineer lost access to the machine. The issue was immediately escalated to other team members, who were able to respond. After initial triage, we decided that provisioning a new VM would be easier than attempting recovery on the old one. This work was carried out over the next few hours, and most sites were restored by 09.20.

A single site was affected by a corrupted database and required a database restore, which was completed at 09.52.

Finally, Solr was not installed cleanly by the Puppet provisioning, so manual intervention was required to fully restore Solr services on the node. This took longer than expected, and all services were restored at 13.21.
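
The report does not state which removal command was run, so the following is only a minimal sketch of one possible safeguard for disk clean-up tasks, not the procedure used on uk1. It assumes clean-up is scripted in Python and that each project has a known root directory. The names `guarded_remove`, `resolve_safe_target` and `UnsafePathError` are hypothetical. The idea is to resolve the target path first, refuse anything that is not strictly inside the allowed root, and default to a dry run.

```python
from pathlib import Path
import shutil


class UnsafePathError(ValueError):
    """Raised when a deletion target falls outside the allowed root."""


def resolve_safe_target(target: str, allowed_root: str) -> Path:
    """Return the resolved target, or raise if it is not strictly below allowed_root."""
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()  # collapses ".." and follows symlinks
    # Refuse the root itself and anything that is not underneath it.
    if path == root or root not in path.parents:
        raise UnsafePathError(f"refusing to delete {path}: not inside {root}")
    return path


def guarded_remove(target: str, allowed_root: str, dry_run: bool = True) -> Path:
    """Delete target only if it lives below allowed_root (hypothetical helper)."""
    path = resolve_safe_target(target, allowed_root)
    if dry_run:
        # Default mode: report what would be removed, touch nothing.
        print(f"[dry-run] would remove {path}")
        return path
    if path.is_dir():
        shutil.rmtree(path)
    else:
        path.unlink()
    return path
```

With a guard like this, a mistyped or empty path that resolves to `/` or to a parent of the project directory raises an error instead of deleting files, and the operator has to pass `dry_run=False` explicitly before anything is removed.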

Timeline

05.00 - Monitoring alerts about high disk usage
05.14 - Engineer runs command to clean up disk space on one of the projects
05.15 - Server becomes unstable and monitoring alerts about unavailability of sites
05.21 - Team is involved
05.34 - Additional Engineer is involved
05.34 - External Hosting partner is involved
09.20 - Most sites are back online
09.30 - Case is handed over to the EU team
09.30 - One database is in a corrupt state, which causes issues for the entire database server. This leads to some short downtimes due to MySQL restarts
09.40 - Decision to restore the site from backup
09.52 - Restore successfully finished - All sites are back online
10.30 - Work on solving the Solr issues - Sites are stable; only Solr search is still unavailable
11.30 - Work on solving the Solr issues - Sites are stable; only Solr search is still unavailable
12.15 - Another engineer who can work on the Solr issues is brought in
13.16 - Solr comes back up with all indexes available
13.21 - Testing confirms Solr is fully functional

The full report is available here: https://docs.google.com/document/d/1cp_pE6SrZgBPJMeuCT5ATOLpZFRrvUD5vv4ISoWf-Ns/edit?usp=sharing

Posted 2 months ago. May 09, 2019 - 11:33 CEST

Resolved
All services are back online. We will conduct a post-mortem and share the report here on our status page.
Posted 2 months ago. May 08, 2019 - 13:33 CEST
Monitoring
All Solr indexes are back - We're conducting checks to verify all systems are functioning correctly.
Posted 2 months ago. May 08, 2019 - 13:23 CEST
Update
We're continuing the efforts on the affected Solr search indexes. We expect the remaining failing indexes to be available within the next hour.
Posted 2 months ago. May 08, 2019 - 12:52 CEST
Update
All sites are back online - we're continuing work on the Solr search servers that are also impacted by the outage.
Posted 2 months ago. May 08, 2019 - 10:41 CEST
Update
Most sites have now been restored; however, we're still working on some ongoing issues with Solr.
Posted 2 months ago. May 08, 2019 - 08:40 CEST
Update
Restoration is mostly complete, but we're still having a few issues restoring the Solr configuration. This should be completed in the next 30 minutes.
Posted 2 months ago. May 08, 2019 - 07:38 CEST
Identified
We need to restore the server from backups, current ETA for restoration is 1 hour.
Posted 2 months ago. May 08, 2019 - 06:31 CEST
Investigating
We are currently investigating this issue.
Posted 2 months ago. May 08, 2019 - 05:24 CEST
This incident affected: United Kingdom (uk1.compact).