Critical server issues on UK1

Incident Report for amazee.io

Postmortem

Date: 2019-05-08, 05:15 to 09:52 (4 hours 37 minutes); all times in CEST

Summary

At 05:00 on 8 May 2019, a monitoring alert was triggered for high disk usage on uk1.compact.amazee.io. This is a frequent event, as one site uses a very large amount of disk space and regularly reaches the disk limits of the cloud provider. An engineer acknowledged the alert and proceeded to clean up the disk to resolve the issue. While running the clean-up task, an incorrect removal command was executed, which deleted important system files. The SSH session locked up and the engineer lost access to the machine. The issue was immediately escalated to other team members, who were able to respond. After initial triage we decided it would be easier to provision a new VM than to attempt recovery of the old one. This was done over the next few hours, and restoration of most sites was completed at 09:20.
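For illustration only: the exact clean-up command involved has not been published. The hypothetical Python sketch below shows the general class of mistake such a clean-up task can make when its target path is mis-set (for example, left empty by an unset variable), and a simple guard against it; all paths and names are illustrative assumptions, not details from the incident.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: how a disk clean-up whose target path is mis-set can
end up removing the wrong files, and a guard that refuses to run in that case.
Not the actual command from the incident; paths are illustrative only."""

import shutil
import sys
from pathlib import Path

# Assumed project data root for this sketch; nothing outside it may be removed.
ALLOWED_ROOT = Path("/var/www")

def clean_project_dir(target: str) -> None:
    path = Path(target).resolve()

    # Guard: an unset variable upstream can turn the target into "" or "/",
    # which would otherwise point the removal at system files.
    if not target or path == Path("/") or ALLOWED_ROOT not in path.parents:
        raise SystemExit(f"refusing to remove {str(path)!r}: outside {ALLOWED_ROOT}")

    shutil.rmtree(path)

if __name__ == "__main__":
    # With no argument (or an empty one), the guard aborts instead of deleting.
    clean_project_dir(sys.argv[1] if len(sys.argv) > 1 else "")
```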

A single site was affected by a corrupted database and required a database restore, which was completed at 09:52.

Finally, Solr was not cleanly installed via the Puppet provisioning, which meant some manual intervention was required to fully restore Solr services on the node. This took longer than expected, and all services were restored at 13:21.

Timeline

05:00 - Monitoring alerts about high disk usage
05:14 - Engineer runs command to clean up disk space on one of the projects
05:15 - Server becomes unstable and monitoring alerts about unavailability of sites
05:21 - Team is involved
05:34 - Additional engineer is involved
05:34 - External hosting partner is involved
09:20 - Most sites are back online
09:30 - Case is handed over to the EU team
09:30 - One database is in a corrupt state, which causes issues for the entire database server; this causes some short downtimes due to MySQL restarts
09:40 - Decision to restore the affected site from backup
09:52 - Restore successfully finished - all sites are back online
10:30 - Work on solving the Solr issues - sites are stable, only Solr search is still unavailable
11:30 - Work continues on the Solr issues - sites are stable, only Solr search is still unavailable
12:15 - Another engineer who can work on the Solr issues is involved
13:16 - Solr comes back up with all indexes available
13:21 - Testing confirms Solr is fully functional

Please find the full report here: https://docs.google.com/document/d/1cp_pE6SrZgBPJMeuCT5ATOLpZFRrvUD5vv4ISoWf-Ns/edit?usp=sharing

Posted May 09, 2019 - 09:33 UTC

Resolved

All services are back online. We will conduct a post-mortem and share the report here on our status page.
Posted May 08, 2019 - 11:33 UTC

Monitoring

All Solr indexes are back - We're conducting checks to verify all systems are functioning correctly.
Posted May 08, 2019 - 11:23 UTC

Update

We're continuing the efforts on the affected Solr search indexes. We expect the remaining failing indexes to be available within the next hour.
Posted May 08, 2019 - 10:52 UTC

Update

All sites are back online - we're continuing work on the Solr search servers that were also impacted by the outage.
Posted May 08, 2019 - 08:41 UTC

Update

Most sites have now been restored; however, we're still working on some ongoing issues with Solr.
Posted May 08, 2019 - 06:40 UTC

Update

Restoration is mostly complete, but we're still having a few issues with restoring the Solr configuration; this should be completed in the next 30 minutes.
Posted May 08, 2019 - 05:38 UTC

Identified

We need to restore the server from backups, current ETA for restoration is 1 hour.
Posted May 08, 2019 - 04:31 UTC

Investigating

We are currently investigating this issue.
Posted May 08, 2019 - 03:24 UTC