Date: 2019-05-08, 05.15 to 09.52 (4 hours 37 minutes); all times in CEST
At 05.00 on 8 May 2019, a monitoring alert was triggered for high disk usage on uk1.compact.amazee.io. This is a frequent event, caused by a site that uses a very large amount of disk space and reaches the disk limits of the cloud provider. An engineer acknowledged the alert and proceeded to clean up the disk to resolve the issue. During the cleanup, an incorrect removal command was run, which deleted important system files. The SSH session locked up and the engineer lost access to the machine. The issue was immediately escalated to other team members, who were able to respond. After the initial triage, we decided that provisioning a new VM would be easier than attempting recovery on the old VM. This was done over the next few hours, and most sites were restored by 09.20.
A single site was affected by a corrupted database and required a database restore, which was completed at 09.52.
Finally, Solr was not cleanly installed by the Puppet provisioning, so manual intervention was required to fully restore Solr services on the node. This took longer than expected, and all services were restored at 13.21.
05.00 - Monitoring alerts about high disk usage
05.14 - Engineer runs command to clean up disk space on one of the projects
05.15 - Server becomes unstable and monitoring alerts about unavailability of sites
05.21 - Team is involved
05.34 - Additional engineer is involved
05.34 - External hosting partner is involved
09.20 - Most sites are back online
09.30 - Handing over the case to the EU Team
09.30 - One database is in a corrupt state, which causes issues for the entire database server. This causes some short downtimes due to MySQL restarts
09.40 - Decision to restore the site from backup
09.52 - Restore successfully finished - All sites are back online
10.30 - Work on solving the Solr issues. Sites are stable; only Solr search is still unavailable
11.30 - Work on solving the Solr issues. Sites are stable; only Solr search is still unavailable
12.15 - Involving another engineer who can work on the Solr issues
13.16 - Solr comes back up with all indexes available
13.21 - Solr testing confirms Solr is fully functional
Please find the full report here: https://docs.google.com/document/d/1cp_pE6SrZgBPJMeuCT5ATOLpZFRrvUD5vv4ISoWf-Ns/edit?usp=sharing