Regular Weekly Maintenance
Scheduled Maintenance Report for amazee.io
Postmortem

Date: 2017-09-13
Time: 20:30 UTC - 21:32 UTC

Summary

During our regular amazee.io maintenance window, the rollout of a new release of the third-generation hosting platform hit an unforeseen situation that removed the public_html symlink of production deployments, which made the affected sites unreachable (read more about our deployment system here: https://docs.amazee.io/automated_deployments.html). No data loss occurred.

Timeline

2017-09-13 20:00 UTC
Start of regular weekly maintenance window

2017-09-13 20:17 UTC
New release of the hosting platform merged into the production environment and started to roll out across all servers.

2017-09-13 20:30 UTC
While monitoring the rollout, amazee.io engineers see the first sites become inaccessible. The amazee.io team decides to stop the rollout immediately and assess the situation in a task force video call.

2017-09-13 20:45 UTC
The analysis showed that the newest code of the hosting platform contained a case in which the public_html symlink could be unintentionally removed and replaced with an empty directory. The team started to implement a script that recreates the public_html symlink for each site.

2017-09-13 20:53 UTC
Maintenance page updated to state that maintenance will take longer than usual.

2017-09-13 21:13 UTC
The amazee.io team finished the script that recreates the symlink and tested its full functionality on test servers. The team starts to run the symlink script on all servers.
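The recovery step described above could look roughly like the following sketch. It is not the script the amazee.io team ran. The directory layout is an assumption made only for illustration: each site directory is assumed to contain a release directory, here given the hypothetical name `current`, and `public_html` is expected to be a symlink pointing at it. The function only acts when `public_html` is missing or is an empty directory, so real content is never deleted.

```python
from pathlib import Path


def restore_public_html(site_dir: Path, target_name: str = "current") -> str:
    """Recreate site_dir/public_html as a relative symlink to target_name.

    Assumed layout (hypothetical, not taken from the amazee.io platform):
        site_dir/current/      the deployed release
        site_dir/public_html   symlink that should point at "current"
    """
    link = site_dir / "public_html"
    target = site_dir / target_name

    if not target.is_dir():
        return "skipped: target missing"
    if link.is_symlink():
        # Nothing to repair, the symlink is already in place.
        return "ok: already a symlink"
    if link.exists():
        # Only an empty directory is safe to remove; anything else is left alone.
        if not link.is_dir() or any(link.iterdir()):
            return "skipped: public_html is not an empty directory"
        link.rmdir()

    # A relative target keeps the link valid if the site directory is moved.
    link.symlink_to(target_name, target_is_directory=True)
    return "restored"
```

Returning a status string for each site makes it possible to run the function over every site directory on a server and review afterwards which ones were restored and which were skipped.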

2017-09-13 21:32 UTC
Symlink script executed on all servers and all symlinks restored.

2017-09-13 21:44 UTC
Monitoring showed that some sites still appeared to be offline. A short investigation found that Varnish had cached the error in these cases. The team decided to clear all Varnish caches. All sites are fully working again.
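To illustrate how a cached error can outlive the underlying fix, here is a small heuristic sketch. It is not part of amazee.io's monitoring. It assumes the monitoring check has access to the HTTP status code and response headers, and it relies on the standard HTTP `Age` header, which a cache sets to the number of seconds a response has been stored. The function name is hypothetical.

```python
from typing import Mapping


def looks_like_cached_error(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: an error response with a positive Age was likely served from a cache."""
    if status < 400:
        return False
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return int(lowered.get("age", "0")) > 0
    except ValueError:
        # A malformed Age header gives no evidence either way.
        return False
```

A response flagged this way suggests the origin may already be healthy and that the cache entry needs to be purged or allowed to expire, which matches the situation described in this timeline entry.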

2017-09-13 21:53 UTC
As the rollout is still halted, the amazee.io team searches for and finds the cause of the symlink removal and fixes the code in the hosting platform.

2017-09-13 22:08 UTC
New code is tested on several test servers and sites and found to be fully functional. The team starts to roll it out on all servers.

2017-09-13 22:21 UTC
Newest code rolled out on all servers. The amazee.io team verifies the hosting platform with test deployments and finds it fully stable and working again.

Actions & Mitigations

Issue: The case that removes the public_html symlink was not found during development and testing
Status / Action Taken: amazee.io develops and tests all new code on dedicated test servers and reviews code with a four-eyes check by the team. Unfortunately, no process is perfect and it is possible for bugs to slip through. We will adapt our testing process with more rigorous testing and edge-case finding procedures.

Issue: Maintenance page updated late
Status / Action Taken: As our platform was already in maintenance mode and the status page was therefore showing a maintenance window, we did not think to update the maintenance page when the first websites stopped working. At that time we were also not sure how widely the symlink issue affected our platform. Only a couple of minutes after the issue occurred, we updated the maintenance page to state that sites would be unreachable for longer than usual during this maintenance window. We have changed our processes so that the status page is updated immediately, even if we are already in a maintenance window.

Further information

If you would like further information about this incident, please don't hesitate to contact us at support@amazee.io or directly via Slack.

Posted Sep 13, 2017 - 08:41 UTC

Completed
Maintenance has concluded. We will publish a post-mortem report within the next couple of hours, once we have debriefed the incident with our team.
Posted Sep 12, 2017 - 22:25 UTC
Update
During a feature rollout, a few sites were affected by a bug that caused temporary unavailability. All affected sites have been checked and should be back online. If you still see unavailability, please get back to us in Slack or via support@amazee.io.
Posted Sep 12, 2017 - 21:32 UTC
Update
Maintenance is taking longer than anticipated.
Posted Sep 12, 2017 - 20:53 UTC
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Sep 12, 2017 - 20:00 UTC
Scheduled
We are conducting regular maintenance on our systems.
Posted Sep 12, 2017 - 17:46 UTC