Time: 20:30 UTC - 21:32 UTC
During our regular amazee.io maintenance window, the rollout of a new release of the 3rd generation hosting platform hit an unforeseen situation that caused the public_html symlink of production deployments to be removed, which made the affected sites unreachable (read more about our deployment system here: https://docs.amazee.io/automated_deployments.html). No data loss occurred.
2017-09-13 20:00 UTC
Start of regular weekly maintenance window
2017-09-13 20:17 UTC
New release of hosting platform merged into production environment and started to roll out across all servers.
2017-09-13 20:30 UTC
While monitoring the rollout, amazee.io engineers see that the first sites are no longer accessible. The amazee.io team decides to immediately stop the rollout of the code and assess the situation via a taskforce video call.
2017-09-13 20:45 UTC
Analysis of the situation showed that the newest code of the hosting platform had a case in which the public_html symlink could be unintentionally removed and replaced with an empty directory.
The team started to implement a script that recreates the public_html symlink for each site.
2017-09-13 20:53 UTC
Maintenance page updated to state that maintenance will take longer than usual.
2017-09-13 21:13 UTC
The amazee.io team finished the script that recreates the symlink and tested its full functionality on test servers. The team starts to run the symlink script on all servers.
2017-09-13 21:32 UTC
Symlink script executed on all servers and all symlinks restored.
2017-09-13 21:44 UTC
Monitoring showed that some sites were still not back online. A short investigation found that Varnish had cached the error in these cases. The team decided to clear all Varnish caches. All sites are fully working again.
2017-09-13 21:53 UTC
As the rollout is still halted, the amazee.io team searches for and finds the cause of the symlink removal and fixes the code in the hosting platform.
2017-09-13 22:08 UTC
New code is tested on several test servers and sites and found to be fully functioning. Rollout to all servers started.
2017-09-13 22:21 UTC
Newest code rolled out on all servers. amazee.io team tests full functioning of hosting platform with test deployments and finds the platform to be fully stable and working again.
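The script that was run during the incident is not reproduced here. As an illustration only, the repair step described in the timeline (an empty public_html directory replaced by a symlink again) could look roughly like the following minimal sketch. The function name `restore_symlink` and the choice of target directory are assumptions made for this example, not the actual amazee.io tooling or directory layout.

```python
import os
from pathlib import Path


def restore_symlink(link: Path, target: Path) -> bool:
    """Recreate `link` as a symlink to `target`.

    Hypothetical helper for illustration. Returns True if a repair was
    made and False if the symlink was already correct.
    """
    if link.is_symlink():
        if os.readlink(link) == str(target):
            return False  # already points at the right place
        link.unlink()  # stale symlink, replace it
    elif link.is_dir():
        # rmdir() only removes an empty directory and raises OSError
        # otherwise, so real content is never deleted by accident.
        link.rmdir()
    elif link.exists():
        raise RuntimeError(f"{link} is an unexpected regular file")
    link.symlink_to(target, target_is_directory=True)
    return True
```

Because the function returns False when nothing needs to be done, it can be run repeatedly across servers without side effects, and the refusal to remove a non-empty directory keeps the repair itself from causing data loss.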
|Issue||Status / Action Taken|
|Symlink removal case not found before release||amazee.io develops and tests all new code on dedicated test servers and does code reviews with a four-eyes check by the team. Unfortunately no process is perfect and it's possible for bugs to slip through. We will adapt our testing process with more rigorous testing and edge case finding procedures.|
|Maintenance page updated late||As our platform was already in maintenance mode and the status page was therefore already showing a maintenance window, we did not think to update the maintenance page when the first websites stopped working. At that time we were also not sure how widely the symlink issue affected our platform. Only some minutes after the issue occurred did we update the maintenance page to state that sites would be unreachable for longer than usual during this maintenance window. We have changed our processes so that the status page is updated immediately, even if we are already in a maintenance window.|
If you would like further information about this incident, please don't hesitate to contact us at email@example.com or directly via Slack.