ch1.lagoon - OpenShift Upgrade
Incident Report for amazee.io
Resolved
The upgrade is complete, and the cluster has been stable for the last few hours.
Posted 12 days ago. Oct 06, 2018 - 04:16 CEST
Monitoring
We have completed applying updates and doing the final reboots of all systems. We will monitor the cluster for a while longer before calling this emergency maintenance concluded.
Posted 12 days ago. Oct 06, 2018 - 02:08 CEST
Update
All compute node updates are finished. Based on our monitoring, all sites are back online. We're currently conducting the post-upgrade checks.
Posted 12 days ago. Oct 05, 2018 - 23:20 CEST
Update
We will now continue with the OpenShift upgrade of the compute nodes.
Posted 12 days ago. Oct 05, 2018 - 19:58 CEST
Update
The control plane update is complete and the cluster is stable. We'll start upgrading the compute nodes after 8pm CEST to move the emergency maintenance further outside business hours.
Posted 12 days ago. Oct 05, 2018 - 18:09 CEST
Update
We're conducting checks after the control plane upgrade and assessing when to continue with the upgrade.
Posted 12 days ago. Oct 05, 2018 - 16:37 CEST
Identified
Sites are starting to come back online. We're monitoring the situation.
Posted 12 days ago. Oct 05, 2018 - 16:13 CEST
Investigating
We're continuing to investigate the issue that is causing routes to fail.
Posted 12 days ago. Oct 05, 2018 - 16:01 CEST
Update
We're seeing issues with sites being down. We're investigating.
Posted 12 days ago. Oct 05, 2018 - 15:57 CEST
Update
The control plane upgrade looks good. There could be intermittent deployment errors during the upgrade. If you run into one, get in touch with our engineers and we'll look into it.
Posted 12 days ago. Oct 05, 2018 - 15:44 CEST
Update
We found a small issue during the control plane upgrade, which has been fixed. We are carrying on with the control plane upgrade.

As soon as we finish this part, we'll carry on and update the compute nodes. We'll post an update here as soon as that starts.

During the compute node upgrade, there may be short downtimes due to the restart of the containers.
Posted 12 days ago. Oct 05, 2018 - 14:51 CEST
Update
The control plane update is still running. We'll post an update as soon as new information becomes available.
Posted 12 days ago. Oct 05, 2018 - 13:38 CEST
Update
The control plane update is currently running. We expect brief downtimes (< 1 minute) and will take action if needed.
Posted 12 days ago. Oct 05, 2018 - 12:35 CEST
Identified
Pre-maintenance tasks have been completed successfully.

We plan to start the first part of the OpenShift upgrade momentarily.
Downtimes should be minimal (< 1 minute). Our engineers are monitoring the situation closely and will take action if needed.
Posted 12 days ago. Oct 05, 2018 - 11:30 CEST
Investigating
amazee.io had planned to upgrade the ch1.lagoon cluster on October 11, 2018. However, due to persistent issues being experienced on the cluster, we have decided to trigger an emergency maintenance window, and begin upgrading the cluster immediately. We believe that this will bring stability to the cluster, and resolve the issues being experienced by a number of our customers.

Our team is currently finalizing the plans for the upgrade, and will post a follow-up as soon as we have settled the strategy and timeline.

We are sorry for any inconvenience this may cause, and would like to assure our customers that we are doing everything within our power to keep downtime to an absolute minimum.

If you need any further information, our engineering team are available on Slack & Rocketchat. You are also welcome to reach out to us via email using support@amazee.io.
Posted 12 days ago. Oct 05, 2018 - 11:19 CEST
This incident affected: Switzerland (ch1.lagoon).