CH1 - Outage

Incident Report for amazee.io

Postmortem

The investigation is still ongoing because it also involves the platform vendor, but we’re publishing the first learnings from the cluster-wide outage now.

At the time of the outage, a capacity extension of our cluster was being rolled out. This task has been performed many times before without any interruption, and the new compute nodes only join the cluster at a very late stage in the process, so we saw no issue with moving it forward.

Because the root cause analysis is still ongoing, we’ve changed our operational rules for capacity extensions:

Additional nodes will be added to this cluster only during a maintenance window or in the evening, to avoid potential interruptions.
Improved capacity management: we’ll add a capacity check to our weekly processes so that we can be more proactive with scaling operations.

Posted Oct 21, 2019 - 09:54 UTC

Resolved

The incident has been resolved. We're currently working on the post-mortem and will publish it as soon as we have all the information together.

Posted Oct 10, 2019 - 15:01 UTC

Monitoring

We're monitoring the further recovery of the cluster. So far all sites remain online without issues. If you still encounter issues, get back to us and we'll look into them.

Posted Oct 10, 2019 - 13:27 UTC

Update

All sites are back online, and we're continuing to work on stabilizing the cluster. We've lowered the severity to degraded performance, as slower pod start times can be expected until the cluster has fully stabilized.

Posted Oct 10, 2019 - 10:45 UTC

Update

We're still working on solving the issues at hand, and most of the sites are back. Our team is working on fixing the remaining issues and getting all sites fully back online.

Posted Oct 10, 2019 - 10:34 UTC

Identified

We are continuing to work on the issues at hand. Some sites have already recovered, and we're working on getting the remaining sites online as soon as possible.

Posted Oct 10, 2019 - 09:50 UTC

Update

We're continuing to investigate the issue and see that some sites are starting to recover.

Posted Oct 10, 2019 - 08:54 UTC

Investigating

We're currently investigating availability issues on ch1.lagoon.

Posted Oct 10, 2019 - 08:47 UTC