ch1-lagoon - Unavailability on certain sites.
Incident Report for amazee.io
Postmortem

While adding another node to the Amazee CH1 OpenShift cluster, an outdated X.509 certificate was used by mistake (a node with the same name had been part of the cluster for a few weeks in 2018). Shortly after the node joined the cluster, monitoring reported the wrong certificate. During the recovery effort one other compute node was impacted, and all pods scheduled on that node were forcefully terminated, causing a brief outage for websites running on that node. Redeploying the affected certificates on all nodes resolved the issue.

To prevent this from recurring, we will no longer reuse hostnames.
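
As an illustration only, the kind of pre-flight check that could catch a leftover certificate before a node joins might look like the sketch below. It is not part of the actual remediation. It assumes the certificate's validity dates are available as strings in the format printed by `openssl x509 -noout -dates`, and it treats a certificate issued more than one day before the node joined as suspicious. The function names, the one-day tolerance, and all dates in the usage example are hypothetical.

```python
import ssl
from datetime import datetime, timedelta, timezone


def parse_cert_time(value: str) -> datetime:
    """Parse an OpenSSL-style date such as 'Jun  1 00:00:00 2018 GMT' into UTC."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def certificate_problems(not_before: str, not_after: str, node_joined: datetime,
                         max_lead: timedelta = timedelta(days=1)) -> list:
    """Return a list of reasons why the certificate looks stale (empty if none).

    max_lead is an assumed tolerance: a certificate issued earlier than
    node_joined - max_lead is flagged as possibly left over from an older
    node that used the same hostname.
    """
    issued = parse_cert_time(not_before)
    expires = parse_cert_time(not_after)
    problems = []
    if expires <= node_joined:
        problems.append("expired")
    if issued < node_joined - max_lead:
        problems.append("issued long before node joined")
    return problems


if __name__ == "__main__":
    # Hypothetical dates, chosen only to exercise the check.
    joined = datetime(2019, 4, 3, 11, 0, tzinfo=timezone.utc)
    print(certificate_problems("Jun  1 00:00:00 2018 GMT",
                               "Jun  1 00:00:00 2020 GMT", joined))
```

A check like this complements the hostname policy rather than replacing it: it flags a certificate by its issue date even when a name is reused by accident.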

Posted Apr 03, 2019 - 15:09 UTC

Resolved
This incident has been resolved.
Posted Apr 03, 2019 - 12:11 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 03, 2019 - 11:46 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 03, 2019 - 11:27 UTC
Investigating
We are currently investigating this issue.
Posted Apr 03, 2019 - 11:22 UTC