Regular Weekly Maintenance - Europe - extended
Incident Report for amazee.io
Postmortem

During weekly maintenance we updated the ch1.lagoon OpenShift cluster to version v3.9.57. Unfortunately, this version contains a previously unknown regression that breaks the “subPath” functionality of VolumeMounts. All mariadb-galera clusters use this functionality to mount their persistent data volumes into the container.
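
For readers unfamiliar with “subPath”: it is an optional field on a container's volumeMount that mounts only a subdirectory of the volume instead of its root. The following is a minimal sketch, not taken from Lagoon, of how one could list the mounts in a pod spec that depend on it. The spec contents (container name, mount names, paths) and the helper `mounts_using_subpath` are hypothetical and for illustration only.

```python
# Hypothetical pod spec fragment; names and paths are illustrative assumptions,
# not the actual Lagoon mariadb-galera configuration.
pod_spec = {
    "containers": [
        {
            "name": "mariadb-galera",
            "volumeMounts": [
                # Mounts only the "mysql" subdirectory of the "data" volume.
                {"name": "data", "mountPath": "/var/lib/mysql", "subPath": "mysql"},
                # Mounts the root of the "config" volume (no subPath).
                {"name": "config", "mountPath": "/etc/mysql/conf.d"},
            ],
        }
    ]
}


def mounts_using_subpath(spec):
    """Return (container name, mount name) pairs whose volumeMount sets subPath."""
    return [
        (container["name"], mount["name"])
        for container in spec.get("containers", [])
        for mount in container.get("volumeMounts", [])
        if mount.get("subPath")
    ]


print(mounts_using_subpath(pod_spec))
```

Any mount reported by such a check would have been affected by the regression described above.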

Because of this, all mariadb-galera clusters on the ch1.lagoon OpenShift cluster were no longer operational.

After an initial analysis of the problem, the maintenance and on-call engineers decided that a downgrade to the previous OpenShift version was not an option, as it would expose us to the CVE-2018-1002105 vulnerability. Instead, we decided to release a hotfix of Lagoon that removes the use of the “subPath” functionality in mariadb-galera clusters and includes an automated migration script: https://github.com/amazeeio/lagoon/pull/813/commits/3f4585af99ad0d00def75efb25aaeb86338d50fb
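
The actual change is in the commit linked above. Purely as an illustration of the idea of dropping “subPath” from a workload definition, and not a reproduction of the Lagoon hotfix or its migration script, a sketch could look like this. The helper `without_subpath` and the example spec are hypothetical, and any relocation of existing data on the volume is outside the scope of this sketch.

```python
import copy


def without_subpath(spec):
    """Return a deep copy of a pod spec with subPath removed from every volumeMount.

    Assumption: `spec` is a plain dict shaped like a Kubernetes pod spec.
    The input is left unmodified.
    """
    patched = copy.deepcopy(spec)
    for container in patched.get("containers", []):
        for mount in container.get("volumeMounts", []):
            # After this, the mount points at the volume root instead of a subdirectory.
            mount.pop("subPath", None)
    return patched


# Hypothetical example input; names and paths are illustrative only.
original = {
    "containers": [
        {
            "name": "mariadb-galera",
            "volumeMounts": [
                {"name": "data", "mountPath": "/var/lib/mysql", "subPath": "mysql"},
            ],
        }
    ]
}

patched = without_subpath(original)
print(patched["containers"][0]["volumeMounts"])
```

Returning a copy rather than mutating in place keeps the original definition available for comparison or rollback.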

After a deployment of the affected mariadb-galera clusters, they fully bootstrapped and were operational again.

We are in contact with Red Hat to find out how such a regression could be released in v3.9.57 of OpenShift, as there should be automated tests covering this functionality.

As soon as we have more information, we will update this postmortem.

Posted Dec 19, 2018 - 23:50 UTC

Resolved
This incident has been resolved.
Posted Dec 19, 2018 - 03:42 UTC
Update
A hotfix was rolled out for the regression and the cluster is now stabilizing.
Posted Dec 19, 2018 - 02:19 UTC
Update
A regression was discovered during maintenance. We are now working to patch the issue.
Posted Dec 19, 2018 - 01:49 UTC
Monitoring
Today's maintenance is running longer than usual.
Posted Dec 19, 2018 - 01:09 UTC