The investigation is still ongoing because it also involves the platform vendor, but we’re publishing the first learnings from the cluster-wide outage now.
At the time of the outage, a capacity extension of our cluster was also being rolled out. This task has been performed many times before without any interruption, and the new compute nodes only join the cluster at a very late stage in the process, so we saw no issue with moving it forward.
Because the root cause analysis is still ongoing, we’ve changed our operational rules for capacity extensions: