CH1 Outage
Incident Report for amazee.io
Postmortem

Date: 2018-02-23
Time: 12:07 CET - 13:22 CET (75min)

Summary

During a regular update of the CH1 OpenShift infrastructure, an unforeseen side effect that had not occurred on the test infrastructure disrupted internal traffic between the OpenShift nodes. This caused most websites running on this OpenShift to become unavailable or to load very slowly. During the analysis of the issue, the logs of the load balancers showed a warning which was first assumed to be the cause and was resolved. This did not solve the issue, and further analysis was necessary. The actual cause (a changed sysctl configuration) was then found and fixed, which restored full functionality of the affected websites.
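For illustration only: pinpointing a difference like this comes down to comparing the kernel parameters of an affected and an unaffected node. A minimal sketch in Python, assuming the output of `sysctl -a` has already been captured from one node of each kind into the two placeholder files named below (the file names and workflow are illustrative, not the exact steps taken during the incident):

# Minimal sketch: compare `sysctl -a` dumps captured on an affected and an
# unaffected node and print the kernel parameters whose values differ.
# The two file names are placeholders, not files from the actual incident.

def load_sysctl_dump(path):
    """Parse `sysctl -a` output ("key = value" per line) into a dict."""
    settings = {}
    with open(path) as f:
        for line in f:
            if "=" in line:
                key, _, value = line.partition("=")
                settings[key.strip()] = value.strip()
    return settings

affected = load_sysctl_dump("node-affected.sysctl.txt")      # placeholder file
unaffected = load_sysctl_dump("node-unaffected.sysctl.txt")  # placeholder file

for key in sorted(set(affected) | set(unaffected)):
    if affected.get(key) != unaffected.get(key):
        print(f"{key}: affected={affected.get(key)!r} unaffected={unaffected.get(key)!r}")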

Timeline

09:00 CET - Update to OpenShift infrastructure is rolled out on test infrastructure and successfully tested
12:01 CET - Update is rolled out to production infrastructure
12:07 CET - First notifications of monitoring showing inoperable websites
12:08 CET - Immediate stop of the rollout of infrastructure updates by on-duty system engineers to prevent further issues on other OpenShift infrastructures
12:15 CET - First analysis shows warnings on the Load Balancers; system engineers decide that this is the most probable cause and attempt to resolve the issue by removing malfunctioning OpenShift Services
12:25 CET - Removal of OpenShift Services does remove the warning, but does not solve the problem of inoperable websites
12:35 CET - Escalation of the issue to additional system engineers. Forming of a task force with 4 system engineers in chat. Further analysis of possible causes.
12:53 CET - Discovery of additional sysctl configurations on the affected OpenShift Nodes, while unaffected Nodes do not have these configurations
12:59 CET - After a group discussion, decision to hot-remove the additional configurations, which proves to solve the issue
13:09 CET - Decision to hot-remove these configurations on all affected OpenShift Nodes (a sketch of such a runtime reset follows this timeline)
13:22 CET - Monitoring shows the affected websites fully working again and all issues resolved
13:29 CET - Implementation of a persistent fix for the affected configurations and rollout to all OpenShift Nodes
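For illustration of the hot-removal step: resetting a kernel parameter at runtime amounts to writing the desired value back to the corresponding file under /proc/sys, before the persistent configuration files are cleaned up. A minimal sketch in Python; the parameter names and default values are placeholders, since the actual settings involved are not listed in this report, and the script must run as root on the affected node:

# Sketch of resetting sysctl parameters at runtime by writing to /proc/sys.
# The keys and values below are hypothetical placeholders; the real
# parameters changed during the incident are not part of this report.

OFFENDING_DEFAULTS = {
    "net.ipv4.ip_forward": "1",                  # placeholder key/value
    "net.bridge.bridge-nf-call-iptables": "1",   # placeholder key/value
}

def set_sysctl(key, value):
    """Write a value to /proc/sys, e.g. net.ipv4.ip_forward -> /proc/sys/net/ipv4/ip_forward."""
    path = "/proc/sys/" + key.replace(".", "/")
    with open(path, "w") as f:
        f.write(value + "\n")
    print(f"reset {key} -> {value}")

if __name__ == "__main__":
    for key, value in OFFENDING_DEFAULTS.items():
        set_sysctl(key, value)

A write to /proc/sys only changes the running kernel; the persistent fix at 13:29 CET is what keeps the correct values across reboots and configuration reloads.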

Actions & Mitigations

Issue: Rollout of changes outside of regular maintenance windows.
Status / Action Taken: In order to ensure proper services and performance across all time zones and infrastructures, we need to roll out small changes to the infrastructure outside of regular maintenance windows. With proper testing infrastructure and automated tests, we are able to reduce side effects. If such a test shows any impact on the production infrastructure, the change is scheduled for a maintenance window. In this specific case, the test infrastructure did not show such an impact.

Issue: The testing infrastructure did not show the same side effect as the production infrastructure.
Status / Action Taken: We run the exact same infrastructure for testing and production to reduce the chance of unforeseen side effects to an absolute minimum. Unfortunately, in this situation, the side effect was caused by a combination of human error and older legacy servers in the production infrastructure, which require sysctl changes that are harmful to the newer infrastructure. We are in the process of sunsetting these legacy servers, which should be completed in the next couple of months. In the meantime, we have also started the legacy servers on the testing infrastructure in order to have congruent testing systems (see the drift-check sketch below).
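One way to support this goal of congruent systems (purely illustrative, not a description of amazee.io's actual tooling) is a drift check that compares each node's live kernel parameters against a committed baseline, so a stray sysctl change shows up before it can disrupt traffic. A minimal sketch in Python, assuming a baseline file in sysctl.conf syntax whose name is a placeholder:

# Sketch of a sysctl drift check: compare live values under /proc/sys with a
# baseline file in "key = value" (sysctl.conf) syntax and exit non-zero on
# any mismatch, so it can feed a monitoring check or a rollout gate.
# The baseline file name is a placeholder.
import sys

BASELINE = "sysctl-baseline.conf"  # hypothetical baseline file

def normalize(value):
    # /proc/sys separates multi-value parameters with tabs; normalize whitespace
    return " ".join(value.split())

def read_live(key):
    path = "/proc/sys/" + key.replace(".", "/")
    try:
        with open(path) as f:
            return normalize(f.read())
    except OSError:
        return None  # parameter not present on this node

drift = []
with open(BASELINE) as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, expected = line.partition("=")
        key, expected = key.strip(), normalize(expected)
        live = read_live(key)
        if live != expected:
            drift.append((key, expected, live))

for key, expected, live in drift:
    print(f"DRIFT {key}: expected {expected!r}, live {live!r}")

sys.exit(1 if drift else 0)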
Posted Feb 28, 2018 - 14:49 UTC

Resolved
The issue has been resolved. We are conducting a thorough review and will send the affected customers a post mortem.
Posted Feb 23, 2018 - 14:44 UTC
Monitoring
All services are fully up and running again, we are actively monitoring the situation.
Posted Feb 23, 2018 - 12:23 UTC
Identified
We have identified the problem and are rolling out a fix.
Posted Feb 23, 2018 - 12:12 UTC
Investigating
We're seeing a widespread outage on our CH1 platform. We'll update the incident as new information becomes available.
Posted Feb 23, 2018 - 11:16 UTC