ch1-lagoon - Instabilities
Incident Report for amazee.io
Resolved
After we implemented additional mitigations at the infrastructure level last Friday, our monitoring has not detected any further instabilities. We therefore conclude that the issues have been resolved.
Posted Sep 17, 2018 - 08:46 UTC
Update
We continued monitoring the instability issues over the last 24 hours. The implemented fix resolved most of the instabilities, but unfortunately there are still some brief instabilities of 20 seconds every couple of hours.
Therefore we held another all-hands meeting with all involved parties (amazee.io, Hosting Partner, Infrastructure Partner) and implemented additional monitoring on the infrastructure.
This monitoring allowed us to identify which virtual machine is the root cause of the issue, and we are now investigating what exactly causes it on that machine.

We will update as soon as we know more or the instabilities are fully resolved.
Posted Sep 14, 2018 - 11:33 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 13, 2018 - 11:52 UTC
Update
During the course of the day we implemented a few changes on the infrastructure to further stabilize the situation. The connection issues ceased at around 16:30 CEST. We continue to monitor the situation closely.
Posted Sep 12, 2018 - 19:35 UTC
Update
Our engineers found some irregularities in the network stack today. We're restarting all machines during the maintenance window and will check with the infrastructure provider whether the issues are gone. The situation should have been more stable since late afternoon, when we started to implement another fix. We'll update this incident as soon as new information becomes available.

We also plan to call an all-hands meeting with all involved parties tomorrow morning, September 12 (CEST), to discuss the issue.
Posted Sep 11, 2018 - 21:10 UTC
Update
We're currently adding additional nodes to the cluster.
Posted Sep 07, 2018 - 13:01 UTC
Update
We implemented a fix and the situation looks stable now. We have started planning to add more resources to the cluster before the weekend. We will update the ticket as soon as new information becomes available.
Posted Sep 06, 2018 - 22:32 UTC
Update
We're currently investigating issues on ch1-lagoon. We'll update as soon as new information becomes available.
Posted Sep 06, 2018 - 20:56 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 05, 2018 - 15:32 UTC
Investigating
The fix implemented earlier didn't solve the issue. We're investigating further.
Posted Sep 05, 2018 - 11:14 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 05, 2018 - 09:46 UTC
Investigating
We're seeing services flapping on ch1-lagoon. Our engineers are looking into the issue.
Posted Sep 05, 2018 - 09:45 UTC