Severe Network outage for services hosted in the CH region

Incident Report for amazee.io

Postmortem

Below is the summary of our post-mortem report. A link to the full post-mortem report is at the end of the summary.

Summary

Due to networking issues at our infrastructure provider, we observed several outages on Friday 2019-11-22, from 06:48 UTC until services were fully back online at 13:00 UTC. Although the incident spanned several hours, the affected sites were not offline for the entire period, since our systems and the network recovered several times during the outage.

During those outages, external and internal network connectivity between the servers was lost several times. This led to additional issues with database and storage servers that required intervention from our engineers.

We will be following up with the customers who have already been in touch, and we are looking into the SLA considerations for each site individually.

Full Post Mortem Report: https://docs.google.com/document/d/13oicAVjZeUvye1O_h40iI-VhDlU7knnbmomn4PUyzaQ/edit

Posted Dec 04, 2019 - 18:37 UTC

Resolved

The infrastructure has been stable since Friday afternoon. We'll close the incident for now and will follow up with a post-mortem report. In the meantime, more information is available on the Cloudscale status page: https://cloudscale-status.net/incident/110

Posted Nov 25, 2019 - 08:52 UTC

Update

The infrastructure has been stable for the past 18 hours. We are continuing to monitor the infrastructure very closely together with our infrastructure provider.

Posted Nov 23, 2019 - 11:15 UTC

Update

The infrastructure has been stable for the last 3 hours without any issues. We are continuing to monitor the infrastructure very closely together with the infrastructure provider. We are also preparing a full post-mortem, which we expect to be ready by the middle of next week.

Posted Nov 22, 2019 - 16:59 UTC

Update

Storage and networking at Cloudscale are stable again. We remain cautious and are still working with our customers on disaster recovery procedures so that we can handle another outage.

Posted Nov 22, 2019 - 13:30 UTC

Monitoring

Storage and networking have been restored. We continue to work on disaster recovery procedures.

Posted Nov 22, 2019 - 12:05 UTC

Identified

The infrastructure provider is experiencing another outage and is working on restoring services. We continue to work on disaster recovery procedures.

Posted Nov 22, 2019 - 11:50 UTC

Monitoring

Our infrastructure provider Cloudscale has been able to restore all storage and networking connectivity. All sites and services of amazee.io are currently working again. We continue to work on the disaster recovery procedures we have started, which will ensure better stability if the infrastructure has another outage.

Posted Nov 22, 2019 - 11:36 UTC

Update

We're in close contact with Cloudscale and are evaluating options for disaster recovery. We will follow up with an update in about 1 hour.

Posted Nov 22, 2019 - 11:01 UTC

Update

We continue to investigate the situation. As far as we can see, we are currently not able to reach the network. We rely on a solution from Cloudscale to reestablish network connectivity. Please see their status page for additional updates: https://cloudscale-status.net/

Posted Nov 22, 2019 - 10:06 UTC

Update

Connectivity to the services and sites is severely impacted as the network is fully down at this time. We're continuing to work on the situation, but we rely on a solution from Cloudscale to reestablish network connectivity.

Posted Nov 22, 2019 - 09:47 UTC

Update

We're working closely with Cloudscale to look into the networking issues - please refer to https://cloudscale-status.net/ for more information about the network status.

Posted Nov 22, 2019 - 09:37 UTC

Update

We are continuing to investigate this issue.

Posted Nov 22, 2019 - 09:34 UTC

Update

We are continuing to investigate the issue. We're working closely with Cloudscale to look into the severe network connectivity issues we are observing.

Posted Nov 22, 2019 - 09:27 UTC

Investigating

We are currently investigating this issue.

Posted Nov 22, 2019 - 09:19 UTC

Monitoring

All sites and services are back - we are continuing to monitor the situation.

Posted Nov 22, 2019 - 08:31 UTC

Update

The situation seems to be resolving slowly - we're checking on the sites that still show issues in the monitoring.

Posted Nov 22, 2019 - 08:15 UTC

Identified

The outage was caused by a failure of the network infrastructure at our CH1 cloud provider (see https://cloudscale-status.net). We are currently working to restore cluster stability.
Thanks

Posted Nov 22, 2019 - 07:38 UTC

Investigating

We are currently investigating this issue.

Posted Nov 22, 2019 - 07:24 UTC

This incident affected: General (Lagoon API, Deployment Infrastructure, Lagoon Dashboard, Lagoon Logs (OpenSearch)).