ch1 DDoS Attack
Incident Report


Between 12:30 and 16:00 UTC on September 1, 2020, our upstream hosting provider, Cloudscale, suffered a DDoS attack that caused service outages for some of our customers in the ch1 region.

Cloudscale's postmortem for this event can be viewed here:

What Happened

Our Swiss hosting provider, Cloudscale, was the subject of a DDoS attack targeting a single IP, which belonged to one of our load balancers. This load balancer was a particularly attractive target because many of our customers in the region had DNS records pointing only to it, rather than to both of our load balancers in the ch1 region.

Sites were down for all customers in the region for the first 28 minutes of the attack. Customers using the CDN could still serve stale pages from their cached content, but authenticated traffic could not connect. Once inbound traffic to the affected IP was black-holed at the upstream provider, accessibility for the ch1 region was restored.

Customers with misconfigured DNS records pointing only at the affected load balancer IP continued to see downtime, since all traffic to that IP was being black-holed. Once the attack concluded and the black-holing was removed, availability was restored for all sites in the region.


The main reason this attack had such a large impact on our services was that many of our customers in the ch1 region had only ever added a single A record to their DNS configurations. Where possible, we have ensured our internal DNS CNAME records resolve to both load balancer IP addresses, and we have notified customers using only one IP to update their configurations to include both load balancer IPs.
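As an illustration, a correctly configured customer zone contains one A record per load balancer rather than a single record. The sketch below is a hypothetical BIND-style zone fragment; the hostname and addresses are placeholders (TEST-NET IPs), not our actual load balancer IPs:

```
; Hypothetical zone fragment: point the site at BOTH ch1 load balancers
www    300    IN    A    203.0.113.10   ; load balancer 1 (placeholder IP)
www    300    IN    A    203.0.113.20   ; load balancer 2 (placeholder IP)
```

With both records in place, resolvers receive both addresses, so clients can fall back to the second load balancer if traffic to the first is black-holed.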

The subset of affected customers who were paying for high availability were also migrated to the CDN to mitigate the impact of any future DDoS attacks on them. Our long-term plan is to move all customers to the CDN, but that is not possible in the short term.

While the attack was underway, traffic to the targeted IP was black-holed to stem the flood of inbound data. For the duration of the black-holing, this completely brought down any sites whose DNS records used only that single load balancer IP. Black-holing was a temporary measure and is not suitable for long-term mitigation of an attack of this magnitude.

Contributing Factors

The root cause of most of the site unavailability was the DDoS attack against our hosting provider, Cloudscale, combined with black-holing being the only DDoS protection available for many of our customers in the ch1 region. We are working with Cloudscale to implement and test a more sophisticated, infrastructure-level mitigation strategy for future DDoS attacks that does not require black-holing legitimate traffic.

Misconfigured customer DNS records also contributed to the issues and led to extended downtime for some customers. We have reached out to all affected customers with instructions on how to correct these records going forward.


All sites hosted in the ch1 region were down for the first 28 minutes of the attack, as Cloudscale's inbound network capacity was completely consumed by the attack. Customers already using the CDN would not have seen any downtime reported, since the CDN could serve stale pages, but authenticated traffic to those sites could not connect during this initial period.

Once the black-holing was in place, around 60 sites hosted in the ch1 region continued to experience degraded availability because their DNS records pointed only to the single load balancer IP that was the direct target of the attack. Service to these sites was restored when the black-holing was removed, after approximately three hours and fifteen minutes of downtime over the course of the attack.

Numerous employees from our team and from Cloudscale worked together to mitigate the effects of this attack.

What Went Well?

Communication between Cloudscale and our team, as well as between our team and affected customers, was superb. Everyone involved shared information as soon as it became available to them.

The level of cooperation between team members in mitigating this attack, identifying affected customers, and communicating with those customers was also commendable.

What Didn't Go So Well?

Identifying which customers were affected by this attack due to misconfigured DNS records took almost an entire hour; for future incidents, we should have this information readily available in advance.
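One way to have this information in advance is to audit customer DNS records periodically. The sketch below is hypothetical (the function name, data shapes, and example hostnames are placeholders, not part of any existing tooling): given already-resolved A records per hostname, it flags sites that point at only one of the region's load balancer IPs.

```python
def find_single_lb_sites(resolved_records, lb_ips):
    """Flag hostnames whose A records include exactly one of our LB IPs.

    resolved_records: dict mapping hostname -> list of resolved A-record IPs
    lb_ips: the set of load balancer IPs for the region
    """
    flagged = []
    for host, ips in resolved_records.items():
        matched = set(ips) & set(lb_ips)
        if len(matched) == 1:  # points at a single LB: vulnerable to this failure mode
            flagged.append(host)
    return sorted(flagged)

# Hypothetical example data (TEST-NET placeholder addresses)
LB_IPS = {"203.0.113.10", "203.0.113.20"}
records = {
    "shop.example.ch": ["203.0.113.10"],                  # single LB: flagged
    "news.example.ch": ["203.0.113.10", "203.0.113.20"],  # both LBs: fine
}
print(find_single_lb_sites(records, LB_IPS))  # ['shop.example.ch']
```

Running such a check on a schedule would turn the hour of manual triage into a pre-built list of at-risk customers.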

Customer communications went well, but our status page updates were sporadic throughout the event. Customers asked whether our Twitter account was still a good place to receive outage information, which indicated that our status page updates were not reaching that account. We should ensure that all future status page updates are also pushed to our Twitter account.

Action Items

  • Follow up with all customers who have misconfigured DNS records to ensure these records are corrected as soon as practical.
  • Continue the migrations to the CDN for our ch1 customers selected for migration.
  • Work with Cloudscale to ensure the mitigation system put into place is operable and effective at preventing these types of attacks in the future.

The full timeline of events can be viewed at this link:

Posted Sep 03, 2020 - 18:34 UTC

This incident has been resolved.
Posted Sep 03, 2020 - 17:12 UTC
In the last hour we have not seen any further issues with any sites.
Additionally, we now have infrastructure-level mitigations in place to protect us from future DDoS attacks. These mitigations can be enabled at a moment's notice. The 24/7 on-call systems engineering team is monitoring traffic and will react immediately if another DDoS attack is detected.
We plan to release a post-mortem within 24 hours giving a detailed overview of the DDoS attack.
Posted Sep 01, 2020 - 16:29 UTC
We are continuing to monitor for any further issues.
Posted Sep 01, 2020 - 15:07 UTC
We are still seeing intermittent connectivity issues and are continuing to evaluate possible resolutions.
Posted Sep 01, 2020 - 15:05 UTC
A fix has been implemented and we are monitoring the results.
Posted Sep 01, 2020 - 14:47 UTC
The DDoS attack stopped a couple of minutes ago and we have removed the black-holing for the time being, which makes all sites reachable again.
We strongly suspect that this was only a 2-hour warning attack and that a full attack would be sustained much longer.
In the meantime we have started implementing DDoS mitigation at the infrastructure level, which will allow us to handle a future DDoS attack and keep sites reachable. We expect this mitigation to be in place within the next couple of hours.
Posted Sep 01, 2020 - 14:47 UTC
Unfortunately the DDoS attack against our infrastructure is still ongoing, and we are working on strategies to mitigate the issue.
Posted Sep 01, 2020 - 13:56 UTC
We identified the outage as a DDoS attack by a malicious attacker group. As the attack is still ongoing, we have black-holed traffic to our servers in order to protect the infrastructure. As soon as the attack is over (expected within the next few minutes) we will allow traffic again and define a strategy for moving forward, as a second attack may follow.
Posted Sep 01, 2020 - 13:27 UTC
Our hosting partner has identified the issue as being connected to unusual traffic patterns:
We are working with the hosting partner to fully restore our services.
Posted Sep 01, 2020 - 12:54 UTC
We are continuing to investigate this issue.
Posted Sep 01, 2020 - 12:42 UTC
We are seeing websites becoming unreachable on the ch1 cluster; we are investigating.
Posted Sep 01, 2020 - 12:37 UTC
This incident affected: General (API, Deployment Infrastructure, Lagoon Dashboard, Lagoon Logs (Kibana)) and Switzerland (zh1.compact, zh2.compact, ch1.lagoon).