Between 1230 UTC and 1600 UTC on September 1st, 2020, our upstream hosting provider, Cloudscale, suffered a DDoS attack that caused service outages for some of our customers in the ch1 region.
Cloudscale's postmortem for this event can be viewed here: https://cloudscale-status.net/incident/146
Our Swiss hosting provider, Cloudscale, was the target of a DDoS attack directed at a single IP address, which belonged to one of our load balancers. This load balancer made a particularly effective target because many of our customers in this region had DNS records pointing only to it, rather than to both of our load balancers in the ch1 region.
Sites were down for all customers in the region for the first 28 minutes of the attack. Our customers utilizing the amazee.io CDN were still able to serve stale pages from their cached content, but any authenticated traffic would not have been able to connect. Once inbound traffic to the affected IP was blackholed at the upstream provider, accessibility for the ch1 region was restored.
Any customers with misconfigured DNS records pointing at only the single affected load balancer IP continued to see downtime, as all traffic to that IP was blackholed. Once the attack concluded and the blackhole was removed from the affected IP address, availability for all sites in the region was restored.
The main reason this attack had such a large impact on our services was that many of our customers in the ch1 region had only ever added a single A record to their DNS configurations. Where possible, we have ensured that the hostnames behind our internal DNS CNAME records resolve to both load balancer IP addresses, and we have notified customers using only one IP to update their configuration to include both.
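As an illustrative sketch (the hostname and IPs below are documentation placeholders, not our real addresses), a resilient configuration publishes one A record per load balancer, so that a site remains reachable if one IP is blackholed:

```
; Hypothetical zone file excerpt -- all names and IPs are placeholders
www.example.com.    300    IN    A    192.0.2.10    ; ch1 load balancer 1
www.example.com.    300    IN    A    192.0.2.11    ; ch1 load balancer 2
```

With both records in place, resolvers receive both addresses and clients can fall back to the surviving IP.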
The subset of affected customers paying for high availability were also offered a migration to the amazee.io CDN to mitigate the impact of any future DDoS attacks. Our long-term plan is to move all customers to the amazee.io CDN, but that is not possible in the short term.
While the attack was underway, a traffic blackhole was put in place to stem the enormous volume of inbound data. For the duration of the blackhole, this completely brought down any sites whose DNS records used only the single targeted load balancer IP. Blackholing is a temporary measure and is not suitable for long-term mitigation of an attack of this magnitude.
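For illustration only (the actual blackholing was applied in Cloudscale's network, not on our hosts, and the IP below is a documentation placeholder), null-routing an address with the Linux iproute2 tooling looks like this:

```shell
# Drop all traffic routed to the attacked IP (placeholder address).
# Requires root; the provider applied the equivalent at their network edge.
ip route add blackhole 198.51.100.7/32

# Inspect the route, then remove it once the attack subsides
ip route show 198.51.100.7/32
ip route del blackhole 198.51.100.7/32
```

The trade-off is exactly what we experienced: the blackhole discards attack traffic and legitimate traffic alike, which is why anything pointing only at that IP goes dark.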
The root cause of most of the site unavailability was the DDoS attack against our hosting provider, Cloudscale, for which blackholing was the only available DDoS protection for many of our customers in the ch1 region. We are working with Cloudscale to implement and test a more sophisticated mitigation strategy for future DDoS attacks at the infrastructure level, one that does not require us to blackhole legitimate traffic.
Misconfigured customer DNS records also contributed to the issues and led to extended downtime for some customers. We have reached out to all affected customers with instructions on how to rectify these issues moving forward.
All sites hosted in the ch1 region were down for the first 28 minutes of the attack, as Cloudscale's inbound network capacity was completely consumed by the attack. Customers already utilizing the amazee.io CDN would not have seen any downtime reported, as the CDN could serve stale pages, but authenticated traffic to those sites would not have been able to connect during this initial period.
Once the blackhole was put into place, around 60 sites hosted in the ch1 region continued to experience degraded availability because their DNS records pointed only to the single load balancer IP address that was the direct target of the attack. Service for these sites was restored when the blackhole was removed from the load balancer IP; in total, they experienced approximately three hours and fifteen minutes of downtime over the course of the DDoS attack.
Numerous amazee.io employees and Cloudscale employees worked together to mitigate the effects of this attack.
Communications between Cloudscale and amazee.io as well as between amazee.io and affected customers were superb. Everyone involved disseminated information freely as soon as it became available to them.
The level of cooperation between amazee.io team members in mitigating this attack, identifying affected customers, and communicating with those customers was also commendable.
Identifying which customers were affected by this attack due to misconfigured DNS records took almost an entire hour; in the event of future attacks, we should have this information readily available, for example by regularly auditing customer DNS records against our load balancer IPs.
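Such an audit could be sketched as follows. This is a hypothetical helper, not our actual tooling; the load balancer IPs are documentation placeholders, and obtaining each hostname's current A records (via a resolver) is left to the caller:

```python
# Hypothetical sketch of a DNS audit: flag customer hostnames whose
# resolved A records do not cover both ch1 load balancer IPs.
# The load balancer IPs below are documentation placeholders.

CH1_LB_IPS = {"192.0.2.10", "192.0.2.11"}

def audit_dns(resolved: dict[str, set[str]],
              lb_ips: set[str] = CH1_LB_IPS) -> list[str]:
    """Return hostnames that point at only a subset of the LB IPs.

    `resolved` maps each customer hostname to the set of A-record IPs
    it currently resolves to (gathered elsewhere, e.g. via a resolver).
    """
    misconfigured = []
    for hostname, ips in resolved.items():
        pointed_at_lbs = ips & lb_ips
        # An A-record set that hits some, but not all, LB IPs is fragile:
        # blackholing that one IP takes the whole site down.
        if pointed_at_lbs and pointed_at_lbs != lb_ips:
            misconfigured.append(hostname)
    return sorted(misconfigured)

if __name__ == "__main__":
    records = {
        "good.example": {"192.0.2.10", "192.0.2.11"},  # both LBs: OK
        "bad.example": {"192.0.2.10"},                 # single LB: fragile
        "cdn.example": {"203.0.113.9"},                # not on our LBs
    }
    print(audit_dns(records))  # -> ['bad.example']
```

Keeping a check like this in a scheduled job would turn the hour of manual identification into a lookup.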
Customer communications went well, but our status page updates were somewhat sporadic throughout the event. Customers asked whether our Twitter account was still a good place to receive information on outages, which indicates that our status page updates were not making it to that account. We should ensure that all future status page updates are also pushed to our Twitter account.
The full timeline of events can be viewed at this link: https://drive.google.com/file/d/1FLb3wmSezxZj2GfXNNToX01G5A08ttOf/view?usp=sharing