Postmortem on the Paperspace outage of November 10, 2020, and the lessons we learned.
On November 10th, 2020, we experienced a major service outage affecting multiple Paperspace products. The outage was caused by a malicious attack on our systems. The attack was ultimately thwarted, and no customer data was compromised, deleted, or exposed.
While responding to the attack, we took portions of our services offline, including a primary service that performs key lifecycle operations (e.g., create, start, and stop events). As a result, many of our products were mostly, and in some cases completely, unavailable for several hours.
Investigating ongoing issues:
In the early hours of November 10th, we were inundated with alerts as multiple services experienced unprecedented load and began timing out. We first reported this publicly on our status page at 09:27 EST.
Tracing the source of the problem:
We deduced that we were under a DDoS attack due to traffic saturation on our border networks, which is consistent with this type of attack. Shortly after, we traced the issue to a set of auxiliary servers (servers not directly used by customers) that appeared to have been compromised by the attacker. In response, we terminated and re-deployed these servers. Additionally, we brought down other services to isolate portions of our network during the attack.
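To make the terminate-and-redeploy step concrete, here is a minimal sketch. Everything in it (FleetAPI, quarantine_and_redeploy, the server IDs) is a hypothetical stand-in for illustration, not Paperspace's actual tooling.

```python
# Hypothetical sketch of the terminate-and-redeploy response. FleetAPI and its
# methods are invented stand-ins for an internal fleet-management API; they do
# not reflect Paperspace's actual infrastructure tooling.
import uuid


class FleetAPI:
    """In-memory stand-in for an internal fleet-management API."""

    def __init__(self, server_ids: set[str]) -> None:
        self.servers = set(server_ids)

    def terminate(self, server_id: str) -> None:
        # Remove the compromised host from the fleet.
        self.servers.discard(server_id)

    def deploy(self) -> str:
        # Provision a clean replacement from a known-good image.
        new_id = f"aux-{uuid.uuid4().hex[:8]}"
        self.servers.add(new_id)
        return new_id


def quarantine_and_redeploy(fleet: FleetAPI, compromised: set[str]) -> list[str]:
    """Terminate each compromised auxiliary server and deploy a clean one."""
    replacements = []
    for server_id in compromised & fleet.servers:
        fleet.terminate(server_id)
        replacements.append(fleet.deploy())
    return replacements


if __name__ == "__main__":
    fleet = FleetAPI({"aux-1", "aux-2", "aux-3"})
    print(quarantine_and_redeploy(fleet, compromised={"aux-2"}))
```

The key property of this pattern is that compromised hosts are destroyed rather than patched in place, so any foothold the attacker gained is discarded along with them.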
Bringing services back online:
Once we verified that the DDoS attack had been mitigated, we began bringing services back online. Although some services were restored quickly, it took longer than anticipated to restore full functionality. This is an area that can be improved upon (see below).
Impact of attack on Paperspace customers:
Several of our services were down for over 12 hours, which is not acceptable. Many of our customers run production workloads, and while no service offers 100% uptime, we know we can do better.
Commitment to improve recovery procedures:
Our main focus is to redesign our disaster response plan and to deprecate some backend services that made it difficult to quickly locate the root cause. Our systems are well documented in the engineering runbooks used by our on-call SRE team during incidents, but confusion around one specific service caused a significant delay. We plan to improve this documentation and process.

Additionally, out of an abundance of caution, we preemptively brought down some of our own infrastructure to limit the scope of the attack and isolate against it. It took too long to bring everything back online (while ensuring integrity, i.e., no data loss) after we were in the clear. For example, some workloads entered a hung state when they could not communicate with certain backend services, and we had to clean these up manually (see the sketch below).

By further isolating services at the network layer, we can limit the attack surface and avoid having to take entire services down in the future.
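As an illustration of the hung-workload failure mode, here is a minimal fail-fast sketch, assuming a hypothetical backend health endpoint (BACKEND_HEALTH_URL and backend_is_reachable are invented for this example). The point is that an explicit timeout lets a workload surface an error instead of blocking indefinitely when a backend service becomes unreachable.

```python
# Minimal fail-fast sketch: check a backend dependency with an explicit timeout
# instead of blocking indefinitely. The endpoint below is hypothetical; a real
# workload would use its actual backend address and a retry/backoff policy.
import urllib.request

BACKEND_HEALTH_URL = "http://backend.internal:8080/healthz"  # hypothetical


def backend_is_reachable(timeout_seconds: float = 2.0) -> bool:
    """Return True if the backend answers within the timeout, False otherwise.

    Without an explicit timeout, a dead or network-isolated backend can leave
    the caller blocked indefinitely: the "hung state" described above.
    """
    try:
        with urllib.request.urlopen(BACKEND_HEALTH_URL, timeout=timeout_seconds) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, connection resets, and timeouts
        return False


if __name__ == "__main__":
    if not backend_is_reachable():
        # Fail fast with a visible error rather than entering a hung state
        # that requires manual cleanup.
        raise SystemExit("backend unreachable; aborting instead of hanging")
```

Workloads that fail fast like this can be retried automatically once services are restored, rather than requiring manual cleanup afterward.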