On March 24, 2016, DigitalOcean experienced a DNS outage that left its DNS infrastructure largely unavailable for two hours and four minutes. During that window, the company was able to answer only a fraction of the DNS queries directed at its servers. "We know how much you rely on DigitalOcean, and we take the availability and reliability of our services very seriously," stated a DigitalOcean spokesperson, reaffirming the company's commitment to its users.
"We know how much you rely on DigitalOcean, and we take the availability and reliability of our services very seriously,"
The issue began at 2:34 PM UTC, when alerts indicated that all resolvers had stopped responding to DNS queries. Initial investigation revealed an unprecedented surge in query volume. "We noticed that the resolvers were receiving orders of magnitude more queries than normal," explained the spokesperson. Although the resolvers had sufficient capacity for normal peak operation, the flood of requests overwhelmed them.
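DigitalOcean has not published the detection logic behind those alerts. As a rough sketch only, anomaly detection of this kind can be as simple as comparing the current query rate against a rolling baseline; the class name, window size, and threshold below are all hypothetical.

```python
from collections import deque

class QueryRateAlert:
    """Flag resolvers whose query rate far exceeds a rolling baseline."""

    def __init__(self, window=60, multiplier=10.0):
        self.window = window          # number of recent samples to keep
        self.multiplier = multiplier  # how far above baseline counts as anomalous
        self.samples = deque(maxlen=window)

    def observe(self, queries_per_second):
        """Record a new sample and return True if it looks anomalous."""
        if len(self.samples) == self.window:
            baseline = sum(self.samples) / len(self.samples)
            if queries_per_second > baseline * self.multiplier:
                self.samples.append(queries_per_second)
                return True
        self.samples.append(queries_per_second)
        return False


# Example: steady traffic, then a sudden surge of the kind described above.
alert = QueryRateAlert(window=5, multiplier=10.0)
for qps in [1000, 1100, 950, 1050, 1000, 50000]:
    if alert.observe(qps):
        print(f"ALERT: {qps} qps is far above the recent baseline")
```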
"We noticed that the resolvers were receiving orders of magnitude more queries than normal,"
As the situation unfolded, DigitalOcean’s dedicated DDoS mitigation partner was engaged to analyze the abnormal traffic patterns. According to the spokesperson, "All of our DNS traffic flows through their network, which has numerous protections in place to both identify and mitigate attacks." However, neither party could initially identify the nature of the traffic beyond its sheer volume.

The company's technical team discovered that the DNS daemon was configured to clear its queue of unanswered queries, which led to unintended cache invalidation and complicated recovery efforts. "Our DNS daemon was configured to empty the queue... which inadvertently caused cache invalidation," the spokesperson explained. The team deployed a new configuration to correct this, but the resolvers still struggled to rebuild their caches under the excessive query load.
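The post does not name the DNS daemon in use or the specific setting involved. Purely to illustrate the failure mode described above, the toy model below shows why a resolver whose pending lookups are discarded never warms its cache, while one that lets lookups complete quickly reaches a high hit rate; every name and number here is invented.

```python
import random

def simulate(flush_pending, queries=10000, distinct_names=500):
    """Toy model: a resolver cache only warms if pending lookups complete."""
    cache = set()      # names with a cached answer
    pending = set()    # names waiting on an upstream answer
    hits = 0

    for _ in range(queries):
        name = f"host{random.randrange(distinct_names)}.example.com"
        if name in cache:
            hits += 1
            continue
        pending.add(name)
        if flush_pending:
            # Daemon drops unanswered queries: the upstream answer is never
            # recorded, so the cache stays cold and every query is a miss.
            pending.clear()
        else:
            # Upstream answer arrives and is cached for subsequent queries.
            cache.update(pending)
            pending.clear()

    return hits / queries

print("hit rate, flushing pending queue:", simulate(True))
print("hit rate, letting lookups finish:", simulate(False))
```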
After further review, the team identified a notable increase in requests for PTR records within the excess traffic. To relieve the strain, DigitalOcean began blocking these queries outright and filtering traffic from certain autonomous system numbers (ASNs), particularly those generating the most requests. "As we looked through the traffic to find patterns, it became clear that the attacker knew a large number of domains managed with our DNS infrastructure," said the spokesperson, underscoring the targeted nature of the attack.
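The exact mitigation rules were not disclosed. Conceptually, though, the filtering amounts to a simple predicate applied in front of the resolvers; the sketch below uses reserved documentation AS numbers and hypothetical values purely for illustration.

```python
# Hypothetical filter applied in front of the resolvers: drop PTR lookups
# and traffic from source networks (ASNs) generating the bulk of the flood.
BLOCKED_ASNS = {64496, 64511}   # placeholder AS numbers from the documentation range

def should_drop(query_type: str, source_asn: int) -> bool:
    """Return True if this query should be rejected during mitigation."""
    if query_type == "PTR":
        return True
    if source_asn in BLOCKED_ASNS:
        return True
    return False

# Example decisions:
print(should_drop("PTR", 64500))   # True: PTR lookups blocked outright
print(should_drop("A", 64496))     # True: source ASN on the block list
print(should_drop("A", 64500))     # False: ordinary traffic passes through
```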
"As we looked through the traffic to find patterns, it became clear that the attacker knew a large number of domains managed with our DNS infrastructure,"
In response, the team raised the Time to Live (TTL) for cached DNS records, allowing edge caches to retain responses longer, thus minimizing unnecessary pressure on origin resolvers. By 4:40 PM UTC, the service began to show signs of recovery, responding to queries with normal latency. "Caches began to repopulate and query volume returned to normal levels," they noted, reflecting on the successful measures taken.
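The relationship between TTL and origin load is straightforward to estimate: an edge cache needs to refresh each name roughly once per TTL, so the refresh traffic reaching the origin resolvers scales inversely with the TTL. A back-of-the-envelope calculation with invented figures:

```python
def origin_queries_per_second(distinct_names, ttl_seconds):
    """Rough upper bound: each cached name is refreshed about once per TTL."""
    return distinct_names / ttl_seconds

# Raising the TTL from 30s to 300s cuts refresh traffic to the origin
# resolvers by roughly 10x for the same set of names (illustrative numbers).
for ttl in (30, 300):
    qps = origin_queries_per_second(distinct_names=100_000, ttl_seconds=ttl)
    print(f"TTL {ttl:>3}s -> ~{qps:,.0f} origin queries/sec")
```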
"Caches began to repopulate and query volume returned to normal levels,"
By 5:30 PM UTC, the majority of traffic reaching the resolvers was legitimate, and the company shared metrics reinforcing the speed of recovery, including a rising cache hit rate as the system normalized. DigitalOcean's quick response highlights both its commitment to reliability and its proactive approach to networking challenges.
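DigitalOcean did not publish the underlying numbers. For reference, the cache hit rate described here is simply the share of queries answered from cache rather than resolved from scratch, as in this illustrative calculation with made-up figures:

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of queries answered from cache rather than by a fresh lookup."""
    total = hits + misses
    return hits / total if total else 0.0

# Illustrative figures only: as caches repopulated, more queries were
# answered locally and fewer had to be resolved from scratch.
print(f"early in recovery:  {cache_hit_rate(2_000, 8_000):.0%}")   # 20%
print(f"after caches warmed: {cache_hit_rate(9_500, 500):.0%}")    # 95%
```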

Ultimately, the incident serves as a reminder of the vulnerabilities cloud services face when subjected to sustained, high-volume attack traffic. DigitalOcean's collaboration with its DDoS mitigation partner underscored the importance of swift action and of resilient DNS infrastructure.


