In October, we experienced one incident that resulted in degraded performance across GitHub services.
October 11 05:59 UTC (lasting 19 hours and 12 minutes)
On October 11, 2024, starting at 05:59 UTC, the DNS infrastructure in one of our sites started failing to resolve lookups following a database migration. Attempts to recover the database led to cascading failures that impacted the DNS systems for that site. While the team worked to restore the infrastructure, the first customer impact started around 17:31 UTC.
The impact of the incident was broad. 4% of Copilot users saw degradation in IDE code completions, while 25% of Actions workflow users experienced delays greater than 5 minutes. 100% of code search requests failed for a ~4-hour window.
Starting at 18:05 UTC, we attempted to resolve the issue by repointing the degraded DNS site to a different site. This mitigation restored connectivity within the degraded site, but it caused connectivity issues from healthy sites back to the degraded site, so we started planning a different remediation effort.
At 20:52 UTC, the team finalized a remediation plan and began the next phase of mitigation by deploying temporary DNS resolution capabilities to the degraded site. At 21:46 UTC, DNS resolution in the degraded site began to recover and was fully healthy at 22:16 UTC. Lingering issues with code search were resolved at 01:11 UTC on October 12.
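The timestamps above can be cross-checked with a little date arithmetic. The following is a minimal sketch, not part of any GitHub tooling: the event labels and the `utc` and `elapsed` helpers are hypothetical names chosen for illustration, and the only data used are the UTC times stated in this report.

```python
from datetime import datetime, timezone


def utc(day, hhmm):
    """Build a UTC timestamp in October 2024 from a day and an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 10, day, hour, minute, tzinfo=timezone.utc)


# Event labels are illustrative; the times are the ones quoted in the report.
TIMELINE = {
    "dns_failures_begin": utc(11, "05:59"),
    "first_customer_impact": utc(11, "17:31"),
    "repoint_attempt": utc(11, "18:05"),
    "temporary_dns_deploy": utc(11, "20:52"),
    "dns_begins_recovering": utc(11, "21:46"),
    "dns_fully_healthy": utc(11, "22:16"),
    "code_search_resolved": utc(12, "01:11"),
}


def elapsed(start, end):
    """Return (hours, minutes) between two named events."""
    minutes = int((TIMELINE[end] - TIMELINE[start]).total_seconds() // 60)
    return divmod(minutes, 60)


print(elapsed("dns_failures_begin", "code_search_resolved"))     # (19, 12)
print(elapsed("first_customer_impact", "code_search_resolved"))  # (7, 40)
print(elapsed("dns_begins_recovering", "dns_fully_healthy"))     # (0, 30)
```

The first result matches the stated incident duration of 19 hours and 12 minutes. The second shows that customer-facing impact, measured from the first reported impact to the code search fix, spanned 7 hours and 40 minutes.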
The team also continued to restore the original functionality within the site after public service functionality was restored. GitHub is working to harden our resiliency and automation processes around this infrastructure so we can diagnose and resolve issues like this faster in the future.
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we're working on, check out the GitHub Engineering Blog.