In September, we experienced three incidents that resulted in degraded performance across GitHub services.
September 16 21:11 UTC (lasting 57 minutes)
On September 16, 2024, between 21:11 UTC and 22:08 UTC, GitHub Actions and GitHub Pages services were degraded. Customers who deploy Pages from a source branch experienced delayed runs. We determined the root cause to be a misconfiguration in the service that manages runner connections, which led to CPU throttling and performance degradation in that service. Actions jobs were delayed by an average of 23 minutes, with some jobs delayed by as much as 45 minutes. Over the course of the incident, 17% of runs were delayed by more than five minutes; at peak, as many as 80% of runs were delayed by more than five minutes.
We mitigated the incident by diverting runner connections away from the misconfigured nodes, starting at 21:16 UTC. In addition to correcting the configuration issue, we have improved our monitoring to reduce the risk of recurrence and to shorten our time to automated detection and mitigation of similar issues.
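As a general illustration of the kind of signal that can surface CPU throttling early, the sketch below reads cgroup v2 CPU statistics and reports how often a workload hit its quota. The file path and output format are assumptions for the example; this is not a description of our monitoring code.

```go
// Minimal sketch: detect CPU throttling from cgroup v2 statistics.
// Assumption (illustrative only): the process runs in a container on a
// cgroup v2 host, so its CPU quota stats are exposed at /sys/fs/cgroup/cpu.stat.
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.stat")
	if err != nil {
		log.Fatalf("reading cpu.stat: %v", err)
	}

	// Parse "key value" lines such as nr_periods, nr_throttled, throttled_usec.
	stats := map[string]uint64{}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}

	periods, throttled := stats["nr_periods"], stats["nr_throttled"]
	if periods == 0 {
		fmt.Println("no CPU quota enforcement periods recorded")
		return
	}

	// Fraction of scheduling periods in which the cgroup hit its CPU quota.
	ratio := float64(throttled) / float64(periods)
	fmt.Printf("throttled in %.1f%% of periods (%d of %d), %d µs total\n",
		ratio*100, throttled, periods, stats["throttled_usec"])
}
```

A ratio that climbs after a deployment or configuration change is the sort of leading indicator that automated detection can alert on before job delays become customer-visible.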
September 24 08:20 UTC (lasting 44 minutes)
On September 24, 2024, from 08:20 UTC to 09:04 UTC, the GitHub Codespaces service experienced a network connectivity interruption, resulting in an error rate of approximately 25% for the duration of the outage. We traced the cause to Source Network Address Translation (SNAT) port exhaustion following a deployment, which caused individual codespaces to lose their connection to the service. To mitigate the impact, we increased port allocations to provide enough buffer for the surge in outbound connections shortly after deployments. We will be scaling up our outbound connectivity in the near future, as well as adding improved monitoring of network capacity to prevent future regressions.
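For readers unfamiliar with SNAT port exhaustion: each new outbound connection through a NAT gateway temporarily consumes a source port, so a burst of short-lived connections can exhaust the allocated range. The sketch below shows one general way a Go service can bound that pressure by pooling and reusing outbound connections; the client limits are illustrative assumptions, not a description of the actual fix.

```go
// Minimal sketch: share one HTTP client so keep-alive connections are pooled
// instead of opening (and consuming a SNAT port for) a new TCP connection per
// request. The limits below are illustrative, not production values.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

var client = &http.Client{
	Timeout: 10 * time.Second,
	Transport: &http.Transport{
		MaxIdleConns:        100, // pooled keep-alive connections across all hosts
		MaxIdleConnsPerHost: 20,  // reusable connections per backend
		MaxConnsPerHost:     50,  // hard cap on concurrent connections per backend
		IdleConnTimeout:     90 * time.Second,
	},
}

func fetch(url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the connection is returned to the pool and reused.
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return err
	}
	fmt.Println(url, resp.Status)
	return nil
}

func main() {
	// Reusing `client` across calls keeps outbound port consumption bounded.
	for i := 0; i < 5; i++ {
		if err := fetch("https://example.com/"); err != nil {
			fmt.Println("request failed:", err)
		}
	}
}
```

Bounding connection counts on the client side complements larger port allocations on the gateway side, since either alone can be overwhelmed by a post-deployment reconnection spike.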
September 30 10:43 UTC (lasting 43 minutes)
On September 30, 2024, from 10:43 UTC to 11:26 UTC, GitHub Codespaces customers in the Central India region were unable to create new codespaces. Resumes were not impacted, and there was no impact to customers in other regions. We traced the cause to storage capacity constraints in the region and mitigated the incident by temporarily redirecting create requests to other regions. We then added storage capacity to the region and routed traffic back. We also identified a bug that prevented some available capacity from being used, artificially constraining capacity and halting creations in the region prematurely. We have since fixed this bug so that available capacity scales as expected with our capacity planning projections.
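For illustration, the sketch below shows one simple shape a region fallback for create requests can take. The interface, region names, and routing policy are hypothetical and greatly simplified; they are not the Codespaces routing logic.

```go
// Minimal sketch: route a create request to a fallback region when the
// preferred region reports no available capacity. All names are hypothetical.
package main

import (
	"errors"
	"fmt"
)

// CapacityChecker is a hypothetical interface reporting whether a region
// can currently accept new creations.
type CapacityChecker interface {
	HasCapacity(region string) bool
}

// pickRegion returns the preferred region if it has capacity, otherwise the
// first fallback region that does.
func pickRegion(preferred string, fallbacks []string, c CapacityChecker) (string, error) {
	if c.HasCapacity(preferred) {
		return preferred, nil
	}
	for _, r := range fallbacks {
		if c.HasCapacity(r) {
			return r, nil
		}
	}
	return "", errors.New("no region has available capacity")
}

// staticChecker is a toy implementation for the example.
type staticChecker map[string]bool

func (s staticChecker) HasCapacity(region string) bool { return s[region] }

func main() {
	checker := staticChecker{"central-india": false, "southeast-asia": true}
	region, err := pickRegion("central-india", []string{"southeast-asia", "east-us"}, checker)
	if err != nil {
		fmt.Println("creation blocked:", err)
		return
	}
	fmt.Println("routing create request to", region) // prints: southeast-asia
}
```

The capacity-accounting bug mattered because a check like `HasCapacity` is only as good as the numbers behind it: if usable capacity is under-reported, creations halt even though hardware is still available.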
Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we're working on, check out the GitHub Engineering Blog.