Bitbucket Cloud website outage
Incident Report for Atlassian Bitbucket
Postmortem

CONTEXT

Atlassian Cloud products and applications, the customer experiences they provide, and the supporting services and Atlassian’s own internal tooling all rely on compute and data workloads deployed in Amazon Web Services (AWS) Virtual Private Clouds (VPCs).

Critical to the function of these applications is network communication between them, as well as with the Internet and AWS-managed services. This communication depends on DNS resolution: applications query a DNS server to translate service names into network addresses they can connect to.
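For illustration, the sketch below shows the kind of lookup every connection depends on. It is a minimal Python example; the service name is hypothetical, and the resolver used is whatever the host is configured with (for the affected services, the centralized EC2-based DNS servers described below).

    import socket

    # Hypothetical internal service name, for illustration only.
    SERVICE_NAME = "internal-service.example.internal"

    # getaddrinfo() asks the host's configured DNS resolver to translate the
    # service name into network addresses, then returns the candidates the
    # application can connect to on port 443.
    for family, _socktype, _proto, _canon, sockaddr in socket.getaddrinfo(
        SERVICE_NAME, 443, proto=socket.IPPROTO_TCP
    ):
        print(family, sockaddr)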

To facilitate DNS resolution across multiple accounts and VPCs, Atlassian operates centralized EC2 (Elastic Compute Cloud) DNS servers in each AWS region we serve customers from.

Atlassian is currently migrating from this centralized infrastructure to a more resilient distributed solution that leverages the AWS-managed VPC resolver in a Shared VPC architecture. However, a number of internal applications and services still rely on the EC2-based solution.

The EC2-based DNS servers use security groups, which are subject to connection tracking allowance limits enforced by AWS.

INCIDENT SUMMARY

On November 22, 2022, between 19:39 and 21:28 UTC, some Jira, Confluence, and Bitbucket customers experienced varying levels of degradation across our products and services, including partner apps and integrations. The incident was detected within nine minutes by an automated monitoring system and resolved within one hour and 49 minutes.

The event was triggered when Atlassian’s DNS infrastructure in the AWS us-east-1 region reached a network connection tracking allowance limit. Although the fault was localized to one region, the impact was global for customers whose data resides in us-east-1, as well as for Atlassian products that have unique dependencies on that region.

The issue was initially mitigated by scaling up the DNS infrastructure, which increased the connection tracking allowance and returned Atlassian products to a healthy state. The underlying issue has since been resolved by a configuration change and cannot recur.

The incident was not caused by an attack and it was not a security issue. Atlassian customer data remains secure.

We deeply value the trust placed in Atlassian and apologize to customers who were affected by the event.

IMPACT

The impact occurred on November 22, 2022, between 19:39 and 21:28 UTC, for a total of one hour and 49 minutes. Some Jira, Confluence, and Bitbucket customers experienced service degradation ranging from intermittent access to a complete outage. Customers experienced error pages or very slow responses in applications and browsers, as well as 5xx response codes from our APIs.

ROOT CAUSE

The incident occurred as connections through Atlassian’s DNS infrastructure reached a new daily peak due to steady traffic growth. As a result, the limit on the number of simultaneous network connections tracked by an AWS EC2 security group was reached. Once this limit was hit, DNS packets were dropped, services were unable to resolve network addresses, and applications began to fail. Services retried their DNS queries upon receiving a SERVFAIL response or a query timeout, which created even more connections and compounded the problem.
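To illustrate the amplification effect (a hedged sketch, not Atlassian’s actual client code): a client that retries immediately adds one extra query per failure to a resolver that is already dropping packets, whereas exponential backoff with jitter spaces retries out and caps the extra traffic.

    import random
    import time

    def resolve_with_backoff(query, send_query, max_attempts=4, base_delay=0.2):
        """Retry a DNS query with exponential backoff and jitter.

        `send_query` is a hypothetical transport callable that raises
        TimeoutError or ConnectionError when the resolver fails to answer.
        """
        for attempt in range(max_attempts):
            try:
                return send_query(query)
            except (TimeoutError, ConnectionError):
                if attempt == max_attempts - 1:
                    raise
                # Sleep 0.2s, 0.4s, 0.8s, ... plus jitter so that clients do
                # not all retry in lockstep against a saturated resolver.
                time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))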

During the incident, Atlassian’s DNS infrastructure in the AWS us-east-1 region was unable to service up to 90% of DNS resolution requests.

We were not aware of how close our infrastructure was to the security group connection tracking limit because utilization of this allowance is not currently observable.

Troubleshooting took an extended period of time because EC2 packet drops caused by exceeding network allowances were not actively monitored by Atlassian.
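As an example of the visibility that was missing, the sketch below assumes the instances use the ENA driver, which exposes network allowance counters through ethtool; counter availability depends on the driver version, and the interface name here is illustrative.

    import re
    import subprocess

    # ENA driver counters that increment when an instance drops packets for
    # exceeding a network allowance, including connection tracking.
    ALLOWANCE_COUNTERS = (
        "conntrack_allowance_exceeded",
        "pps_allowance_exceeded",
        "bw_in_allowance_exceeded",
        "bw_out_allowance_exceeded",
    )

    def read_allowance_drops(interface="eth0"):
        """Return the allowance-exceeded counters reported by `ethtool -S`."""
        output = subprocess.run(
            ["ethtool", "-S", interface],
            capture_output=True, text=True, check=True,
        ).stdout
        stats = {}
        for line in output.splitlines():
            match = re.match(r"\s*(\S+):\s*(\d+)", line)
            if match and match.group(1) in ALLOWANCE_COUNTERS:
                stats[match.group(1)] = int(match.group(2))
        return stats

    # A steadily increasing conntrack_allowance_exceeded value would have
    # signalled that the DNS servers were dropping packets at this limit.
    print(read_allowance_drops())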

REMEDIAL ACTIONS PLAN & NEXT STEPS

Atlassian acknowledges that outages like this one impact the productivity and business of our customers. Since this incident:

  • We have deployed an immediate change that prevents traffic through Atlassian’s EC2-based DNS infrastructure from consuming the security group connection tracking allowance. As a result of this change, this incident cannot recur.
  • We are investigating ways to improve our visibility into the utilization of AWS network allowance limits and will monitor packet drops due to them.
  • We are continuing our migration of internal services away from the EC2-based DNS infrastructure involved in this incident to a new, distributed architecture that is not subject to these network limits.

Again, we apologize to those customers whose services were impacted during this incident.

Thank you,
Atlassian Customer Support

Posted Dec 03, 2022 - 00:33 UTC

Resolved
On Nov 22 at 20:46 UTC we identified a temporary outage with several Atlassian Cloud products. All affected products are now back online and no further impact has been observed.
We will publish a post incident review with details of the incident and the actions we are taking to prevent a similar problem in the future.
Posted Nov 22, 2022 - 23:44 UTC
Monitoring
We have taken action to mitigate an issue within our networking services. We will continue to monitor for the next 30 minutes before marking the incident as resolved.
Posted Nov 22, 2022 - 23:14 UTC
Update
We are continuing to investigate this issue.
Posted Nov 22, 2022 - 20:59 UTC
Investigating
We're currently investigating an outage in Bitbucket Cloud and other products. We will provide more information as we work to identify the root cause.
Posted Nov 22, 2022 - 20:10 UTC
This incident affected: Website, API, Git via SSH, Authentication and user management, Git via HTTPS, and Pipelines.