On July 06, 2021, between 21:38 and 22:41 UTC, customers using Bitbucket Cloud products were unable to perform Git operations over HTTPS or SSH. The event was triggered by side effects of enabling a feature flag that toggles new behavior in a caching layer used for authentication. All Bitbucket Cloud customers were impacted. The incident was detected within 2 minutes by automated internal monitoring systems and mitigated by disabling the feature flag, which returned Atlassian systems to a known good state. The total time to resolution was about 1 hour and 3 minutes.
The impact on Bitbucket Cloud products lasted from 21:38 UTC to 22:41 UTC on July 06, 2021. The incident caused service disruption to all customers attempting to perform Git operations over HTTPS and SSH. Customers would have noticed that they were unable to execute commands such as git push, git pull, and git clone, and would instead have received HTTP 401 or permission denied errors.
The issue was caused by a feature flag change targeting another part of the system (the caching layer for authentication), which unexpectedly impacted Git services. As a result, users of Bitbucket Cloud and of products that integrate with Bitbucket Cloud could not run Git operations such as git clone, git pull, and git push. The root cause of the incident was that the change created unexpected load on a caching layer that is shared between these two parts of the system. The lack of visibility into this recent and seemingly unrelated feature flag change kept the engineers troubleshooting the issue from identifying the source of the problem right away.
We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the change was initiated by one team but adversely affected services owned and maintained by a different team. The team that made the change monitored their own systems and confirmed that they were operating normally after the flag was enabled. The impact was instead reflected in the monitoring systems of the other team.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to those customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support