Incident Impact and Summary:
StreamShark's alerting and monitoring systems, along with numerous end-users, reported issues logging into the StreamShark portal at https://app.streamshark.io and accessing StreamShark services. The outage extended to other StreamShark services and endpoints, including VoD and Event rendering (where content was not already cached).
The issue was caused by a major global outage at one of our primary cloud infrastructure vendors (Google Cloud), which impacted a significant number of Google-hosted services and all Google Cloud Availability Zones (AZs). The relevant Google Cloud incident can be found here, and now includes a full incident report with a root cause analysis. Google summarises the impacted system below:
Google and Google Cloud APIs are served through our Google API management and control planes. Distributed regionally, these management and control planes are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints. The core binary that is part of this policy check system is known as Service Control. Service Control is a regional service that has a regional datastore that it reads quota and policy information from. This datastore metadata gets replicated almost instantly globally to manage quota policies for Google Cloud and our customers.
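To make the failure mode concrete, the following is a minimal, hypothetical sketch (not Google's actual code) of a regional policy check reading quota/policy metadata from a globally replicated datastore. Because the check "fails closed", a single bad record replicated everywhere rejects every request in every region at once:

```python
# Hypothetical sketch of a "Service Control"-like regional policy check.
# All names and data structures are illustrative, not Google's implementation.

REPLICATED_POLICY_STORE = {
    # Quota/policy metadata, replicated near-instantly to every region.
    "projects/demo": {"quota_limit": 1000, "allowed": True},
}

def check_request(project: str, used: int) -> bool:
    """Authorize an API request against replicated quota/policy metadata.

    This version fails CLOSED: missing or malformed metadata rejects the
    request, which is how one faulty global replication can simultaneously
    break every region that reads from the store.
    """
    policy = REPLICATED_POLICY_STORE.get(project)
    if policy is None or "quota_limit" not in policy:
        return False  # fail closed: no valid policy -> reject
    return policy["allowed"] and used < policy["quota_limit"]
```

Under this model, a faulty change replicated into the store (for example, records with missing fields) makes every regional instance of the check start rejecting traffic at the same time.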
Per Google’s incident report, Google pushed an insufficiently tested, faulty change to the above-mentioned internal system without appropriate rollback controls. The change was subsequently replicated globally, causing widespread service failure across Google-hosted services, and as a result service recovery took significantly longer than anticipated.
Impacted components:
Overview:
Google Cloud is one of multiple Cloud Infrastructure providers used by StreamShark, but a significant one. StreamShark operates across multiple Google Cloud Availability Zones depending on the service and infrastructure needed, per industry standard best practice.
The Google Cloud issue impacted the majority of Google Cloud services across all available Google Cloud regions. Many StreamShark components were impacted for different durations - the times below encompass the widest possible period of impact.
Detection:
Detected by end-user reports and StreamShark's automated alerting. The StreamShark status page was updated throughout the incident.
Remediation:
The extent of the incident, affecting multiple services and availability zones simultaneously, was unprecedented, which meant we had few levers at our disposal to mitigate the issue in real time. The issue related to an internal quota and policy system in Google Cloud over which we have no control or visibility. Google Cloud's Enterprise Support portal was itself unavailable during the impacted period, so we could not raise a ticket via our Enterprise Support arrangement with Google. However, given the scale and global impact of this outage across many customers and industries, a support ticket would not have expedited restoration of service.
In Google’s incident report, they have outlined a number of steps they will take to prevent this type of issue from recurring. Google's response is below:
What’s our approach moving forward?
Beyond freezing the system as mentioned above, we will prioritize and safely complete the following:
We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests.
We will audit all systems that consume globally replicated data. Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues.
We will enforce all changes to critical binaries to be feature flag protected and disabled by default.
We will improve our static analysis and testing practices to correctly handle errors and if need be fail open.
We will audit and ensure our systems employ randomized exponential backoff.
We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers.
We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity.
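The "fails open" remediation in Google's list above can be illustrated with a short sketch (ours, not Google's code): if the policy check itself errors, the request is served anyway and the failure is logged, trading strict enforcement for availability.

```python
import logging

def fail_open(check):
    """Decorator: if the wrapped policy check raises, allow the request
    and log the failure rather than rejecting all traffic (fail open)."""
    def wrapper(*args, **kwargs):
        try:
            return check(*args, **kwargs)
        except Exception:
            logging.exception("policy check failed; failing open")
            return True  # serve the request despite the broken check
    return wrapper

@fail_open
def quota_check(project: str) -> bool:
    # Illustrative check that depends on replicated metadata; an unknown
    # project raises KeyError, simulating corrupt or missing policy data.
    policies = {"projects/demo": {"quota_ok": True}}
    return policies[project]["quota_ok"]
```

With this pattern, a corrupt policy store degrades enforcement rather than taking down the API surface entirely.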
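Randomized exponential backoff, also on Google's list, stops recovering services from being overwhelmed by clients all retrying in lockstep. A minimal sketch of the "full jitter" variant (our illustration, with assumed base and cap values):

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 60.0, attempts: int = 6):
    """Yield randomized exponential backoff delays ("full jitter"):
    each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so retrying clients spread out instead of hammering a recovering
    service in synchronized waves."""
    for attempt in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```

The jitter (the random draw) is the important part: a plain exponential schedule still lets every client retry at the same instant.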
Whilst we are multi-availability-zone and multi-vendor across many services, some key services are still hosted with a single global cloud vendor. In the near future we will conduct a feasibility analysis of running more essential StreamShark services across multiple cloud vendors, in either a hot/cold failover model or a dual-master model.
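A hot/cold failover model of the kind mentioned above can be sketched as a health-probe-driven endpoint selector. This is a simplified illustration only; the endpoint URLs are hypothetical and real failover would typically sit at the DNS or load-balancer layer:

```python
from urllib.request import urlopen

# Hypothetical endpoints on two different cloud vendors.
PRIMARY = "https://primary.example.com/healthz"
SECONDARY = "https://secondary.example.com/healthz"

def healthy(url: str, timeout: float = 2.0) -> bool:
    """Probe an endpoint; any error or non-200 counts as unhealthy."""
    try:
        return urlopen(url, timeout=timeout).status == 200
    except Exception:
        return False

def select_endpoint(probe=healthy) -> str:
    """Hot/cold failover: serve from the primary vendor while its probe
    passes; cut over to the cold standby when it fails."""
    return PRIMARY if probe(PRIMARY) else SECONDARY
```

Passing the probe as a parameter keeps the failover decision testable without real network calls; a dual-master model would instead serve from both vendors concurrently.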