Web Portal Access and other services Issue

Incident Report for StreamShark

Postmortem

Incident Impact and Summary:

StreamShark's alerting and monitoring systems, along with numerous end-users, reported issues logging into the StreamShark portal at https://app.streamshark.io and accessing StreamShark services. The outage extended to other StreamShark services and end-points, including VoD and Event rendering where the content was not already cached.

The issue was caused by a major global outage at one of our primary Cloud infrastructure vendors (Google Cloud), which impacted a significant number of Google-hosted services and all Google Cloud Availability Zones (AZ). The relevant Google Cloud incident can be found here; it now includes a full incident report with a root cause analysis. Google summarises the function of the impacted system below:

Google and Google Cloud APIs are served through our Google API management and control planes. Distributed regionally, these management and control planes are responsible for ensuring each API request that comes in is authorized, has the policy and appropriate checks (like quota) to meet their endpoints. The core binary that is part of this policy check system is known as Service Control. Service Control is a regional service that has a regional datastore that it reads quota and policy information from. This datastore metadata gets replicated almost instantly globally to manage quota policies for Google Cloud and our customers.

Per Google’s incident report, Google pushed an insufficiently tested and faulty change to the above-mentioned internal Google system. The change was replicated globally without appropriate rollback controls, causing widespread service failure across Google-hosted services. As a result, service recovery took significantly longer than anticipated.

Impacted components:

  • StreamShark Admin Portal
  • StreamShark API/Backend
  • All StreamShark service end-points (VoD, Event, Channel, etc) where the end-point was not already cached

Overview:

Google Cloud is one of multiple Cloud Infrastructure providers used by StreamShark, and a significant one. StreamShark operates across multiple Google Cloud Availability Zones, depending on the service and infrastructure needed, in line with industry best practice.

The Google Cloud issue impacted the majority of Google Cloud services across all available Google Cloud regions. Many StreamShark components were impacted for different durations - the times below encompass the widest possible period of impact.

Detection:

Detected by End-user reports and StreamShark automated alerting. The StreamShark status page was updated during the issue.

  • First Detected: Thursday, 12 June 2025 at 18:20 (UTC)
  • Mitigated: Thursday, 12 June 2025 at 20:28 (UTC)
  • Ongoing Monitoring: Thursday, 12 June 2025 at 20:45 (UTC)
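As a quick check on the widest period of impact, the durations implied by the timestamps above can be worked out as follows. This is only an illustrative calculation; the variable and function names are made up for this sketch, and the only inputs are the three UTC timestamps listed above.

```python
from datetime import datetime, timezone

# Timestamps copied from the list above (all UTC).
first_detected = datetime(2025, 6, 12, 18, 20, tzinfo=timezone.utc)
mitigated = datetime(2025, 6, 12, 20, 28, tzinfo=timezone.utc)
monitoring = datetime(2025, 6, 12, 20, 45, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timezone-aware timestamps."""
    return int((end - start).total_seconds() // 60)


# First detection to mitigation: 128 minutes (2 h 8 min).
time_to_mitigate = minutes_between(first_detected, mitigated)

# First detection to the start of ongoing monitoring: 145 minutes (2 h 25 min).
widest_impact = minutes_between(first_detected, monitoring)
```

Individual components may have been affected for shorter periods than these figures.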

Remediation:

The incident was unprecedented in extent, affecting multiple services and availability zones simultaneously, so there were minimal levers at our disposal to mitigate the issue in real time. The issue related to an internal quota and policy system in Google Cloud, over which we have no control or visibility. Google Cloud’s Enterprise Support portal was unavailable during the impacted period, so we could not raise a ticket via our Enterprise Support arrangement with Google. However, given the scale and global impact of this outage across multiple customers and industries, a ticket would not have expedited the resumption of service.

In its incident report, Google has outlined a number of steps it will take to prevent this type of issue from occurring again. Google's response is below:

What’s our approach moving forward?

Beyond freezing the system as mentioned above, we will prioritize and safely complete the following:

We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests.

We will audit all systems that consume globally replicated data. Regardless of the business need for near instantaneous consistency of the data globally (i.e. quota management settings are global), data replication needs to be propagated incrementally with sufficient time to validate and detect issues.

We will enforce all changes to critical binaries to be feature flag protected and disabled by default.

We will improve our static analysis and testing practices to correctly handle errors and if need be fail open.

We will audit and ensure our systems employ randomized exponential backoff.

We will improve our external communications, both automated and human, so our customers get the information they need asap to react to issues, manage their systems and help their customers.

We'll ensure our monitoring and communication infrastructure remains operational to serve customers even when Google Cloud and our primary monitoring products are down, ensuring business continuity.
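For readers unfamiliar with the term, the "randomized exponential backoff" mentioned in Google's response can be sketched as below. This is a generic illustration of the technique (the "full jitter" variant), not Google's or StreamShark's implementation; the function name, parameters and default values are all assumptions chosen for the example.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep, rng=random.random):
    """Retry `operation`, waiting a random, exponentially growing delay.

    Hypothetical helper for illustration. Real code should catch only
    errors that are known to be retryable rather than every Exception.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Randomizing the wait spreads out retries from many clients,
            # so a recovering service is not hit by all of them at once.
            sleep(rng() * ceiling)
```

The `sleep` and `rng` parameters exist only so the delay behaviour can be exercised without real waiting.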

 

Whilst we are multi-availability-zone and multi-vendor across many services, some key services are still hosted on a single global cloud vendor. In the near future, we will conduct a feasibility analysis of running more essential StreamShark services across multiple cloud vendors (in a hot/cold failover model, or a dual-master model).
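To illustrate what a hot/cold failover model means in practice, here is a minimal sketch that routes to the first healthy backend in priority order. It is not a description of StreamShark's architecture or of the outcome of the feasibility analysis; the `Backend` type, the `pick_backend` function and the backend names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Backend:
    name: str                       # hypothetical label, e.g. a cloud vendor
    is_healthy: Callable[[], bool]  # health probe supplied by the caller


def pick_backend(backends: Sequence[Backend]) -> Backend:
    """Return the first healthy backend, checking in priority order."""
    for backend in backends:
        try:
            if backend.is_healthy():
                return backend
        except Exception:
            # A probe that errors out is treated as an unhealthy backend.
            continue
    raise RuntimeError("no healthy backend available")


# Example: the primary (hot) vendor is down, so the standby (cold) is chosen.
primary = Backend("primary-cloud", lambda: False)
standby = Backend("standby-cloud", lambda: True)
chosen = pick_backend([primary, standby])
```

A dual-master model would differ in that both backends serve traffic at the same time, rather than the standby being used only on failure.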

Posted Jun 20, 2025 - 05:34 UTC

Resolved

The incident was resolved on Thursday, 12 June 2025 at 20:45 (UTC). All services have been verified and are operating as expected. We will continue to monitor to ensure ongoing stability.
Posted Jun 13, 2025 - 02:04 UTC

Update

All services have been fully restored and are operating normally. We will continue to monitor platform metrics to ensure ongoing stability and performance.
Posted Jun 12, 2025 - 21:10 UTC

Monitoring

Services have been restored for the majority of users, and we are continuing to monitor system performance closely.
Posted Jun 12, 2025 - 20:45 UTC

Identified

The root cause of the issue has been identified, and mitigation measures have been applied. While some users may still experience intermittent issues, the Web Portal and related services are gradually being restored.
Posted Jun 12, 2025 - 20:28 UTC

Update

We are currently experiencing issues with access to the Web Portal and partial disruptions across several services. Our team is actively investigating the root cause, and we will provide further updates as more information becomes available.
Posted Jun 12, 2025 - 19:11 UTC

Update

We are continuing to investigate this issue.
Posted Jun 12, 2025 - 18:50 UTC

Investigating

We are aware of an access issue affecting the Web Portal and are currently investigating the cause. We will provide updates as more information becomes available.
Posted Jun 12, 2025 - 18:48 UTC
This incident affected: Web Portal, Streaming Events, Video on Demand, Live Scheduler, and Video Player.