Hardware Encoder polling issue

Incident Report for StreamShark

Postmortem

Incident Impact

Commands generated for managed hardware encoders could not be delivered to the encoders. This affected the following workflows:

  • Hardware encoder provisioning (for event create)
  • Hardware encoder deprovisioning (for event delete)
  • Hardware encoder stream start
  • Hardware encoder stream stop

The incident lasted from approximately 2:59 PM to 3:21 PM PDT on March 25, 2026.

Overview

StreamShark makes use of Google Cloud infrastructure to manage communications between our system and managed hardware encoders. When customers perform actions via the StreamShark portal — such as creating or deleting events, or starting and stopping streams — commands are generated and placed into Google-managed queues (Cloud Tasks) for delivery to the corresponding encoder.
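
For context on this flow, a command submission in this kind of setup can be sketched with the Google Cloud Tasks Python client as below. This is an illustrative sketch only; the project, queue, and endpoint names are hypothetical, not our actual configuration.

    # Illustrative sketch only: the project, queue, and delivery endpoint
    # below are hypothetical, not StreamShark's actual configuration.
    import json
    from google.cloud import tasks_v2

    client = tasks_v2.CloudTasksClient()
    # Fully qualified queue path: projects/<project>/locations/<region>/queues/<queue>
    parent = client.queue_path("example-project", "us-west1", "encoder-commands")

    command = {"encoder_id": "enc-1234", "action": "stream_start"}
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": "https://example.invalid/encoder-command",  # hypothetical endpoint
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(command).encode(),
        }
    }

    # create_task raises a google.api_core exception if the Cloud Tasks
    # service rejects the submission, which is what happened in this incident.
    response = client.create_task(parent=parent, task=task)
    print(f"Enqueued task {response.name}")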

A transient error in the Google Cloud Tasks service prevented our system from submitting commands into these queues. As a result:

  • Hardware encoder commands could not be delivered, which blocked downstream tasks that depend on an encoder starting (e.g. Location Session starts).
  • The Hardware Encoder management page incorrectly showed all encoders as down (not polling).

Detection

The issue was identified following customer reports that recording sessions linked to managed hardware encoders in the Locations product could not be started.

Root Cause Detail

A sustained run of transient errors from the Google Cloud Tasks service prevented our system from submitting commands into the queue. Our system automatically retries transient errors with backoff, but in this case the errors persisted for the duration of the incident.
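
A minimal sketch of what retrying a queue submission with backoff looks like, assuming the Cloud Tasks Python client from the earlier sketch; the attempt count, delays, and exception set below are illustrative rather than our production policy:

    # Minimal retry-with-backoff sketch; the attempt count, delays, and
    # exception set are illustrative, not our production retry policy.
    import random
    import time

    from google.api_core import exceptions as gexc

    TRANSIENT = (gexc.ServiceUnavailable, gexc.DeadlineExceeded, gexc.InternalServerError)

    def submit_with_backoff(submit, max_attempts=5, base_delay=0.5):
        """Call submit(), retrying transient errors with jittered exponential
        backoff. Re-raises once attempts are exhausted, which is effectively
        what happened here: the errors outlasted every retry window."""
        for attempt in range(1, max_attempts + 1):
            try:
                return submit()
            except TRANSIENT:
                if attempt == max_attempts:
                    raise
                delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
                time.sleep(delay)

    # Usage with the Cloud Tasks client from the earlier sketch:
    # submit_with_backoff(lambda: client.create_task(parent=parent, task=task))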

The upstream errors subsided at approximately 3:21 PM PDT, and normal command processing resumed shortly afterward. Encoder status reporting in the portal was temporarily delayed during recovery but was restored to an accurate state.

Remediation

We are taking the following steps to prevent this type of incident from occurring again and to improve recovery speed:

  • Improved monitoring and alerting — we are implementing tighter alerting thresholds to detect queue processing issues earlier and reduce time to resolution.
  • Infrastructure resilience — we are adding safeguards to ensure our backend services remain available during transient upstream issues, reducing recovery time.
  • Fallback processing paths — we are investigating alternative approaches to command delivery and encoder status reporting so that a single upstream dependency failure does not block these workflows (one possible shape is sketched below).
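
We have not committed to a specific fallback design yet. Purely as an illustration of the direction, the sketch below buffers commands in a durable local store when the upstream enqueue fails and replays them once the queue recovers; every name in it is hypothetical.

    # Hypothetical fallback sketch; not a committed design. Commands that
    # fail to enqueue upstream are persisted locally and drained on recovery.
    import json
    import sqlite3

    class BufferedDispatcher:
        def __init__(self, enqueue_upstream, db_path="pending_commands.db"):
            self._enqueue = enqueue_upstream  # e.g. a Cloud Tasks submit function
            self._db = sqlite3.connect(db_path)
            self._db.execute(
                "CREATE TABLE IF NOT EXISTS pending (id INTEGER PRIMARY KEY, body TEXT)"
            )

        def dispatch(self, command: dict) -> None:
            try:
                self._enqueue(command)
            except Exception:
                # Upstream unavailable: persist locally instead of dropping.
                with self._db:
                    self._db.execute(
                        "INSERT INTO pending (body) VALUES (?)", (json.dumps(command),)
                    )

        def drain(self) -> None:
            # Called periodically; replays buffered commands in order.
            rows = self._db.execute("SELECT id, body FROM pending ORDER BY id").fetchall()
            for row_id, body in rows:
                self._enqueue(json.loads(body))
                with self._db:
                    self._db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
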
Posted Mar 27, 2026 - 03:45 UTC

Resolved

The issue was identified as a problem with the cloud provider’s task queue service, which prevented tasks from being processed correctly. This has now been resolved, and operations have returned to normal. Preventative measures will be implemented to mitigate similar incidents in the future, and a detailed post-mortem will be shared soon.
Posted Mar 25, 2026 - 22:21 UTC

Investigating

We have observed potential polling issues with the Hardware Encoder that may be delaying operation processing.
Posted Mar 25, 2026 - 21:59 UTC
This incident affected: Streaming Events.