Portal Access Issue
Incident Report for StreamShark
Postmortem

Post Mortem - StreamShark Portal Access Issues

Summary

On Monday, November 11, 2024, at 00:43 UTC, users began experiencing elevated error rates with the StreamShark management portal login function, preventing some users from accessing the portal. The issue was identified and mitigated by 02:39 UTC, restoring access to the portal, and was fully resolved at 05:52 UTC. We sincerely apologize to the customers who were affected and regret any inconvenience this may have caused.

Customer Impact:

Before November 11, 2024, we received a small number of reports regarding one-off portal login failures, most of which were resolved upon refreshing the page.

From November 11, 2024, 00:43 to 02:39 UTC (1 hour, 56 minutes):

  • Portal Access: Users may have experienced difficulty logging into the portal via both standard login and SSO login methods.
  • VoD Deployment: New Video on Demand (VoD) deployment was delayed for some users.
  • Encoder Connection: Initial connection/publish attempts from both software encoders (e.g. OBS, vMix) and hardware encoders (e.g. Elemental, Pearl) may have been incorrectly rejected.

All other services were unaffected, in particular:

  • Player Consumption: Playback across all platforms (Streaming Events, Video on Demand, Live Schedule) was unaffected, with continuous uninterrupted service.
  • Portal functionality: Once logged into the portal, all standard functions operated as expected.

Root Cause:

The issue originated from a server-side function deployed within a managed cloud environment that began returning inconsistent results when extracting IP addresses for incoming requests.

The cause of the inconsistency was a recent change in the cloud vendor’s environment that affected the function’s result. In a subset of cloud instances, instead of returning the expected public IP address, the function incorrectly returned ‘127.0.0.1’ (localhost). The cloud vendor rolled this change out in stages, so it initially affected only a very small number of users, but the impact escalated on November 11, when a larger number of instances were updated.

This error caused authentication attempts to fail, affecting login and other functions where an accurate IP address (combined with other authentication tokens) is required by security whitelist checks.
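
The failure mode described above can be illustrated with a minimal Python sketch. It is not the actual StreamShark function, which is not published; the names `extract_client_ip` and `_public_ip`, and the use of the `X-Forwarded-For` header as a fallback, are assumptions made for illustration. The idea is that a loopback or otherwise non-routable address is treated as "no usable IP" rather than being passed on to a whitelist check.

```python
import ipaddress
from typing import Mapping, Optional


def _public_ip(candidate: str) -> Optional[str]:
    """Return the normalized address if it is a valid, globally routable IP."""
    try:
        addr = ipaddress.ip_address(candidate.strip())
    except ValueError:
        return None
    # is_global is False for loopback (127.0.0.1), private, and reserved ranges.
    return str(addr) if addr.is_global else None


def extract_client_ip(remote_addr: str, headers: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: prefer the platform-reported address, but never
    accept a loopback or private value as the client IP."""
    direct = _public_ip(remote_addr)
    if direct is not None:
        return direct
    # Assumed fallback: first routable entry in X-Forwarded-For. This header
    # should only be trusted when it is set by infrastructure you control.
    # The lookup is case-sensitive; real frameworks usually normalize names.
    for part in headers.get("X-Forwarded-For", "").split(","):
        forwarded = _public_ip(part)
        if forwarded is not None:
            return forwarded
    # No usable address: the caller can report "IP unavailable" instead of
    # a misleading whitelist mismatch.
    return None
```

With this shape, a request whose platform-reported address is `127.0.0.1` yields either a routable forwarded address or `None`, so the caller can tell an unavailable IP apart from a genuine whitelist failure.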

Corrective Actions:

  • Function Update: The IP extraction function was re-engineered to handle the inconsistent behavior observed from the cloud provider, ensuring accurate IP address retrieval.
  • Code Audit: All functions that depend on IP accuracy were reviewed and updated to accommodate potential inconsistencies, preventing delays in future service deployment.
  • Enhanced Monitoring: New alerting mechanisms were implemented to monitor authentication request failures related to IP validation. These alerts have been integrated into the standard on-call alerting system to proactively detect and address similar issues in the future.
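
As an illustration of the kind of alerting described in the last item, the following Python sketch tracks the share of authentication requests that fail IP validation over a sliding time window. The class name, the window length, the threshold, and the minimum request count are all hypothetical; the report does not state how the real alerts are configured.

```python
from collections import deque
from typing import Deque, Tuple


class IpValidationFailureMonitor:
    """Hypothetical sliding-window monitor for IP-validation failures."""

    def __init__(self, window_seconds: float = 300.0,
                 threshold: float = 0.05, min_requests: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold        # failure ratio that triggers an alert
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, ip_validation_failed: bool) -> bool:
        """Record one authentication request; return True if an alert should fire."""
        self._events.append((timestamp, ip_validation_failed))
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_requests:
            return False
        failures = sum(1 for _, failed in self._events if failed)
        return failures / total > self.threshold
```

In a setup like this, a `True` return value would be forwarded to the on-call alerting system; the minimum request count keeps isolated one-off failures from paging anyone.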
Posted Nov 13, 2024 - 03:52 UTC

Resolved
The issue causing errors when accessing the Portal was fully resolved as of 05:52 UTC on November 11, 2024. We thank you for your patience while we worked on resolving the issue.
Posted Nov 11, 2024 - 05:52 UTC
Identified
The cause of the issue has been identified and a mitigation has been applied to restore portal access.
Posted Nov 11, 2024 - 02:39 UTC
Investigating
We are seeing elevated error rates for requests affecting portal login and a few other service components. We are currently investigating with our backend vendor to determine the root cause. Further updates to follow.
Posted Nov 11, 2024 - 00:43 UTC
This incident affected: Video on Demand, Video Player, and Storage.