Friday 2nd August 2024

Infrastructure Global outage

We are experiencing a global outage. We observed a network split in addition to an event bus outage. Some core services have been impacted.

EDITS:

  • 2:00 PM CEST - Core services are being recovered and deployments are being reloaded. This will resynchronize the load balancers for customers' applications trying to reach their new deployments.
  • 2:08 PM CEST - Some services are being shut down to accelerate the recovery process. Expect a degraded experience for observability and deployments for a few minutes.
  • 2:29 PM CEST - Critical core services are OK. Deployments are being rolled out.
  • 3:07 PM CEST - Some workload queues are still having difficulty being processed. Some components may still be in an unstable state. The current effort is to identify them, then reload them.
  • 3:40 PM CEST - Some hypervisors have crashed. The recovery process is under way and will take a couple of minutes.
  • 3:56 PM CEST - Some hypervisors still seem to be experiencing network issues.
  • 4:16 PM CEST - Apps are being deployed for premium customers. All apps are going to be deployed. Customers can accelerate the process for their own applications by deploying them manually.
  • 4:24 PM CEST - In the meantime, we continue to identify noisy VMs that have been impacted by the outage.
  • 5:15 PM CEST - Metrics API is being restarted.
  • 6:20 PM CEST - The last deployments are being rolled out. Reminder: you can accelerate this by triggering a redeploy action.
  • 6:30 PM CEST - A few hundred VMs are still consuming very high CPU and are being cleaned up.
  • 6:35 PM CEST - We estimate approximately 40 minutes until all application deployments are fully recovered (MANUALLY REDEPLOY FOR FASTER RECOVERY).
  • 7:05 PM CEST - All IPSec links should be back online