Monday 3rd March 2025

Infrastructure [PAR] Power supply failure on region EU-FR-1 (Paris)

# Clever Cloud Incident – Explanations and Lessons Learned

Today we experienced an incident affecting our infrastructure. Here is a summary of the causes, ongoing analysis, and planned actions to strengthen our resilience.

Incident Timeline:

  1. Power Outage: A power failure reduced our computing capacity by one-third, highlighting the need to expand our infrastructure to five datacenters to better absorb such incidents.
  2. Network Issue: A network outage related to BGP announcements followed, revealing an underlying issue that requires further investigation to prevent recurrence.

Why Did Recovery Take Time?

  1. Machine Reconnection: The corrective measure to prevent overload during VM reconnection to Pulsar was not fully effective. An in-depth analysis is underway to improve this process (a conceptual sketch of the general pattern follows this list).
  2. Orchestration Evolution: Our current system is reaching its limits. We are working on a new orchestration architecture to better manage recovery and optimize performance.
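
The corrective measure itself will be described in the post-mortem; as a general illustration of the overload problem it targets, the sketch below shows a common way to soften reconnection storms: capped exponential backoff with random jitter, so that thousands of VMs coming back after a reboot do not all hit the message broker at the same instant. Everything in it is hypothetical (the `connect_to_broker` placeholder and all parameter values are assumptions for illustration), and it is not Clever Cloud's actual implementation.

```python
import random
import time


def connect_to_broker() -> None:
    """Hypothetical stand-in for the real broker connection call (e.g. a Pulsar client)."""
    raise ConnectionError("broker not reachable yet")  # always fails in this sketch


def reconnect_with_backoff(base: float = 1.0, cap: float = 30.0, max_attempts: int = 8) -> bool:
    """Retry a broker connection with capped exponential backoff and full jitter.

    Spreading retries randomly over a growing window keeps thousands of
    rebooted clients from reconnecting at the exact same moment.
    """
    for attempt in range(max_attempts):
        try:
            connect_to_broker()
            return True
        except ConnectionError:
            # Full jitter: sleep a random duration in [0, min(cap, base * 2**attempt)].
            time.sleep(random.uniform(0, min(cap, base * (2 ** attempt))))
    return False
```

In practice this kind of throttling can live on the client side, on the broker side, or both; which part of the measure fell short here is exactly what the ongoing analysis is meant to determine.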

Next Steps:

We will publish a detailed post-mortem and schedule meetings with customers to:

  • Analyze the incident,
  • Explain upcoming changes,
  • Demonstrate our commitment to improving infrastructure resilience.

We will keep you informed about these actions. Thank you for your patience and trust.

The Clever Cloud Team

Detailed timeline:

EDIT 2:11pm (UTC): One of the Paris datacenters has experienced a power issue. Some hypervisors have been rebooted.

EDIT 2:21pm (UTC): Applications are being restarted, the situation is stabilizing.

EDIT 2:28pm (UTC): Monitoring is up and generating new statuses.

EDIT 2:34pm (UTC): The recovery process is still ongoing. Customers can open a ticket through their email endpoint in addition to the Web Console.

EDIT 2:43pm (UTC): Infrastructure is under high load. We're accelerating the recovery process with load sanitization.

EDIT 2:48pm (UTC): All load balancers are now back in sync. Per-service availability:

  • Load Balancing (dedicated instances): OK
  • Cellar Storage: OK
  • Orchestration: OK, but recovering a lot of runtime instances.
  • API: OK
  • Metrics API: KO (network topology split)
  • Databases: globally OK (individual situations being worked on)
  • Monitoring: OK (back in sync for the last couple of minutes)
  • Overall infrastructure load: high

EDIT 3:00pm (UTC): Deployments are going through but are still slow. The Clever Cloud API is being restarted.

EDIT 3:05pm (UTC): Infrastructure load sanitization: 30%

EDIT 3:40pm (UTC): The overall situation is better. There are still a few thousand VMs in the recovery queue.

EDIT 3:48pm (UTC): Most databases should be available. We're experiencing an additional delay with some encrypted databases.

EDIT 3:53pm (UTC): Infrastructure load sanitization: 100%

EDIT 4:15pm (UTC): The remaining databases are recovered. A full fix is expected within the next 10–15 minutes.

EDIT 4:30pm (UTC): All applications should run fine (from the orchestration point of view). Orchestration monitoring is partially up and running (stuck applications will be unstuck shortly).

EDIT 5:10pm (UTC): All stuck applications should now be available (from the monitoring point of view).

EDIT 5:16pm (UTC): All databases are available.

EDIT 5:44pm (UTC): The incident is considered closed (a few remaining cases are still being handled with customers).