# Clever Cloud Incident – Explanations and Lessons Learned
Today we experienced an incident affecting our infrastructure. Here is a summary of the causes, ongoing analysis, and planned actions to strengthen our resilience.
Incident Timeline:
- Power Outage: A power failure reduced our computing capacity by one-third, highlighting the need to expand our infrastructure to five datacenters to better absorb such incidents.
- Network Issue: A network outage related to BGP announcements followed, revealing an underlying issue that requires further investigation to prevent recurrence (an illustrative monitoring sketch follows this list).
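For readers wondering what the investigation around BGP announcements can involve, here is a minimal, hypothetical watcher that alerts when a BGP session leaves the Established state. It assumes an FRR-based router where `vtysh -c "show ip bgp summary json"` is available and that the output follows FRR's usual JSON layout (`ipv4Unicast.peers.<ip>.state`); none of this reflects Clever Cloud's actual tooling.

```python
import json
import subprocess
import time

def bgp_peer_states():
    """Return {peer_ip: session_state} from a local FRR instance.

    Assumes FRR's `show ip bgp summary json` output shape; adjust the
    JSON path if your FRR version differs.
    """
    out = subprocess.run(
        ["vtysh", "-c", "show ip bgp summary json"],
        capture_output=True, text=True, check=True,
    ).stdout
    peers = json.loads(out).get("ipv4Unicast", {}).get("peers", {})
    return {ip: p.get("state", "Unknown") for ip, p in peers.items()}

def watch(interval=10):
    """Poll peer states and flag sessions that drop out of Established."""
    previous = {}
    while True:
        current = bgp_peer_states()
        for ip, state in current.items():
            if state != "Established" and previous.get(ip) == "Established":
                print(f"ALERT: BGP session to {ip} left Established ({state})")
        previous = current
        time.sleep(interval)
```

A watcher like this only detects withdrawn sessions; the actual root-cause analysis of why announcements were affected is part of the post-mortem work mentioned below.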
Why Did Recovery Take Time?
- Machine Reconnection: The corrective measure to prevent overload during VM reconnection to Pulsar was not fully effective. An in-depth analysis is underway to improve this process (a sketch of what such a measure typically looks like follows this list).
- Orchestration Evolution: Our current system is reaching its limits. We are working on a new orchestration architecture to better manage recovery and optimize performance.
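As a rough illustration of what a measure "to prevent overload during reconnection" usually looks like (a minimal sketch, not Clever Cloud's actual implementation): cap how many clients may reconnect at once, and spread retries with jittered exponential backoff. The `connect` callable, the cap of 50, and the delay values below are all assumptions made for the example.

```python
import random
import threading
import time

# Hypothetical reconnection throttle: a global semaphore caps how many VMs
# may (re)establish their messaging session at once, and each attempt uses
# exponential backoff with full jitter so retries do not arrive in lockstep.
MAX_CONCURRENT_RECONNECTS = 50  # assumed cap, not a Clever Cloud value
_reconnect_slots = threading.Semaphore(MAX_CONCURRENT_RECONNECTS)

def reconnect_with_backoff(connect, max_attempts=8, base_delay=1.0, cap=60.0):
    """Call `connect()` (a placeholder for the real session setup) under the
    global concurrency cap, retrying with jittered backoff on failure."""
    for attempt in range(max_attempts):
        with _reconnect_slots:
            try:
                return connect()
            except ConnectionError:
                pass
        # Sleep a random amount up to an exponentially growing ceiling,
        # bounded by `cap`, before trying again.
        time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

The point of the jitter and the cap is to turn thousands of simultaneous reconnections into a stream the broker can absorb, instead of a single spike.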
Next Steps:
We will publish a detailed post-mortem and schedule meetings with customers to:
- Analyze the incident,
- Explain upcoming changes,
- Demonstrate our commitment to improving infrastructure resilience.
We will keep you informed about these actions. Thank you for your patience and trust.
The Clever Cloud Team
Detailed timeline:
EDIT 2:11pm (UTC): One of the Paris datacenters has experienced an electricity issue. Some hypervisors have been rebooted.
EDIT 2:21pm (UTC): Applications are being restarted, the situation is stabilizing.
EDIT 2:28pm (UTC): Monitoring is up and generating new statuses.
EDIT 2:34pm (UTC): Recovery process is still ongoing. Customers can open a ticket through their email endpoint, in addition to the Web Console.
EDIT 2:43pm (UTC): Infrastructure is under high load. We're accelerating the recovery process with load sanitization.
EDIT 2:48pm (UTC): All load balancers are now back in sync. Per-service availability:
- Load Balancing (dedicated instances): OK
- Cellar Storage: OK
- Orchestration: OK, but recovering a lot of runtime instances.
- API: OK
- Metrics API: KO (network topology split)
- Databases: Globally OK (individual situations being worked on)
- Monitoring: OK (back in sync for the past couple of minutes)
- Infrastructure overall load: high
EDIT 3:00pm (UTC): Deployments are going through, but still slowly. The Clever Cloud API is being restarted.
EDIT 3:05pm (UTC): Infrastructure load sanitization: 30%
EDIT 3:40pm (UTC): Overall situation is better. Still a few thousand VMs in the recovery queue.
EDIT 3:48pm (UTC): Most databases should be available. We're experiencing an additional delay with some encrypted databases.
EDIT 3:53pm (UTC): Infrastructure load sanitization: 100%
EDIT 4:15pm (UTC): Remaining databases are recovered. Total fix is expected within the next 10-15 minutes.
EDIT 4:30pm (UTC): All applications should run fine (from the orchestration point of view). Orchestration monitoring is partially up and running (stuck apps will be quickly unstuck).
EDIT 5:10pm (UTC): All stuck applications should now be available (from the monitoring point of view).
EDIT 5:16pm (UTC): All databases are available.
EDIT 5:44pm (UTC): The incident is considered closed (a few remaining cases are still being handled with customers).