# Clever Cloud Incident – Explanations and Lessons Learned
Today we experienced an incident affecting our infrastructure. Here is a summary of the causes, ongoing analysis, and planned actions to strengthen our resilience.
We will publish a detailed post-mortem and schedule meetings with customers to go over the incident.
We will keep you informed about these actions. Thank you for your patience and trust.
The Clever Cloud Team
EDIT 2:11pm (UTC): One of the Paris datacenters has experienced a power issue. Some hypervisors have been rebooted.
EDIT 2:21pm (UTC): Applications are being restarted; the situation is stabilizing.
EDIT 2:28pm (UTC): Monitoring is up and generating new statuses.
EDIT 2:34pm (UTC): The recovery process is still ongoing. Customers can open a ticket through their email endpoint in addition to the Web Console.
EDIT 2:43pm (UTC): The infrastructure is under high load. We're accelerating the recovery process with load sanitization.
EDIT 2:48pm (UTC): All load balancers are now back in sync. Per-service availability:
EDIT 3:00pm (UTC): Deployments are working but still slow. The Clever Cloud API is being restarted.
EDIT 3:05pm (UTC): Infrastructure load sanitization: 30%
EDIT 3:40pm (UTC): The overall situation is better. There are still a few thousand VMs in the recovery queue.
EDIT 3:48pm (UTC): Most databases should be available. We're experiencing an additional delay with some encrypted databases.
EDIT 3:53pm (UTC): Infrastructure load sanitization: 100%
EDIT 4:15pm (UTC): The remaining databases are recovered. Full resolution is expected within the next 10 to 15 minutes.
EDIT 4:30pm (UTC): All applications should run fine (from the orchestration point of view). Orchestration monitoring is partially up and running; stuck applications will be unstuck shortly.
EDIT 5:10pm (UTC): All stuck applications should now be available (from the monitoring point of view).
EDIT 5:16pm (UTC): All databases are available.
EDIT 5:44pm (UTC): The incident is considered closed (a few remaining cases are still being handled with customers).