Some systems are experiencing issues

Past Incidents

Saturday 8th March 2025

No incidents reported

Friday 7th March 2025

Infrastructure [MEA] A hypervisor is rebooting

A hypervisor in the MEA region is rebooting; we are working to restart all services.

EDIT 16:44 CET: The hypervisor has rebooted and all services are available again.

Thursday 6th March 2025

No incidents reported

Wednesday 5th March 2025

No incidents reported

Tuesday 4th March 2025

Infrastructure [Gouv] Network reachability issue

We are experiencing a network reachability issue on the Gouv region. We are looking into it.

EDIT 13:22 CET: Our infrastructure provider acknowledged the incident and is working on it.

EDIT 13:46 CET: Our infrastructure provider continues to investigate the issue.

EDIT 13:52 CET: The whole region is impacted; no service hosted there can be reached.

EDIT 14:17 CET: A fix has been implemented on our infrastructure provider's side. We have had access to the infrastructure again since 14:08 and are making sure all services are restarted.

EDIT 14:32 CET: All services have been restarted; we are monitoring for any further issues.

EDIT 14:40 CET: All KMS nodes are fully operational

Monday 3rd March 2025

Infrastructure [PAR] Power supply failure on region EU-FR-1 (Paris)

Clever Cloud Incident – Explanations and Lessons Learned

Today we experienced an incident affecting our infrastructure. Here is a summary of the causes, ongoing analysis, and planned actions to strengthen our resilience.

Incident Timeline:

  1. Power Outage: A power failure reduced our computing capacity by one-third, highlighting the need to expand our infrastructure to five datacenters to better absorb such incidents.
  2. Network Issue: A network outage related to BGP announcements followed, revealing an underlying issue that requires further investigation to prevent recurrence.

Why Did Recovery Take Time?

  1. Machine Reconnection: The corrective measure to prevent overload during VM reconnection to Pulsar was not fully effective. An in-depth analysis is underway to improve this process.
  2. Orchestration Evolution: Our current system is reaching its limits. We are working on a new orchestration architecture to better manage recovery and optimize performance.
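The reconnection problem described in point 1 (a fleet of VMs all reconnecting to Pulsar at once and overloading it) is commonly mitigated with randomized exponential backoff. The sketch below is a generic illustration of that technique, not Clever Cloud's actual corrective measure; the `connect` callable and the parameter values are hypothetical:

```python
import random
import time

def reconnect_with_backoff(connect, base=1.0, cap=60.0, max_attempts=8):
    """Retry `connect` with full-jitter exponential backoff.

    Spreading retries out randomly prevents machines that all lost
    their broker at the same instant from reconnecting in lockstep
    and overloading it again (a "thundering herd").
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Full jitter: sleep a random time in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

With full jitter, the spread of retry times grows with each failed attempt, so load on the recovering broker ramps up gradually instead of arriving in synchronized waves.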

Next Steps:

We will publish a detailed post-mortem and schedule meetings with customers to:

  • Analyze the incident,
  • Explain upcoming changes,
  • Demonstrate our commitment to improving infrastructure resilience.

We will keep you informed about these actions. Thank you for your patience and trust.

The Clever Cloud Team

Detailed timeline:

EDIT 2:11pm (UTC): One of the Paris datacenters has experienced an electricity issue. Some hypervisors have been rebooted.

EDIT 2:21pm (UTC): Applications are being restarted, the situation is stabilizing.

EDIT 2:28pm (UTC): Monitoring is up and generating new statuses

EDIT 2:34pm (UTC): Recovery process is still ongoing. Customers can open a ticket through their email endpoint in addition to the Web Console.

EDIT 2:43pm (UTC): Infrastructure is under high load. We're accelerating the recovery process with load sanitization

EDIT 2:48pm (UTC): All load balancers are now back in sync. Per-service availability:

  • Load Balancing (dedicated instances): OK
  • Cellar Storage: OK
  • Orchestration: OK, but recovering a large number of runtime instances.
  • API: OK
  • Metrics API: KO (network topology split)
  • Databases: Globally OK (individual situations being worked on)
  • Monitoring: OK (in sync for the past couple of minutes)
  • Infrastructure overall load: high
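A per-service status table like the one above is typically produced by probing each service's health endpoint and aggregating the results. A minimal sketch follows; the service names and endpoint URLs are placeholders, not Clever Cloud's real infrastructure:

```python
from urllib.request import urlopen
from urllib.error import URLError

# Placeholder health-check endpoints (illustrative only).
SERVICES = {
    "api": "https://api.example.com/health",
    "metrics": "https://metrics.example.com/health",
}

def check(url, timeout=3):
    """Return 'OK' if the endpoint answers 2xx within the timeout, else 'KO'."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return "OK" if 200 <= resp.status < 300 else "KO"
    except (URLError, OSError):
        return "KO"

def status_table(services=SERVICES):
    """Probe every service and map its name to 'OK' or 'KO'."""
    return {name: check(url) for name, url in services.items()}
```

A short timeout matters here: during an incident, a hung endpoint should be reported as KO quickly rather than blocking the whole status sweep.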

EDIT 3:00pm (UTC): Deployments are done but still slow. Clever Cloud API is being restarted.

EDIT 3:05pm (UTC): Infrastructure load sanitization: 30%

EDIT 3:40pm (UTC): Overall situation is better. Still a few thousand VMs in the recovery queue.

EDIT 3:48pm (UTC): Most databases should be available. We're experiencing an additional delay with some encrypted databases.

EDIT 3:53pm (UTC): Infrastructure load sanitization: 100%

EDIT 4:15pm (UTC): Remaining databases are recovered. Full recovery is expected within the next 10-15 minutes.

EDIT 4:30pm (UTC): All applications should run fine from an orchestration point of view. Orchestration monitoring is partially up and running (stuck apps will be unstuck shortly).

EDIT 5:10pm (UTC): All stuck applications should now be available (monitoring point of view).

EDIT 5:16pm (UTC): All databases are available

EDIT 5:44pm (UTC): Incident is considered closed (a few remaining cases are being handled directly with customers).

Sunday 2nd March 2025

No incidents reported