Some systems are experiencing issues

Past Incidents

Saturday 8th March 2025

No incidents reported

Friday 7th March 2025

Infrastructure [MEA] A hypervisor is rebooting

A hypervisor in the MEA region is rebooting; we are working to restart all services.

EDIT 16:44 CET: The hypervisor has rebooted and all services are available again.

Thursday 6th March 2025

No incidents reported

Wednesday 5th March 2025

No incidents reported

Tuesday 4th March 2025

Infrastructure [Gouv] Network reachability issue

We are experiencing a network reachability issue on the Gouv region. We are looking into it.

EDIT 13:22 CET: Our infrastructure provider acknowledged the incident and is working on it.

EDIT 13:46 CET: Our infrastructure provider continues to investigate the issue.

EDIT 13:52 CET: The whole region is impacted; no service hosted there can be reached.

EDIT 14:17 CET: A fix has been implemented on our infrastructure provider's side. We have had access to the infrastructure again since 14:08 and are making sure all services are restarted.

EDIT 14:32 CET: All services have been restarted; we are monitoring for any further issues.

EDIT 14:40 CET: All KMS nodes are fully operational

Monday 3rd March 2025

Infrastructure [PAR] Power supply failure on region EU-FR-1 (Paris)

Clever Cloud Incident – Explanations and Lessons Learned

Today we experienced an incident affecting our infrastructure. Here is a summary of the causes, ongoing analysis, and planned actions to strengthen our resilience.

Incident Timeline:

  1. Power Outage: A power failure reduced our computing capacity by one-third, highlighting the need to expand our infrastructure to five datacenters to better absorb such incidents.
  2. Network Issue: A network outage related to BGP announcements followed, revealing an underlying issue that requires further investigation to prevent recurrence.

Why Did Recovery Take Time?

  1. Machine Reconnection: The corrective measure to prevent overload during VM reconnection to Pulsar was not fully effective. An in-depth analysis is underway to improve this process.
  2. Orchestration Evolution: Our current system is reaching its limits. We are working on a new orchestration architecture to better manage recovery and optimize performance.
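The reconnection problem described in point 1 (a fleet of VMs all reconnecting to Pulsar at once and overloading it) is commonly mitigated with randomized exponential backoff. The sketch below is a generic illustration of that technique, not Clever Cloud's actual corrective measure; the `connect` callable and the parameter values are hypothetical:

```python
import random
import time

def reconnect_with_backoff(connect, base=1.0, cap=60.0, max_attempts=8):
    """Retry `connect` with full-jitter exponential backoff.

    Spreading retries out randomly prevents machines that all lost
    their broker at the same instant from reconnecting in lockstep
    and overloading it again (a "thundering herd").
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Full jitter: sleep a random time in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

With full jitter, the spread of retry times grows with each failed attempt, so load on the recovering broker ramps up gradually instead of arriving in synchronized waves.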

Next Steps:

We will publish a detailed post-mortem and schedule meetings with customers to:

  • Analyze the incident,
  • Explain upcoming changes,
  • Demonstrate our commitment to improving infrastructure resilience.

We will keep you informed about these actions. Thank you for your patience and trust.

The Clever Cloud Team

Detailed timeline:

EDIT 2:11pm (UTC): One of the Paris datacenters has experienced an electricity issue. Some hypervisors have been rebooted.

EDIT 2:21pm (UTC): Applications are being restarted, the situation is stabilizing.

EDIT 2:28pm (UTC): Monitoring is up and generating new statuses

EDIT 2:34pm (UTC): Recovery process is still ongoing. Customers can open a ticket through their email endpoint in addition to the Web Console.

EDIT 2:43pm (UTC): Infrastructure is under high load. We're accelerating the recovery process with load sanitization

EDIT 2:48pm (UTC): All load balancers are now back in sync. Per-service availability:

  • Load Balancing (dedicated instances): OK
  • Cellar Storage: OK
  • Orchestration: OK, but recovering a large number of runtime instances.
  • API: OK
  • Metrics API: KO (network topology split)
  • Databases: Globally OK (individual situations being worked on)
  • Monitoring: OK (in sync for the past couple of minutes)
  • Infrastructure overall load: high
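A per-service status table like the one above is typically produced by probing each service's health endpoint and aggregating the results. A minimal sketch follows; the service names and endpoint URLs are placeholders, not Clever Cloud's real infrastructure:

```python
from urllib.request import urlopen
from urllib.error import URLError

# Placeholder health-check endpoints (illustrative only).
SERVICES = {
    "api": "https://api.example.com/health",
    "metrics": "https://metrics.example.com/health",
}

def check(url, timeout=3):
    """Return 'OK' if the endpoint answers 2xx within the timeout, else 'KO'."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return "OK" if 200 <= resp.status < 300 else "KO"
    except (URLError, OSError):
        return "KO"

def status_table(services=SERVICES):
    """Probe every service and map its name to 'OK' or 'KO'."""
    return {name: check(url) for name, url in services.items()}
```

A short timeout matters here: during an incident, a hung endpoint should be reported as KO quickly rather than blocking the whole status sweep.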

EDIT 3:00pm (UTC): Deployments are done but still slow. Clever Cloud API is being restarted.

EDIT 3:05pm (UTC): Infrastructure load sanitization: 30%

EDIT 3:40pm (UTC): Overall situation is better. Still a few thousand VMs in the recovery queue.

EDIT 3:48pm (UTC): Most databases should be available. We're experiencing an additional delay with some encrypted databases.

EDIT 3:53pm (UTC): Infrastructure load sanitization: 100%

EDIT 4:15pm (UTC): Remaining databases are recovered. Full recovery is expected within the next 10-15 minutes.

EDIT 4:30pm (UTC): All applications should run fine from an orchestration point of view. Orchestration monitoring is partially up and running (stuck apps will be unstuck shortly).

EDIT 5:10pm (UTC): All stuck applications should now be available (monitoring point of view).

EDIT 5:16pm (UTC): All databases are available

EDIT 5:44pm (UTC): Incident is considered closed (a few remaining cases are being handled directly with customers).

Sunday 2nd March 2025

No incidents reported