# Clever Cloud Incident – Explanations and Lessons Learned
Today we experienced an incident affecting our infrastructure. Here is a summary of the causes, ongoing analysis, and planned actions to strengthen our resilience.
Incident Timeline:
- Power Outage: A power failure reduced our computing capacity by one-third, highlighting the need to expand our infrastructure to five datacenters to better absorb such incidents.
- Network Issue: A network outage related to BGP announcements followed, revealing an underlying issue that requires further investigation to prevent recurrence (an illustrative monitoring sketch follows this list).
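For readers wondering what the investigation around BGP announcements can involve, here is a minimal, hypothetical watcher that alerts when a BGP session leaves the Established state. It assumes an FRR-based router where `vtysh -c "show ip bgp summary json"` is available and that the output follows FRR's usual JSON layout (`ipv4Unicast.peers.<ip>.state`); none of this reflects Clever Cloud's actual tooling.

```python
import json
import subprocess
import time

def bgp_peer_states():
    """Return {peer_ip: session_state} from a local FRR instance.

    Assumes FRR's `show ip bgp summary json` output shape; adjust the
    JSON path if your FRR version differs.
    """
    out = subprocess.run(
        ["vtysh", "-c", "show ip bgp summary json"],
        capture_output=True, text=True, check=True,
    ).stdout
    peers = json.loads(out).get("ipv4Unicast", {}).get("peers", {})
    return {ip: p.get("state", "Unknown") for ip, p in peers.items()}

def watch(interval=10):
    """Poll peer states and flag sessions that drop out of Established."""
    previous = {}
    while True:
        current = bgp_peer_states()
        for ip, state in current.items():
            if state != "Established" and previous.get(ip) == "Established":
                print(f"ALERT: BGP session to {ip} left Established ({state})")
        previous = current
        time.sleep(interval)
```

A watcher like this only detects withdrawn sessions; the actual root-cause analysis of why announcements were affected is part of the post-mortem work mentioned below.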
Why Did Recovery Take Time?
- Machine Reconnection: The corrective measure to prevent overload during VM reconnection to Pulsar was not fully effective. An in-depth analysis is underway to improve this process (a sketch of what such a measure typically looks like follows this list).
- Orchestration Evolution: Our current system is reaching its limits. We are working on a new orchestration architecture to better manage recovery and optimize performance.
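As a rough illustration of what a measure "to prevent overload during reconnection" usually looks like (a minimal sketch, not Clever Cloud's actual implementation): cap how many clients may reconnect at once, and spread retries with jittered exponential backoff. The `connect` callable, the cap of 50, and the delay values below are all assumptions made for the example.

```python
import random
import threading
import time

# Hypothetical reconnection throttle: a global semaphore caps how many VMs
# may (re)establish their messaging session at once, and each attempt uses
# exponential backoff with full jitter so retries do not arrive in lockstep.
MAX_CONCURRENT_RECONNECTS = 50  # assumed cap, not a Clever Cloud value
_reconnect_slots = threading.Semaphore(MAX_CONCURRENT_RECONNECTS)

def reconnect_with_backoff(connect, max_attempts=8, base_delay=1.0, cap=60.0):
    """Call `connect()` (a placeholder for the real session setup) under the
    global concurrency cap, retrying with jittered backoff on failure."""
    for attempt in range(max_attempts):
        with _reconnect_slots:
            try:
                return connect()
            except ConnectionError:
                pass
        # Sleep a random amount up to an exponentially growing ceiling,
        # bounded by `cap`, before trying again.
        time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

The point of the jitter and the cap is to turn thousands of simultaneous reconnections into a stream the broker can absorb, instead of a single spike.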
Next Steps:
We will publish a detailed post-mortem and schedule meetings with customers to:
- Analyze the incident,
- Explain upcoming changes,
- Demonstrate our commitment to improving infrastructure resilience.
We will keep you informed about these actions. Thank you for your patience and trust.
The Clever Cloud Team
Detailed timeline:
EDIT 2:11pm (UTC): One of the Paris datacenters has experienced an electricity issue. Some hypervisors have been rebooted.
EDIT 2:21pm (UTC): Applications are being restarted, the situation is stabilizing.
EDIT 2:28pm (UTC): Monitoring is up and generating new statuses.
EDIT 2:34pm (UTC): Recovery process is still ongoing. Customers can open a ticket through their email endpoint, in addition to the Web Console.
EDIT 2:43pm (UTC): Infrastructure is under high load. We're accelerating the recovery process with load sanitization.
EDIT 2:48pm (UTC): All load balancers are now back in sync. Per-service availability:
- Load Balancing (dedicated instances): OK
- Cellar Storage: OK
- Orchestration: OK, but recovering a lot of runtime instances.
- API: OK
- Metrics API: KO (network topology split)
- Databases: Globally OK (individual situations being worked on)
- Monitoring: OK (back in sync for the past couple of minutes)
- Infrastructure overall load: high
EDIT 3:00pm (UTC): Deployments are going through, but still slowly. The Clever Cloud API is being restarted.
EDIT 3:05pm (UTC): Infrastructure load sanitization: 30%
EDIT 3:40pm (UTC): Overall situation is better. Still a few thousand VMs in the recovery queue.
EDIT 3:48pm (UTC): Most databases should be available. We're experiencing an additional delay with some encrypted databases.
EDIT 3:53pm (UTC): Infrastructure load sanitization: 100%
EDIT 4:15pm (UTC): Remaining databases are recovered. Total fix is expected within the next 10-15 minutes.
EDIT 4:30pm (UTC): All applications should run fine (from the orchestration point of view). Orchestration monitoring is partially up and running (stuck apps will be quickly unstuck).
EDIT 5:10pm (UTC): All stuck applications should now be available (from the monitoring point of view).
EDIT 5:16pm (UTC): All databases are available.
EDIT 5:44pm (UTC): The incident is considered closed (a few remaining cases are still being handled with customers).