We are experiencing a global outage. We observed a network split in addition to an event bus outage. The impact has been significant for some core services.
EDITS:
2:00 PM CEST - Core services are being recovered and deployments are being reloaded. This will resynchronize the load balancers for customers' applications so they reach their new deployments.
2:08 PM CEST - Some services are being shut down to accelerate the recovery process. Expect a degraded experience for observability and deployments for a few minutes.
2:29 PM CEST - Critical core services are OK. Deployments are being rolled out.
3:07 PM CEST - Some workload queues are still difficult to process. Some components may still be in an unstable state. The current effort is to identify them, then reload them.
3:40 PM CEST - Some hypervisors have experienced crashes. The recovery process is under way and will take a couple of minutes.
3:56 PM CEST - Some hypervisors still seem to be experiencing network issues.
4:16 PM CEST - Apps are being deployed for premium customers. All apps will be deployed. Anyone can accelerate the process for their own applications by manually redeploying them.
4:24 PM CEST - In the meantime, we continue to identify noisy VMs that have been impacted by the outage.
5:15 PM CEST - Metrics API is being restarted.
6:20 PM CEST - The last deployments are being rolled out. Reminder: you can accelerate recovery by triggering a redeploy action.
6:30 PM CEST - A few hundred VMs are still consuming very high CPU and are being cleaned up.
6:35 PM CEST - We estimate approximately 40 minutes until all application deployments are fully recovered (MANUALLY REDEPLOY FOR FASTER RECOVERY; see the sketch after this timeline).
7:05 PM CEST - All IPsec links should be back online.
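For customers who want to trigger that redeploy themselves, here is a minimal sketch. It assumes the clever-tools CLI is installed and the application is already linked in the working directory; the "clever restart" command and its exact behavior should be checked against your installed clever-tools version, and the script name is purely illustrative.

    # redeploy.py - hypothetical helper: trigger a redeploy of the linked app
    import subprocess

    def redeploy_linked_app() -> None:
        # "clever restart" asks Clever Cloud to restart (redeploy) the
        # application linked in the current directory; check=True raises
        # CalledProcessError if the CLI reports a failure.
        subprocess.run(["clever", "restart"], check=True)

    if __name__ == "__main__":
        redeploy_linked_app()

The same action can also be triggered from the Clever Cloud console.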
Access Logs - Access logs ingestion and processing unavailable
Following https://www.clevercloudstatus.com/incident/877, we are having difficulties processing access logs; you may observe gaps and lags.
Deployments - Deployment failures are observed in PAR
Following https://www.clevercloudstatus.com/incident/877, some deployments are failing.
We are currently working on a solution.
EDIT: 10:31 UTC - A workaround has been found to ensure that deployments work again.
Pulsar - Pulsar connection issues
Connection issues (producers/consumers) during a cluster upgrade.
This can lead to failures in app redeployment.
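If your application connects to Pulsar directly, a generic client-side mitigation for transient connection errors is to retry with backoff. The sketch below is a hypothetical example using the Python pulsar-client library; the service URL and topic are placeholders, and it omits the authentication settings your add-on would normally require.

    # Hypothetical retry-on-transient-failure sketch (pip install pulsar-client).
    import time
    import pulsar

    SERVICE_URL = "pulsar+ssl://<your-pulsar-endpoint>:6651"   # placeholder
    TOPIC = "persistent://<tenant>/<namespace>/<topic>"        # placeholder

    def send_with_retry(payload: bytes, attempts: int = 5) -> None:
        # Retry transient connection failures (e.g. during a broker upgrade)
        # with a simple exponential backoff between attempts.
        for attempt in range(attempts):
            client = pulsar.Client(SERVICE_URL)
            try:
                producer = client.create_producer(TOPIC)
                producer.send(payload)
                return
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(2 ** attempt)  # back off before retrying
            finally:
                client.close()

    if __name__ == "__main__":
        send_with_retry(b"hello")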
Thursday 1st August 2024
PostgreSQL - Trouble creating DEV add-ons
Ordering a DEV add-on is currently locked.
No impact on existing add-ons.
We are investigating.
[EDIT 12:00 CEST]: we have identified and fixed the lock.
Cellar - CEPH-NORTH-HDS: rebalance in progress
Because of the hardware issue described in https://www.clevercloudstatus.com/incident/874, we need to rebalance data on Cellar North. Customers may experience higher latency than usual.
Infrastructure - GRA-HDS: Hypervisor unreachable
A hypervisor in the GRA-HDS region is unreachable. We are working on it.
EDIT Thu Aug 01 09:13:09 2024 UTC: the hypervisor has been rebooted. A hardware issue has been detected. All applications have been redeployed, and there were no customer databases on the hypervisor.