Some systems are experiencing issues

Past Incidents

Friday 2nd August 2024

Infrastructure Global outage

We are experiencing a global outage. We observed a network split in addition to an event bus outage. Some core services have been significantly impacted.

EDITS:

  • 2:00 PM CEST - Core services are being recovered and deployments are being reloaded. This will resynchronize the load balancers so that traffic to customers' applications reaches their new deployments.
  • 2:08 PM CEST - Some services are being shut down to accelerate the recovery process. Expect a degraded experience for observability and deployments for a few minutes.
  • 2:29 PM CEST - Critical core services are OK. Deployments are being rolled out.
  • 3:07 PM CEST - Some workload queues are still having difficulty being processed. Some components may still be in an unstable state. The current effort is to identify them, then reload them.
  • 3:40 PM CEST - Some hypervisors have crashed. Recovery is in progress and will take a couple of minutes.
  • 3:56 PM CEST - Some hypervisors still seem to be experiencing network issues.
  • 4:16 PM CEST - Apps are being deployed for premium customers. All apps are going to be deployed. Customers can accelerate the process for their own applications by manually deploying them.
  • 4:24 PM CEST - In the meantime, we continue to identify noisy VMs that have been impacted by the outage.
  • 5:15 PM CEST - Metrics API is being restarted.
  • 6:20 PM CEST - The last deployments are being rolled out. Reminder: you can accelerate the process by triggering a redeploy action.
  • 6:30 PM CEST - A few hundred VMs are still consuming very high CPU and are being cleaned up.
  • 6:35 PM CEST - We estimate approximately 40 minutes to fully recover all application deployments (MANUALLY REDEPLOY FOR FASTER RECOVERY).
  • 7:05 PM CEST - All IPSec links should be back online

Access Logs Access log ingestion and processing unavailable

Following https://www.clevercloudstatus.com/incident/877, we are having difficulty processing access logs. You may observe gaps and delays.

Deployments Deployment failures are observed in PAR

Following https://www.clevercloudstatus.com/incident/877, some deployments are failing. We are currently working on a solution.

EDIT: 10:31 UTC - A workaround has been found to ensure that deployments work again.

Pulsar Pulsar connection issues

Connection issues (producers/consumers) during cluster upgrade.
This can cause application redeployments to fail.

Thursday 1st August 2024

PostgreSQL [PostgreSQL] Trouble creating DEV add-ons

Ordering DEV add-ons is currently locked. No impact on existing add-ons.

We are investigating.

[EDIT 12:00 CEST]: we have identified and fixed the lock.

Cellar CEPH-NORTH-HDS: rebalance in progress

Because of the hardware issue described in https://www.clevercloudstatus.com/incident/874, we need to rebalance data on Cellar North. Customers may experience higher latency than usual.

Infrastructure GRA-HDS: Hypervisor unreachable

A hypervisor in the GRA-HDS region is unreachable. We are working on it.

EDIT Thu Aug 01 09:13:09 2024 UTC: the hypervisor has been rebooted. A hardware issue has been detected. All applications have been redeployed and there were no customer databases on the hypervisor.

Wednesday 31st July 2024

No incidents reported

Tuesday 30th July 2024

No incidents reported

Monday 29th July 2024

No incidents reported

Sunday 28th July 2024

No incidents reported

Saturday 27th July 2024

No incidents reported