Some systems are experiencing issues

Past Incidents

Thursday 8th August 2024

No incidents reported

Wednesday 7th August 2024

No incidents reported

Tuesday 6th August 2024

No incidents reported

Monday 5th August 2024

No incidents reported

Sunday 4th August 2024

No incidents reported

Saturday 3rd August 2024

OVH regions are impacted by the provider's network backbone issues

We may be impacted by https://network.status-ovhcloud.com/incidents/nnhpfdw50vsn, which we are investigating. Only services hosted in OVH regions are affected.

Update 13:33 UTC: we are indeed impacted by OVHcloud's Backbone incident. Some network routes cannot reach OVHcloud's datacenters. We are working on it. More info can be found on https://x.com/olesovhcom/status/1819742478586528146

Update 14:42 UTC: the network seems more reliable now. We are still monitoring the network links.

Update 15:16 UTC: according to OVHcloud, services are becoming operational again, and we are no longer seeing network issues.

Customer support Premium on-call phone line is unreliable

The on-call phone number for some customers is affected by a problem in our telecommunications provider's subsystems. The phone rings, but it is then impossible to talk. Customers with a problem should email [email protected] directly, and they will be called back.

Friday 2nd August 2024

Infrastructure Global outage

We are experiencing a global outage. We observed a network split in addition to an event bus outage. Some core services have been impacted.

EDITS:

  • 2:00 PM CEST - Core services are being recovered and deployments are being reloaded. This will resynchronize load balancers for customers' applications trying to reach their new deployments.
  • 2:08 PM CEST - Some services are being shut down to accelerate the recovery process. Expect a disturbed experience for observability and deployments for a few minutes.
  • 2:29 PM CEST - Critical core services are OK. Deployments are being rolled out.
  • 3:07 PM CEST - Some workload queues are still having difficulty being processed. Some components may still be in an unstable state. The current effort is to identify them, then reload them.
  • 3:40 PM CEST - Some hypervisors have crashed. The recovery process is under way and will take a couple of minutes.
  • 3:56 PM CEST - Some hypervisors still seem to be experiencing network issues.
  • 4:16 PM CEST - Apps are being deployed for premium customers. All apps are going to be deployed. Anyone can speed up the process for their own applications by manually deploying them.
  • 4:24 PM CEST - In the meantime, we continue to identify noisy VMs that have been impacted by the outage
  • 5:15 PM CEST - Metrics API is being restarted.
  • 6:20 PM CEST - The last deployments are being rolled out. Reminder: you can speed this up by triggering a redeploy action.
  • 6:30 PM CEST - A few hundred VMs are still consuming very high CPU and are being cleaned up.
  • 6:35 PM CEST - We estimate approximately 40 minutes to fully recover all application deployments (MANUALLY REDEPLOY FOR FASTER RECOVERY)
  • 7:05 PM CEST - All IPsec links should be back online.

Access Logs Access logs ingestion and processing unavailable

Following https://www.clevercloudstatus.com/incident/877, we are having difficulty processing access logs. You may observe gaps and delays.

Deployments Deployment failures are observed in PAR

Following https://www.clevercloudstatus.com/incident/877, some deployments are failing. We are currently working on a solution.

EDIT: 10:31 UTC - A workaround has been found to ensure that deployments work again.

Pulsar Pulsar connection issues

Connection issues (producers/consumers) during the cluster upgrade.
These can cause app redeployments to fail.