
Past Incidents

Tuesday 26th March 2024

Metrics [GLOBAL] Metrics query unavailable

The metrics query is currently unavailable because some indexing shards are offline. We are working to bring them back up as quickly as possible. There is no impact on the ingestion pipeline or the storage layer.

EDIT 13:30 UTC: Indexing components are back online and the query is available again.

Monday 25th March 2024

Metrics [Metrics] Query instability

A cleanup process has triggered some durability lag on our storage layer. You may experience query instability.

Mon Mar 25 20:32:34 2024 UTC: All components are back to normal.

Sunday 24th March 2024

No incidents reported

Saturday 23rd March 2024

No incidents reported

Friday 22nd March 2024

No incidents reported

Thursday 21st March 2024

Services Logs Log drains are down

(times in UTC)

Around 21:00, part of the log drains stack broke in a way that our monitoring did not detect right away, and pending messages started to fill up the disk of the underlying RabbitMQ. At 21:37, we were alerted by the lack of space on RabbitMQ and started investigating around 22:10. At 22:57, the log drain stack was back up. However, to recover RabbitMQ, we had to drop the pending queues. Our logs are still collected in our new logs infrastructure, but all drains lost the logs emitted between 21:00 and 22:57.
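
As a side note for teams running a similar drain buffer: a growing broker backlog is usually visible well before it fills the disk. Below is a minimal monitoring sketch, assuming the RabbitMQ management plugin is enabled; the host, credentials and threshold are hypothetical placeholders, not our actual setup.

    import requests

    # Hypothetical management API endpoint and credentials.
    RABBITMQ_API = "http://rabbitmq.internal:15672/api/queues"
    AUTH = ("monitor", "secret")
    BACKLOG_THRESHOLD = 100_000  # messages; tune to your disk headroom

    def check_queue_backlog():
        """Alert when any queue's backlog grows past the threshold."""
        resp = requests.get(RABBITMQ_API, auth=AUTH, timeout=10)
        resp.raise_for_status()
        for queue in resp.json():
            messages = queue.get("messages", 0)
            if messages > BACKLOG_THRESHOLD:
                print(f"ALERT: queue {queue['name']} has {messages} pending messages")

    if __name__ == "__main__":
        check_queue_backlog()

Run on a schedule, an alert like this fires on backlog growth rather than on the disk itself, which gives more time to react.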

Cellar North: Request slowness

We are currently investigating request slowness on the Cellar North service.

EDIT 15:52 UTC: The issue has been identified and is being worked on. Timeouts have been sporadic since 15:38 UTC, but some may still appear.

EDIT 17:30 UTC: The service has been stable for the past hour; we will continue to monitor it over the next few hours.
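
Since Cellar is S3-compatible, clients hit by sporadic timeouts can usually ride them out with automatic retries. A minimal sketch using boto3 follows; the endpoint is shown as an example and credentials are assumed to come from the environment.

    import boto3
    from botocore.config import Config

    # Retry transient timeouts with backoff.
    retry_config = Config(
        connect_timeout=5,
        read_timeout=30,
        retries={"max_attempts": 5, "mode": "standard"},
    )

    s3 = boto3.client(
        "s3",
        endpoint_url="https://cellar-c2.services.clever-cloud.com",  # example endpoint
        config=retry_config,
    )

    # Calls now retry automatically on timeouts and transient errors.
    print(s3.list_buckets()["Buckets"])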

[DEV] MTL cluster unavailable

The MySQL dev add-on cluster was unreachable. This should now be fixed.

Wednesday 20th March 2024

Reverse Proxies [Global] Database load balancers maintenance

Scope:

  • Database Load Balancer (configuration update)

Expected Impact:

  • Brief disconnections or connection drops during the update process; see the reconnect sketch below.
  • Potential minor performance fluctuations.

Additional Information:

  • We will deploy a patch on the load balancer that reduces memory consumption and enables more telemetry.
  • Please report any issues, along with a method for reproducing the problem.
  • This maintenance is a direct follow-up to the incident https://www.clevercloudstatus.com/incident/826 and propagates the patch from it.
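
Applications can usually ride out brief disconnections like these by reconnecting with backoff. A minimal sketch in Python using PyMySQL; the connection parameters are hypothetical, not the add-on's actual settings.

    import time
    import pymysql

    def connect_with_retry(max_attempts=5, base_delay=1.0):
        """Reconnect with exponential backoff to ride out brief drops."""
        for attempt in range(max_attempts):
            try:
                return pymysql.connect(
                    host="db-lb.example.internal",  # hypothetical LB address
                    user="app",
                    password="secret",
                    database="app",
                    connect_timeout=5,
                )
            except pymysql.err.OperationalError:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)

    conn = connect_with_retry()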

EDIT 16:25 UTC: We have patched the RBX, RBXHDS and MTL regions.

EDIT 16:25 UTC: We are rolling out the patch on the PAR region.

EDIT 16:45 UTC: We have patched the PAR region; we are starting on the WSW region.

EDIT 16:55 UTC: We have patched the WSW region; we are starting on the SYD region.

EDIT 17:05 UTC: We have patched the SYD region; we are starting on the GRAHDS region.

EDIT 17:05 UTC: We have patched the GRAHDS region; we are starting on the SCW region.

EDIT 17:15 UTC: We have patched the SCW region; we are starting on the SGP region.

EDIT 17:35 UTC: We have patched the SGP region as well. The maintenance is over.