Some systems are experiencing issues

Past Incidents

Wednesday 29th November 2023

API API seems to be slow

Our main API is responding slowly. We are investigating to find out why.

EDIT 19:00 UTC: The issue has been resolved.

[Paris] Datacenter update, scheduled 9 months ago

We are planning various updates on one of our datacenters in the Paris region, starting at 13:30 UTC. The maintenance will last for a few hours. No impact is expected during this window.

We will update this status accordingly.

EDIT 17:30 UTC: All updates are now over. Operations went smoothly and no impact was detected.

Tuesday 28th November 2023

No incidents reported

Monday 27th November 2023

Reverse Proxies TCP redirections unavailable

As part of our efforts to fix the issues listed in this status, we fully moved our traffic from the old LBs running sōzu 0.13 to the new LBs running sōzu 0.15 at 13:30 UTC.

While performing the move, a network configuration issue arose, impacting only customers using TCP redirections in the PAR region.

As the team was focused on monitoring and fine-tuning the configuration of the new LBs, it did not notice the error reports until 14:30 UTC. To prevent such an incident in the future, we have since improved our monitoring and alert tools for TCP redirections.

The issue was fixed by 14:55 UTC.
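
For reference, the kind of connectivity probe such an alert can be built on is easy to sketch. The snippet below is a minimal illustration in Python, not our actual tooling; the hostname and port are placeholders for an application exposed through a TCP redirection.

import socket
import sys

# Placeholder endpoint for an app exposed through a TCP redirection in PAR.
HOST = "example-app.par.example.invalid"
PORT = 4040
TIMEOUT_SECONDS = 5.0

def tcp_redirection_is_up(host: str, port: int, timeout: float) -> bool:
    """Return True if a plain TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    ok = tcp_redirection_is_up(HOST, PORT, TIMEOUT_SECONDS)
    print(f"{HOST}:{PORT} {'reachable' if ok else 'UNREACHABLE'}")
    sys.exit(0 if ok else 1)

Run periodically, a non-zero exit code from such a probe is the kind of signal an alerting rule can key on.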

Trouble accessing add-on Metrics

We detected an issue affecting both access to and ingestion of Clever Cloud application and add-on metrics.

We are investigating.

EDIT 27 Nov 2023 11:02:23: Querying is now functional. We are also observing an issue with metrics from add-ons and are working on it.

EDIT 27 Nov 2023 18:00: A regression in token regeneration has been fixed, and all tokens have been updated.

[Paris] Datacenter update, scheduled 9 months ago

We are planning various updates on one of our datacenters in the Paris region, starting at 14:00 UTC. The maintenance will last for a few hours. No impact is expected during this window.

We will update this status accordingly.

EDIT 23:15 UTC: All updates are now over. Operations went smoothly and no impact was detected.

Sunday 26th November 2023

No incidents reported

Saturday 25th November 2023

No incidents reported

Friday 24th November 2023

Cellar [PAR] Cellar C2 is unreachable

During the deployment of a maintenance update to solve https://www.clevercloudstatus.com/incident/767, we applied a patch that put Cellar into an unreachable state. We are currently rolling back the update.

EDIT 17:00 UTC: Cellar is available.

Thursday 23rd November 2023

Reverse Proxies Load balancer issues and stability: summary and our actions

Over the past few days, our platform encountered several glitches in connection handling, with some of our customers experiencing slowdowns in certain services. Here are the results of our investigations and the actions taken by our teams:

Update 2023-12-01 18:00 UTC

Cellar

After running more tests, we discovered performance issues on long-distance connections, possibly caused by HTTP/2, which we activated on Cellar a few weeks ago. Our analyses confirmed that uploading data to Cellar over HTTP/2 in such conditions could heavily limit throughput, whereas HTTP/1.1 gave us better and more consistent results. The improvements for customers affected by the identified problems far outweigh the benefits of HTTP/2 seen in a few cases, so we are disabling HTTP/2 and monitoring throughput to confirm this on a larger scale.
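
To give an idea of the comparison involved, here is a minimal sketch of measuring upload throughput over HTTP/1.1 versus HTTP/2. It is not the benchmark we ran internally: it assumes the httpx library (with its HTTP/2 extra) and an upload URL that already authorizes PUTs, such as a presigned Cellar URL; the URL below is a placeholder.

import time

import httpx

# Placeholder: a URL that already authorizes PUT requests (e.g. a presigned
# Cellar URL). Credentials and request signing are deliberately out of scope.
UPLOAD_URL = "https://cellar.example.invalid/bucket/throughput-test"
PAYLOAD = b"\0" * (64 * 1024 * 1024)  # 64 MiB of dummy data

def upload_throughput(use_http2: bool) -> float:
    """Upload PAYLOAD once and return the observed throughput in MiB/s."""
    with httpx.Client(http1=not use_http2, http2=use_http2, timeout=120.0) as client:
        start = time.perf_counter()
        response = client.put(UPLOAD_URL, content=PAYLOAD)
        elapsed = time.perf_counter() - start
    response.raise_for_status()
    return len(PAYLOAD) / (1024 * 1024) / elapsed

if __name__ == "__main__":
    print(f"HTTP/1.1: {upload_throughput(use_http2=False):.1f} MiB/s")
    print(f"HTTP/2:   {upload_throughput(use_http2=True):.1f} MiB/s")

On high-latency links, a consistently lower HTTP/2 figure is the kind of result described above.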

Update 2023-11-28 13:30 UTC

Load balancers 🥁

We will begin adding the new load balancer instances deployed yesterday to the load balancer pool starting at 14:00 UTC. The new load balancer IP addresses that will be added alongside the current ones are listed below; a quick way to check which addresses a domain resolves to is sketched after the list:

  • 91.208.207.214
  • 91.208.207.215
  • 91.208.207.216
  • 91.208.207.217
  • 91.208.207.218
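
As a quick sanity check on the list above, the sketch below shows one way to see which addresses a domain currently resolves to and whether any of the new load balancer IPs already appear in the answer. It is only an illustration; the domain is a placeholder.

import socket

# The new load balancer addresses listed above.
NEW_LB_IPS = {
    "91.208.207.214",
    "91.208.207.215",
    "91.208.207.216",
    "91.208.207.217",
    "91.208.207.218",
}

def resolved_ips(hostname: str) -> set[str]:
    """Return the set of addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}

if __name__ == "__main__":
    domain = "example-app.example.invalid"  # placeholder for your own domain
    ips = resolved_ips(domain)
    print(f"{domain} resolves to: {sorted(ips)}")
    print(f"new LB addresses in the answer: {sorted(ips & NEW_LB_IPS) or 'none yet'}")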

EDIT 15:30 UTC: Our monitoring detected an increasing number of 404 response status codes. We rolled back the modification and investigated the issue: internal IP addresses were overlapping with those of the Cellar load balancer, which is now fixed.

EDIT 15:45 UTC: After further investigation, we were able to resume the maintenance.

EDIT 18:05 UTC: We have finished deploying the new instances.

Update 2023-11-27 17:20 UTC

Load balancers 🥁

We have installed new load balancers. We will review and test them tonight and add them to the LB pool tomorrow morning (2023-11-28).

UPDATE 2023-11-27 15:30 UTC

Load balancers 👀

We are still seeing a few random SSL errors. We are investigating; the culprit may be a lack of allocated resources, and we are following this lead.

… We have fine-tuned the load balancers, which temporarily caused more SSL errors for about a minute. Traffic seems to be better now.

UPDATE 2023-11-27 14:00 UTC

Load balancers ❌

We are experiencing new errors on the load balancers: customers report PR_END_OF_FILE_ERROR errors in their browsers when connecting to their apps, and SSL_ERROR_SYSCALL from curl. We are able to reproduce these errors. They look like the incident from the morning of Friday the 24th. We are looking for the configuration mishap that may have escaped our review.

✅ It's fixed. We had started writing a monitoring script for this kind of configuration error; we will speed up its writing and deployment to production.
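
To illustrate what such a check can look like, here is a minimal sketch that attempts a full TLS handshake against a list of frontends and reports failures of the kind described above. The hostnames are placeholders, and this is not the script we deploy in production.

import socket
import ssl

# Placeholder frontends to probe.
FRONTENDS = ["example-app.example.invalid", "example-api.example.invalid"]

def tls_handshake_ok(hostname: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake with certificate validation succeeds."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                return tls.version() is not None
    except (ssl.SSLError, OSError):
        return False

if __name__ == "__main__":
    for host in FRONTENDS:
        status = "ok" if tls_handshake_ok(host) else "HANDSHAKE FAILED"
        print(f"{host}: {status}")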

UPDATE 2023-11-27 08:30 UTC

Load balancers ✅

We've been monitoring the load balancers all weekend: the only desync was observed (and fixed right away by the on-call team) on the old sōzu versions (0.13) that are still processing 10% of Paris' public traffic. We plan to remove these old load balancers quickly this week.

We consider the desynchronization issue resolved.

Cellar 👀

Last Friday, we configured Cellar's front proxies to lower their reload rate. We haven't seen any slowness since, but the issue was already hard to reproduce on our side. No slowness on Cellar was reported during the weekend, but we are still on the lookout.

UPDATE 2023-11-24 15:00 UTC

Load balancers 🥁

After more (successful) load tests, the new version of sōzu (0.15.17) is being installed on all impacted public and private load balancers. Upgrades should be over in the next two hours.

Cellar 👀

The team continues to investigate the random slowness issues still encountered by some customers, which we are trying to reproduce in a consistent way.

UPDATE 2023-11-24 10:45 UTC

Load balancers 🥁

We've tested our new Sōzu release (0.15.17) all night with extra monitoring, and no lag or crash was detected. The only remaining issues were on the non-updated (0.13.6) instances; they were detected by our monitoring and the on-call team restarted them.

We are pretty confident that this new release solves our load balancer issues. We plan to switch all private and public Sōzu load balancers to 0.15.17 today and monitor them over the coming days.

Temporary incident:

While updating our configuration to grow the traffic share of the new (0.15.17) load balancers, a human mistake (not a newly discovered bug) broke part of the configuration, causing many SSL version errors on 15% of requests between 09:25 and 09:50 UTC.

UPDATE 2023-11-23 18:43 UTC

Certificates ✅

As planned earlier, the renewal of all certificates in RSA 2048 has been completed, except for a few wildcards (mostly ours) that require manual intervention. This will be dealt with shortly.

Load balancers 🛥️

We were able to identify the root cause of the desync/lag in Sōzu: a specific request, hitting a ‘double bug’, was causing worker crashes. We developed fixes and are confident they will solve our problems. We will test them and monitor the situation before deploying them fully in production.

Cellar 👀

We’ve upgraded our load balancer infrastructure and monitoring tools to check whether this improves the various types of problems reported to us.

Original Status (2023-11-23 14:12 UTC)

1. Key management in Sōzu and security standards

Background: Two months ago, we migrated our automatic Let’s Encrypt certificate generation from RSA 2048 keys to RSA 4096 keys. Following a major certificate renewal in early November, this led to timeouts when processing requests, and then 504 errors.

Actions:

  • On Monday November 13, we rolled back key generation to RSA 2048 for all new certificates.
  • On Monday November 20, we launched a complete key regeneration in RSA 2048, which requires an increase in our Let's Encrypt quotas (in progress).

Back to normal: Within the day, while we finish regeneration.

Next steps: We have also explored a migration to the ECDSA standard, which according to our initial tests will enable us to improve both the performance and security levels of our platform. Such a migration will be planned in the coming months, after a deeper impact analysis.
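
As a side note, checking which key type and size a given domain currently serves (RSA 2048, RSA 4096, or ECDSA) is straightforward. The sketch below uses the third-party cryptography package; the domain is a placeholder.

import ssl

from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import ec, rsa

def certificate_key_info(hostname: str, port: int = 443) -> str:
    """Fetch the server certificate and describe its public key."""
    pem = ssl.get_server_certificate((hostname, port))
    cert = x509.load_pem_x509_certificate(pem.encode("ascii"))
    key = cert.public_key()
    if isinstance(key, rsa.RSAPublicKey):
        return f"RSA {key.key_size}"
    if isinstance(key, ec.EllipticCurvePublicKey):
        return f"ECDSA ({key.curve.name})"
    return type(key).__name__

if __name__ == "__main__":
    domain = "example-app.example.invalid"  # placeholder
    print(f"{domain}: {certificate_key_info(domain)}")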

2. HTTPS performance issues

Background: We noted a significant drop in HTTPS request processing performance, with capacity reduced from 8,000 to 4,000 requests per second, due in particular to an excessive number of syscalls via rustls.

Actions: We developed a Sōzu update and pushed it on November 16.

Back to normal: The problem is now resolved.

3. Load balancers desync/lag

Background: Load balancers sometimes go out of sync: Sōzu gets stuck in TLS handshakes or requests, and the workers no longer take config updates, causing the proxy manager to freeze. The load balancers then miss all new config updates until we restart them.

Actions: We have improved our tooling to detect the root cause of the problem at a deeper level. We have been able to confirm that this concerns both Sōzu versions 0.13.x and 0.15.x.

Next steps: We'll be tracing the problem in greater depth within the day, to decide what short-term actions to take to mitigate it.

4. Random slowness on Cellar

Background: Customers are reporting slowness or timeouts on Cellar, which we are now able to identify and qualify. While the cause has not yet been fully identified, we have several ways of mitigating the problem.

Actions: add capacity to the front-end infrastructure and enhance the network configuration.