Over the past few days, our platform encountered several issues in connection handling, and some of our customers experienced slowdowns on certain services. Here are the results of our investigations and the actions taken by our teams:
Update 2023-12-01 18:00 UTC
Cellar
After running more tests, we discovered performance issues on long-distance connections, possibly caused by HTTP/2, which we activated on Cellar a few weeks ago. Our analyses confirmed that uploading data to Cellar over HTTP/2 in such conditions could heavily limit throughput, whereas HTTP/1.1 gave us better and more consistent results.
The improvements for the customers affected by these problems far outweigh the benefits HTTP/2 brought in a few cases.
We are therefore disabling HTTP/2 and monitoring throughput to confirm the improvement at a larger scale.
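For reference, here is a minimal client-side sketch of how such a comparison can be reproduced, assuming a pre-signed Cellar PUT URL (the URL below is a placeholder) and httpx installed with its HTTP/2 extra:

```python
# Minimal sketch: time one upload over HTTP/1.1 and one over HTTP/2 against
# the same pre-signed PUT URL. The URL below is a placeholder; requires
# httpx with HTTP/2 support (`pip install "httpx[http2]"`).
import time

import httpx

UPLOAD_URL = "https://example-bucket.cellar-c2.services.clever-cloud.com/test-object"
PAYLOAD = b"\0" * (64 * 1024 * 1024)  # 64 MiB of dummy data

def throughput(http2: bool) -> float:
    """Upload PAYLOAD once and return the observed throughput in MiB/s."""
    with httpx.Client(http2=http2, timeout=300) as client:
        start = time.perf_counter()
        client.put(UPLOAD_URL, content=PAYLOAD)
        elapsed = time.perf_counter() - start
    return len(PAYLOAD) / (1024 * 1024) / elapsed

print(f"HTTP/1.1: {throughput(http2=False):.1f} MiB/s")
print(f"HTTP/2:   {throughput(http2=True):.1f} MiB/s")
```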
Update 2023-11-28 13:30 UTC
Load balancers 🥁
We will begin adding the new load balancer instances deployed yesterday to the load balancer pool starting at 14:00 UTC.
The new load balancer IP addresses, which will be added alongside the current ones, are:
- 91.208.207.214
- 91.208.207.215
- 91.208.207.216
- 91.208.207.217
- 91.208.207.218
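If you filter egress traffic or allowlist our load balancer IPs, here is a quick, minimal sketch to verify the new addresses are reachable from your network (adjust the port and timeout to your own setup):

```python
# Minimal sketch: check that each new address accepts TCP connections on
# port 443 from your network.
import socket

NEW_LB_IPS = [
    "91.208.207.214",
    "91.208.207.215",
    "91.208.207.216",
    "91.208.207.217",
    "91.208.207.218",
]

for ip in NEW_LB_IPS:
    try:
        with socket.create_connection((ip, 443), timeout=5):
            print(f"{ip}: reachable")
    except OSError as exc:
        print(f"{ip}: NOT reachable ({exc})")
```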
EDIT 15:30 UTC: Our monitoring saw an increasing number of 404 response status codes. We rolled back the change and investigated the issue: it was an overlap of internal IP addresses with the Cellar load balancer, which is now fixed.
EDIT 15:45 UTC: After further investigation, we were able to resume the maintenance.
EDIT 18:05 UTC: We have finished deploying the new instances.
Update 2023-11-27 17:20 UTC
Load balancers 🥁
We have installed new load balancers. We will review and test them tonight and add them to the load balancer pool tomorrow morning (2023-11-28).
UPDATE 2023-11-27 15:30 UTC
Load balancers 👀
We are still seeing a few random SSL errors here and there. We are investigating. The culprit may be a lack of allocated resources. We are following this lead.
… we have fine-tuned the load balancers, which temporarily caused more SSL errors for about a minute. Traffic now looks better.
UPDATE 2023-11-27 14:00 UTC
Load balancers ❌
We are experiencing new errors on the load balancers: customers report PR_END_OF_FILE_ERROR errors in their browsers while connecting to their apps, and SSL_ERROR_SYSCALL from curl.
We are able to reproduce these errors. They look like the incident from Friday 24th in the morning. We are looking for the configuration mishap that may have escaped our review.
✅ It's fixed. We have started writing a monitoring script for this kind of configuration error; we will speed up its writing and deployment to production.
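As a rough illustration of the idea (not the actual script, and the hostnames below are placeholders), the check boils down to attempting a full TLS handshake against each frontend and alerting on any failure:

```python
# Rough sketch: attempt a full TLS handshake against each frontend and flag
# any failure. PR_END_OF_FILE_ERROR and SSL_ERROR_SYSCALL both show up here
# as handshake errors. Hostnames are hypothetical placeholders.
import socket
import ssl

FRONTENDS = ["app.example.com", "api.example.com"]

context = ssl.create_default_context()

for host in FRONTENDS:
    try:
        with socket.create_connection((host, 443), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                print(f"{host}: handshake OK ({tls.version()})")
    except OSError as exc:  # ssl.SSLError and connection resets are OSError subclasses
        print(f"{host}: handshake FAILED -> {exc}")  # alert here in a real check
```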
UPDATE 2023-11-27 08:30 UTC
Load balancers ✅
We've been monitoring the load balancers all weekend: the only desync was observed (and fixed right away by the on-call team) on old Sōzu versions (0.13) that still process 10% of Paris' public traffic!
We plan to remove these old load balancers quickly this week.
We consider the desynchronization issue resolved.
Cellar 👀
Last Friday, we configured Cellar's front proxies to lower their reload rate. We haven't seen any slowness since, but it was already hard to reproduce on our side.
No slowness on Cellar was reported during the weekend, but we are still on the lookout.
UPDATE 2023-11-24 15:00 UTC
Load balancers 🥁
After more (successful) load tests, the new version of Sōzu (0.15.17) is being installed on all impacted public and private load balancers. Upgrades should be complete within the next two hours.
Cellar 👀
The team continues to investigate the random slowness issues still encountered by some customers, which we are trying to reproduce in a consistent way.
UPDATE 2023-11-24 10:45 UTC
Load balancers 🥁
We've tested our new Sōzu release (0.15.17) all night with extra monitoring and no lag or crash was detected.
The only remaining issues were on the not-yet-updated (0.13.6) instances; they were detected by our monitoring and the on-call team restarted them.
We are pretty confident that this new release solves our load balancer issues. We plan to switch all private and public Sōzu load balancers to 0.15.17 today and monitor them over the coming days.
Temporary incident:
While updating our configuration to grow the traffic share of the new (0.15.17) load balancers, a human mistake (and not a newly discovered bug) broke part of the configuration, causing many SSL version errors on 15% of requests between 09:25 and 09:50 UTC.
UPDATE 2023-11-23 18:43 UTC
Certificates ✅
As we planned earlier, the renewal of all certificates in RSA 2048 has been completed, except for a few wildcards (mostly ours) which require manual intervention. These will be dealt with shortly.
Load balancers 🛥️
We were able to identify the root cause of our desync/lag in Sōzu. A specific request (a 'double bug') was causing worker crashes. We developed fixes and are confident they will solve our problems. We'll test them and monitor the situation before fully deploying them in production.
Cellar 👀
We've upgraded our load balancer infrastructure and monitoring tools to check whether this improves the various types of problems reported to us.
Original Status (2023-11-23 14:12 UTC)
1. Key management in Sōzu and security standards
Background: Two months ago, we migrated our automatic Let's Encrypt certificate generation from RSA 2048 keys to RSA 4096 keys. Following a major certificate renewal in early November, this led to timeouts when processing requests, and then to 504 errors.
Actions:
- On Monday November 13, we rolled back key generation to RSA 2048 for all new certificates.
- On Monday November 20, we launched a complete key regeneration in RSA 2048, which requires an increase in our Let's Encrypt quotas (in progress).
Back to normal: Within the day, while we finish regeneration.
Next steps: We have also explored a migration to the ECDSA standard, which according to our initial tests will enable us to improve both the performance and security levels of our platform. Such a migration will be planned in the coming months, after a deeper impact analysis.
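As a rough illustration of why the key type matters (this is not our production benchmark), the per-handshake signing cost of each key type can be compared locally; RSA 4096 signatures cost several times more CPU than RSA 2048, and ECDSA P-256 is cheaper still:

```python
# Rough, illustrative benchmark of per-signature cost for each key type.
# Requires the `cryptography` package (`pip install cryptography`).
import time

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec, padding, rsa

MESSAGE = b"tls-handshake-transcript"
ROUNDS = 200

def bench(label, sign):
    start = time.perf_counter()
    for _ in range(ROUNDS):
        sign()
    print(f"{label}: {(time.perf_counter() - start) / ROUNDS * 1000:.2f} ms per signature")

rsa_2048 = rsa.generate_private_key(public_exponent=65537, key_size=2048)
rsa_4096 = rsa.generate_private_key(public_exponent=65537, key_size=4096)
ecdsa_p256 = ec.generate_private_key(ec.SECP256R1())

pkcs1 = padding.PKCS1v15()
bench("RSA 2048", lambda: rsa_2048.sign(MESSAGE, pkcs1, hashes.SHA256()))
bench("RSA 4096", lambda: rsa_4096.sign(MESSAGE, pkcs1, hashes.SHA256()))
bench("ECDSA P-256", lambda: ecdsa_p256.sign(MESSAGE, ec.ECDSA(hashes.SHA256())))
```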
2. HTTPS performance issues
Background: We noted a significant drop in HTTPS request processing performance, with capacity reduced from 8,000 to 4,000 requests per second, due in particular to an excessive number of syscalls via rustls.
Actions: We developed a Sōzu update and pushed it on November 16.
Back to normal: The problem is now resolved.
3. Load balancers desync/lag
Background: Load balancers sometimes get out of sync: Sōzu gets stuck in TLS handshakes or requests, the workers no longer take config updates, and the proxy manager freezes. The load balancers then miss all new config updates until we restart them.
Actions: We have improved our tooling to detect the root cause of the problem at a deeper level. We have been able to confirm that this concerns both Sōzu versions 0.13.x and 0.15.x.
Next steps: We'll be tracing the problem in greater depth within the day, to decide what short-term actions to take to mitigate it.
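As an illustration only, the kind of cross-check this calls for can be sketched as follows; the admin endpoint, port, and hostnames are hypothetical placeholders, not Sōzu's actual API:

```python
# Hypothetical sketch: after pushing a configuration update, poll each load
# balancer for the version it actually serves and flag any instance that lags
# behind, since a desynced worker stops applying updates. Endpoint, port and
# hostnames are made up for illustration.
import urllib.request

EXPECTED_VERSION = "config-2023-11-23-1412"       # version just pushed (illustrative)
LOAD_BALANCERS = ["lb-1.internal.example", "lb-2.internal.example"]

for host in LOAD_BALANCERS:
    url = f"http://{host}:8080/__config_version"  # hypothetical admin endpoint
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            version = resp.read().decode().strip()
    except OSError as exc:                        # urllib errors subclass OSError
        print(f"{host}: unreachable ({exc})")
        continue
    state = "in sync" if version == EXPECTED_VERSION else f"DESYNCED (still at {version})"
    print(f"{host}: {state}")
```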
4. Random slowness on Cellar
Background: Customers are reporting slowness or timeouts on Cellar, which we are now able to identify and qualify. While the cause has not yet been fully identified, we have several ways of mitigating the problem.
Actions: Add capacity to the front-end infrastructure and improve the network configuration.
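For customers who want to qualify the slowness from their side, here is a minimal sketch that times repeated GETs of one object and looks at tail latencies rather than the average (the object URL is a placeholder; point it at one of your own Cellar objects):

```python
# Minimal client-side sketch: measure GET latency percentiles for one object.
# Requires `pip install httpx`. The URL is a placeholder.
import statistics
import time

import httpx

OBJECT_URL = "https://example-bucket.cellar-c2.services.clever-cloud.com/probe-object"
SAMPLES = 50

latencies_ms = []
with httpx.Client(timeout=30) as client:
    for _ in range(SAMPLES):
        start = time.perf_counter()
        client.get(OBJECT_URL)
        latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"p50: {statistics.median(latencies_ms):.0f} ms")
print(f"p95: {statistics.quantiles(latencies_ms, n=20)[18]:.0f} ms")
print(f"max: {max(latencies_ms):.0f} ms")
```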