Wednesday 16th June 2021

Infrastructure PAR: Network accessibility issue

Post Mortem

(The original incident text can be found at the end)

A network issue made the Paris zone completely unreachable for 17 minutes. This in turn caused some applications to go down, slowed down our deployment system while it restarted the affected applications, and impacted several other services.

Timeline

10:12 UTC: The whole PAR network is unreachable from the outside; the cross-datacenter network is down as well.

10:16 UTC: The on-call team is warned by an external monitoring system.

10:21 UTC: Our network provider informs us that they are aware of the issue.

10:29 UTC: The network is back.

10:30 UTC: The monitoring systems start queuing a lot of deployments. The load of one monitoring system in charge of one of the PAR datacenters increases significantly. Other systems such as Logs, Metrics, and Access Logs (collection and query) are also impacted and unavailable. Some applications relying on FSBucket services (mostly PHP applications) also have communication issues with their FSBuckets. This may have made some applications unreachable and their I/O very high, sometimes leading to Monitoring/Scaling deployments. This particular issue was detected later during the incident.

10:35 UTC: Our network provider confirms to us that the issue is fixed.

10:50 UTC: Deployments are slow to start because many of them are in queue.

11:00 UTC: Because its load is too high, the faulty monitoring system sees more applications as down than there actually are, and queues even more deployments for applications that were actually reachable.

11:15 UTC: Clever Cloud Metrics is back and delayed data points have been ingested. Writing to the ingestion queue is still experiencing problems.

11:20 UTC: We notice the build cache management system is overloaded, slowing down deployments and failing those that rely on the build cache feature. The retrying of these failed deployments adds even more items to the deployment queue.

11:28 UTC: We start upscaling the build cache management system beyond its original maximum setting.

11:52 UTC: We believe an issue discovered in the build cache management system over the past few days is responsible for the slowness/unreachability of the build cache service. This issue causes a thread leak, which had been triggering more upscalings than usual. A fix was being tested in our testing environment but had not yet been validated. We decide to push this fix to production.
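
For illustration only: the post-mortem does not describe the internals of the build cache management system, but a thread leak of the kind mentioned above typically follows the hypothetical pattern sketched below, where a new worker thread is spawned per request and is never bounded or joined, so the thread count and the associated load keep growing until autoscaling kicks in. All names here are invented.

```python
# Purely illustrative thread leak; this is NOT Clever Cloud's actual code.
import threading
import time

def handle_build_cache_request(request_id: str) -> None:
    # Stand-in for real work (e.g. fetching or storing a build cache archive).
    time.sleep(60)

def on_request(request_id: str) -> None:
    # Bug pattern: a fresh thread per request, never joined and not bounded
    # by a pool, so threads accumulate faster than they terminate.
    t = threading.Thread(target=handle_build_cache_request,
                         args=(request_id,), daemon=True)
    t.start()

if __name__ == "__main__":
    for i in range(500):
        on_request(f"req-{i}")
    # The leak shows up as an ever-growing thread count, i.e. higher load,
    # which in turn triggers more autoscaling events than usual.
    print("active threads:", threading.active_count())
```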

12:48 UTC: The fix pushed to production at 11:52 UTC is not effective. We upscale the build cache management system again.

13:00 UTC: Logs collection is back. Logs collected before this time were lost. Queries are also available but might still fail sometimes or return delayed logs.

13:05 UTC: We prevent the overloaded monitoring system from queuing up more deployments and empty out its internal alerting queue.

13:10 UTC: We roll back a change made to the database a few days ago, which we believe is the root cause of the ongoing issue.

13:16 UTC: The load on the build cache management system's database starts to go up. This is caused by the application making requests to the database more efficiently thanks to the previous rollback.

13:18 UTC: The build cache management system database is overloaded.

13:33 UTC: We start looking into optimizing requests and clearing up stale data.

13:59 UTC: We manage to bring the build cache management system database load down.

14:05 UTC: The build cache management system is still overloaded/slow despite its database now working properly. A deployment is queued with an environment config change but is slow to start. We restart the application manually to apply this change.

14:10 UTC: The configuration change is effective and the deployment queue starts to drain, but there are still a lot of deployments in it.

14:15 UTC: An older deployment that had been waiting in the queue, started without the environment change, finishes successfully, causing about half of the build cache requests to fail.

14:17 UTC: We start reapplying the fix manually on live instances while a new deployment with the correct environment is started. The deployment queue size is going down.

14:29 UTC: The deployment queue is filling up again.

14:53 UTC: We realize the faulty monitoring system is still queuing deployments despite its alerting queue being empty and the alerting action being disabled.

14:57 UTC: We completely restart the faulty monitoring system and make sure it stops queuing deployments.

15:10 UTC: We are now certain the previously faulty monitoring system has stopped queuing deployments for false positives. The deployment queue is back to normal and the deployment system is more responsive.

15:15 UTC: We start cleaning stuck deployments and making sure everything is working fine.

15:42 UTC: We start redeploying all Paris PHP applications which have not been deployed since the network came back.

16:00 UTC: Some PHP deployments seem to be failing due to a connection timeout to their PHP session stored on an FSBucket. We abort the PHP deployment queue to avoid any more errors.

16:10 UTC: The connection was only broken on one hypervisor and is now fixed. We also make sure every other hypervisor can contact all FSBucket servers on the PAR zone.

16:15 UTC: The PHP deployments queue is started again, with a shorter delay between deployments.

16:42 UTC: Clever Cloud Metrics / Access logs ingestion is now fixed. Queries should be returning up-to-date data. Access logs were stored in a different queue and have been entirely consumed.

17:05 UTC: The PHP deployments queue is now complete. All other applications in the PAR zone that had not been redeployed since the network came back have also been queued for redeployment to fix any connection issues with their FSBucket add-ons.

19:10 UTC: A few applications which have the “deployment with downtime” option enabled were supposed to be UP but had no running instances. Those applications are now being redeployed.

Network incident details

Foreword: Clever Cloud has servers in two datacenters in the Paris zone (PAR). In this post-mortem, they are named PAR4 and PAR5.

A routine maintenance operation carried out by our network provider on PAR4 started a few minutes before the incident. This maintenance consisted of decommissioning a router and should not have impacted the network. Various checks and monitoring were in place, as usual, and a quick rollback procedure was planned in case anything went wrong.

The decommissioning triggered an unexpected election of another router, which in turn triggered a lot of LSA (link-state advertisement) updates between all the routers of the datacenter, sometimes doubling them. Those updates created new LSA rules on other routers, which first made them slower to update and to route traffic. Some of the routers then hit a configuration limit on the number of LSA rules. When hitting the limit, a router went into protection mode and shut itself down. Each shutdown triggered more LSA updates on other routers, which then also hit their LSA limit and entered protection mode. This isolated the PAR4 site from the network.

A piece of internal equipment with a link between PAR4 and PAR5 also propagated those LSA updates to the PAR5 routers, replicating the exact same scenario.

To fix this, our network provider disconnected some routers, lowering the number of LSA announcements across the network and bringing the routers back online.
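
To make the cascade easier to reason about, here is a deliberately simplified toy model; it is not OSPF, and the topology, limits, and numbers are invented. It contrasts the "shut down on overflow" protection-mode behaviour described above with the "reject excess LSAs" behaviour listed in the planned actions below.

```python
# Toy model of the cascade; this is not OSPF and all numbers are invented.
from dataclasses import dataclass

LSA_LIMIT = 100  # hypothetical per-router cap on link-state entries

@dataclass
class Router:
    name: str
    lsas: int = 80            # baseline number of link-state entries
    up: bool = True
    policy: str = "shutdown"  # "shutdown" = protection mode, "reject" = drop excess

    def receive_lsas(self, count: int) -> int:
        """Apply an incoming burst of LSA updates and return how many extra
        updates this router will in turn flood to its neighbours."""
        if not self.up:
            return 0
        self.lsas += count
        if self.lsas <= LSA_LIMIT:
            return 0
        if self.policy == "reject":
            # Degraded but alive: excess entries are simply dropped.
            self.lsas = LSA_LIMIT
            return 0
        # Protection mode: the router shuts itself down, and its disappearance
        # triggers yet another wave of updates on the remaining routers.
        self.up = False
        return 40

def simulate(policy: str) -> int:
    routers = [Router(f"r{i}", policy=policy) for i in range(6)]
    wave = 30  # initial burst caused by the unexpected election
    while wave:
        wave = sum(r.receive_lsas(wave) for r in routers)
    return sum(r.up for r in routers)

for policy in ("shutdown", "reject"):
    print(f"{policy}: routers still up = {simulate(policy)}")
```

In this toy run, the "shutdown" policy ends with every router down (the cascade the PAR zone experienced), while the "reject" policy leaves all routers up in a degraded state.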

Actions

Network provider

Actions taken

  • The equipment that had links between the two datacenters has been isolated and is now in its own network. This makes sure LSA updates aren't inadvertently sent to the second datacenter.
  • An isolation timeout has been lowered from 5 minutes to 1 minute, making the system react faster to failures.

Actions planned in a few days

  • Forbid any non-primary router from being elected as a leader to avoid any issue. Under their support contract with their suppliers, our network provider has officially sent a bug report to the manufacturer of the router that did not behave as expected and is awaiting a fix and any relevant information.
  • Routers will now reject LSA rules when they hit their limit instead of going into protection mode. This will result in a degraded network at first, instead of an outright broken one. There are currently 4 different brands of routers and each of them will be tested separately.
  • Other security measures have been taken. Additional monitoring and logs will also be added.

Clever Cloud

Actions taken

  • Improved the performance of the build cache management system's interactions with its database, as well as the performance of the database itself.
  • Fixed a deployment system bug with urgent queues, which allows us to deploy some applications (internal and Clever Cloud Premium customers) before others; see the sketch below.
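
As a purely illustrative aside, an "urgent" deployment lane of the kind mentioned in the second bullet can be modelled as a priority queue: urgent deployments (internal services, Premium customers) are dequeued before the backlog of ordinary redeployments. The class and application names below are hypothetical and do not represent the actual deployment system.

```python
# Hypothetical sketch of an "urgent first" deployment queue.
import heapq
import itertools

URGENT, NORMAL = 0, 1  # lower number = dequeued first

class DeploymentQueue:
    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = itertools.count()  # preserves FIFO order within a priority

    def push(self, app_id: str, urgent: bool = False) -> None:
        priority = URGENT if urgent else NORMAL
        heapq.heappush(self._heap, (priority, next(self._seq), app_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

queue = DeploymentQueue()
queue.push("customer-app-1")
queue.push("internal-monitoring", urgent=True)
queue.push("customer-app-2")
queue.push("premium-customer-app", urgent=True)

# Urgent deployments come out before the backlog, in submission order.
print([queue.pop() for _ in range(4)])
# ['internal-monitoring', 'premium-customer-app', 'customer-app-1', 'customer-app-2']
```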

Actions planned

  • Further improve performance and resilience of the build cache management system.
  • Improve the monitoring of the alerts queue and of the number of unreachable deployments being processed.
  • Improve the visibility of urgent alerts among a high number of alerts.
  • Improve the monitoring of the logs storage system.
  • Improve the monitoring of the connectivity between FSBucket servers and hypervisors.
  • Improve the monitoring of applications that should be up but have no running instances.
  • Improve our communication on our status page by posting updates more frequently.

Original incident details

We are currently experiencing a network accessibility issue on our PAR zone. We are investigating.

EDIT 12:21 UTC+2: Our network provider is looking into the issue.

EDIT 12:28 UTC+2: Deployments on other zones might not correctly work. But traffic shouldn't be impacted.

EDIT 12:30 UTC+2: Network connectivity seems to be back. We are awaiting confirmation of incident resolution from our network provider.

EDIT 12:35 UTC+2: Our network provider found the issue and fixed it. Network is back online since 12:30 UTC+2. Investigation will be conducted to understand why the secondary link hasn't been used.

EDIT 12:42 UTC+2: A postmortem will be made available later once everything has been figured out.

EDIT 12:50 UTC+2: The deployment queue is currently processing, queued deployments might take a few minutes to start

EDIT 13:00 UTC+2: Logs may also be unavailable depending on the applications

EDIT 13:20 UTC+2: The deployment queue still has a lot of items; the build cache feature is currently having trouble, which slows down deployments.

EDIT 14:33 UTC+2: The deployment queue is now smaller but there are still some issues with some deployments. Logs are also partially available.

EDIT 15:30 UTC+2: The build cache feature still has trouble; we are currently working on a workaround. Logs should now be back but there is a delay in processing which might affect availability on the Console / CLI. They might be a few minutes late.

EDIT 16:04 UTC+2: Some applications linked to FSBuckets systems might have lost their connection to the FSBucket, increasing their I/O and possibly rebooting in a loop for either Monitoring/Unreachable or Monitoring/Scalability. This can cause response timeouts, especially for PHP applications

EDIT 16:16 UTC+2: Build cache should be fixed, meaning that deployments should take less time

EDIT 16:53 UTC+2: There are still a lot of Monitoring/Unreachable events being sent, making a lot of applications redeploy for no good reason. We are still working on it.

EDIT 17:18 UTC+2: The issue with Monitoring/Unreachable events has been fixed. The size of the deployments queue should go down.

EDIT 18:07 UTC+2: Most issues have been cleared up. PHP applications may still be experiencing issues; we are working on it. If you are experiencing issues on non-PHP applications, please contact us.

EDIT 19:05 UTC+2: All PHP applications have been redeployed. If you are still experiencing issues, please contact us. All other applications which have not already been redeployed since the beginning of the incident will be redeployed in the next few hours (to make sure no apps are stuck in a weird state).