Some systems are experiencing issues
Scheduled Maintenance
[PAR] Security maintenance on 4 hypervisors

For security reasons, we will update the kernel of 4 hypervisors in the Paris (PAR) region, more precisely in the PAR6 datacenter. Services hosted on those hypervisors (in particular databases) will be impacted: they will be unavailable for 5 to 10 minutes. The impacted hypervisors are:

  • hv-par6-008
  • hv-par6-011
  • hv-par6-012
  • hv-par6-020

Affected clients will be contacted directly and individually by email with the list of impacted services and options to avoid any impact. The maintenance will be carried out in 2 operations of 2 hypervisors each, during the week of 18 to 22 November 2024, between 22:00 and 24:00 UTC+1.

Past Incidents

Wednesday 20th January 2021

Infrastructure Investigating hypervisor issues

We are experiencing issues with hypervisors. We are investigating.

EDIT 15:45 UTC: Two hypervisors went down. The impacted services are:

  • Add-ons -> add-ons hosted on those servers are currently unavailable

  • Applications -> applications that were hosted on those servers should have been redeployed or should be in the redeploy queue

  • Logs -> new logs won't be processed. This includes drains. You might only get old logs when using the CLI / Console

  • Shared RabbitMQ -> A node of the cluster is down; performance might be degraded

  • SSH -> No new SSH connections can be made to the applications as of now.

  • FS Bucket -> an FS Bucket server was hosted on one of the affected servers. Those buckets are unreachable and may time out when writing / reading files (see the sketch after this list)
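
For clients reading or writing bucket files during such an event, one defensive pattern is to bound each file operation with a timeout and retry with backoff. Below is a minimal, hypothetical Python sketch; the mount path is a placeholder, not an actual FS Bucket path, and buckets are assumed to behave as plain mounted file storage.

```python
# Hedged sketch: run blocking file I/O in a worker thread so a hung
# FS Bucket mount cannot stall the caller; retry with backoff.
# BUCKET_FILE is a hypothetical mount path.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

BUCKET_FILE = "/app/my-bucket/data.bin"  # hypothetical

def read_file(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()

def read_with_timeout(path: str, timeout: float = 5.0, retries: int = 3) -> bytes:
    """Bound each read attempt by `timeout` seconds, then back off and retry."""
    for attempt in range(1, retries + 1):
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(read_file, path).result(timeout=timeout)
        except FutureTimeout:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # 2 s, 4 s, ...
        finally:
            # Don't join the worker: a truly hung read would block here.
            pool.shutdown(wait=False)
    raise AssertionError("unreachable")

if __name__ == "__main__":
    print(len(read_with_timeout(BUCKET_FILE)))
```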

EDIT 15:54 UTC: Servers are currently rebooting.

EDIT 15:59 UTC: Servers rebooted and the services are currently starting. We are closely monitoring the situation.

EDIT 16:07 UTC: Services are still starting and we are double-checking impacted databases.

EDIT 16:11 UTC: Deployment might take a few minutes to start due to the high deployment queue.

EDIT 16:33 UTC: Most services should be back online, including applications and add-ons. The deployment queue is still processing.

EDIT 16:45 UTC: The deployment queue has been empty for a few minutes now; all deployments should go through almost instantly.

EDIT 17:13 UTC: Deployment queue is back to normal.

EDIT 17:15 UTC: The incident is over.

Tuesday 19th January 2021

Services Logs Logs ingestion issue

We have detected an issue affecting our logs collection pipeline. New logs are not being ingested. We are investigating.

15:52 UTC: The issue has been identified and should be fixed. We are monitoring things closely.

16:11 UTC: Overall traffic in the logs ingestion pipeline is not completely back to normal. If one of your applications does not have up-to-date logs, you can try restarting it (see the sketch below).
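
The restart can also be scripted. This is a minimal sketch assuming the clever-tools CLI is installed and linked to your account; the application ID below is a hypothetical placeholder.

```python
# Hedged sketch: restart an application via the clever-tools CLI so it
# re-attaches to the logs pipeline. The application ID is hypothetical.
import subprocess

APP_ID = "app_00000000-0000-0000-0000-000000000000"  # hypothetical ID

def restart_app(app_id: str) -> None:
    """Trigger a restart through `clever restart`; raises CalledProcessError on failure."""
    subprocess.run(["clever", "restart", "--app", app_id], check=True)

if __name__ == "__main__":
    restart_app(APP_ID)
```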

16:32 UTC: We forced a component of the ingestion pipeline to catch up with the logs waiting in the queue. Things should be back to normal in a matter of minutes.

API Console and API performance issues

We are investigating performance issues with the API and Console. The issue seems to be caused by our dedicated reverse proxies (which do not affect the performance or availability of our customers' applications).

While we were investigating, something broke in one of the reverse proxies, causing availability issues. We are working on this.

10:25 UTC: The availability issue has been resolved. We are still working on resolving the performance issue.

10:32 UTC: We found the culprit and implemented a workaround. Performance is back to normal. We are still working on a proper fix.

Monday 18th January 2021

No incidents reported

Sunday 17th January 2021

No incidents reported

Saturday 16th January 2021

No incidents reported

Friday 15th January 2021

No incidents reported

Thursday 14th January 2021

Pulsar Pulsar issues

Our Pulsar cluster is currently experiencing issues. We are investigating the impact this may have on the cluster's usage and how to resolve it.

EDIT 14:03 UTC: The problem is now resolved. Some connection issues occurred, but a retry would have succeeded (a minimal retry sketch follows).
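
To illustrate the last note, here is a minimal client-side retry sketch in Python using the pulsar-client library. The service URL and topic are hypothetical placeholders, and a real setup would likely also pass authentication parameters to pulsar.Client().

```python
# Hedged sketch, not Clever Cloud's client: retry a Pulsar publish with
# exponential backoff so transient connection errors are absorbed.
# SERVICE_URL and TOPIC are hypothetical placeholders.
import time

import pulsar

SERVICE_URL = "pulsar://localhost:6650"                # hypothetical
TOPIC = "persistent://tenant/namespace/example-topic"  # hypothetical

def send_with_retry(payload: bytes, retries: int = 5) -> None:
    """Rebuild the client and resend on failure, backing off between tries."""
    for attempt in range(1, retries + 1):
        client = None
        try:
            client = pulsar.Client(SERVICE_URL)
            producer = client.create_producer(TOPIC)
            producer.send(payload)
            return
        except Exception:
            # pulsar-client raises library-specific errors; a broad
            # catch keeps the sketch simple.
            if attempt == retries:
                raise
            time.sleep(min(2 ** attempt, 30))  # 2 s, 4 s, 8 s, ... capped
        finally:
            if client is not None:
                try:
                    client.close()
                except Exception:
                    pass  # best effort on a possibly broken client

if __name__ == "__main__":
    send_with_retry(b"hello")
```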