Friday 14th October 2022

Infrastructure [RETROACTIVE] [PAR] Some database instances went down.

At 04:30 UTC: a Pulsar cluster started to behave abnormally (See )

At 05:30 UTC: on PAR, notification services on the hypervisors began trying to send messages in a loop, filling the systems with stuck processes.

At 07:00 UTC: the OS on these hypervisors started killing processes to free up resources, which impacted some applications and databases. We started shutting down the stuck processes and restarting the broken instances.

At 10:00 UTC: we finished restarting all the broken instances.