(Times are in UTC)
At 2024-05-04 23:31, a database load balancer lost its network routes. The corresponding alert was configured as low priority and did not wake up the on-call agent.
At 2024-05-05 01:23, another service failed because of that load-balancer issue. This time, the failure triggered a high-priority alert.
The on-call agent investigated and found that the load balancer was responsible for the other service's failure. They restored the network routes, and every impacted service was back online around 01:45.
Clever Cloud's PAR region has 8 of those load balancers. Only the services that were trying to connect to this specific one experienced downtime. Some customers' applications redeployed themselves, connected to another load balancer, and quickly recovered.
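To illustrate the failover principle only: in practice the recovery happened through redeployment, but the effect is equivalent to picking the first reachable load balancer from the pool when a connection is set up. The hostnames and port below are placeholders, not Clever Cloud's actual endpoints.

```python
import socket

# Hypothetical pool of the 8 PAR load-balancer endpoints (placeholder names).
LB_ENDPOINTS = [f"lb-{i}.par.example.internal" for i in range(1, 9)]

def connect_to_any_lb(port: int = 5432, timeout: float = 3.0) -> socket.socket:
    """Return a connection to the first reachable load balancer in the pool."""
    last_error = None
    for host in LB_ENDPOINTS:
        try:
            # create_connection raises OSError when the host is unreachable,
            # which is what happened once the load balancer lost its routes.
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            last_error = exc
    raise ConnectionError("no load balancer reachable") from last_error
```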
On 2024-05-06, we changed the first alert to high priority, which it should have been from the start. We also made sure that every other "load balancer is unreachable" alert is high priority.
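A minimal sketch of the kind of alert-routing change described above, assuming a simple mapping from alert names to priorities; the alert names, the priority table, and the paging function are hypothetical, not our actual monitoring configuration.

```python
# Hypothetical alert-routing table; placeholders, not the real configuration.
ALERT_PRIORITIES = {
    "load balancer is unreachable": "high",  # was "low" before 2024-05-06
    "service unreachable": "high",
}

def page_on_call(alert_name: str) -> None:
    # Placeholder for the real paging integration (phone call, push, etc.).
    print(f"PAGE: {alert_name}")

def route_alert(name: str) -> str:
    """Return the priority for an alert; high-priority alerts page the on-call agent."""
    priority = ALERT_PRIORITIES.get(name, "low")
    if priority == "high":
        page_on_call(name)
    return priority
```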