(All times in UTC)
16:30 we started seeing alerts about high load on the primary node.
17:00 we started getting reports that the cluster was unreachable.
18:00 after checking the cluster, we decided to restart the primary node.
Data may have been lost, as the node was not writing or replicating correctly. We are still waiting for the primary node to restart. The secondary does not seem to be promoting itself to primary.
19:30 the secondary was finally promoted to primary. We are blocking users who are making unfair use of the cluster.
22:45 we detected that the node we restarted had failed to rejoin the cluster. We decided to remove it entirely and re-create it from scratch.
2023-03-13 10:00 the node fully reached the "SECONDARY" state, and we put it back into production.
Measures have been taken to prevent future unfair use from users.