Monday 10th May 2021

Access Logs HBase cluster supporting metrics down

(All times in UTC) At 22:50 we got an alert saying access logs stopped being consumed. At 22:53 we got alerts saying hbase region servers went down.

After investigation, the hadoop namenodes were all in standby. At 23:33, after various checks, we promote one back to active. We then restarted all the hbase regionservers, then waited for the cluster to balance and heal up.

At 00:04 we restart the warp10 stores. At 00:07 everything is back to normal.