Resolved
This incident has been resolved.
Monitoring
All servers are now online. The new rack is currently without CMI VRF; we're working on a fix and it should be fully operational soon.
Identified
All nodes except HKG12C0328 are up. We're still working to get it back online.
Identified
Our technician has arrived on site and discovered that the PDU has partially failed. We are now moving the affected servers to another rack to restore service. The estimated time to restoration (ETR) is 30 minutes.
Identified
Some of our computing nodes are now back online. Based on the metrics we collected, it appears that the circuit breaker for one of the banks (the PDU has two banks) tripped when the primary PDU lost power. Although the secondary PDU took over the load, the circuit breaker still tripped. We are waiting for the datacenter to provide more information on this issue. We will provide further updates as soon as we have more information available.
Identified
We've noticed that some of our equipment is now back online, but the ToR switch remains offline as it has dual PSUs and is connected to different PDUs. We will provide more updates once we arrive at the datacenter or receive more information from the datacenter via the trouble ticket.
Identified
We are aware that two of our cabinets in Hong Kong accidentally lost power during maintenance. A technician has been dispatched, and we will provide updates as soon as he arrives at the datacenter.
Investigating
We are currently investigating this issue.