Short outage caused by a scheduled security update

Incident Report for Frontify

Resolved

On Monday morning, between 5:19 and 5:27 am UTC, one of our application clusters had a service interruption causing a short downtime.
Two unrelated events coincided to cause this:
At midnight, AWS took one of our application servers offline to mitigate a hypervisor issue.
Later that morning, at 5:19 am UTC, our scheduled security update took the second application server out of service while the first was still offline, causing the short outage.
At 5:27 am UTC, the security updates finished successfully, and the server was returned to service, ending the downtime.
After investigating the issue, we brought the first application server back into service, so we are fully redundant again.
To avoid this in the future, we enhanced our monitoring and created automatic alerts that notify us when only one application server is in service.
Additionally, any future scheduled security updates will first check for enough healthy resources before starting.
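As a rough illustration of the two safeguards described above, here is a minimal sketch in Python. It is not our production tooling: the `Server` type, the function names, and the threshold of two in-service servers are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold for this sketch: alert when fewer than two servers are in service.
MIN_IN_SERVICE = 2


@dataclass
class Server:
    """Hypothetical record of one application server and its state."""
    name: str
    in_service: bool


def should_alert(servers: List[Server], minimum: int = MIN_IN_SERVICE) -> bool:
    """Return True when fewer than `minimum` servers are currently in service."""
    in_service = sum(1 for s in servers if s.in_service)
    return in_service < minimum


def can_start_update(servers: List[Server], target: str, min_remaining: int = 1) -> bool:
    """Return True only if taking `target` out of service would still leave
    at least `min_remaining` other servers in service."""
    remaining = sum(1 for s in servers if s.in_service and s.name != target)
    return remaining >= min_remaining
```

With the state from this incident (one server already offline), `should_alert` would return True and `can_start_update` would return False for the remaining server, so the update would be postponed instead of causing downtime.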
We apologize for any inconvenience caused by this outage.

Posted Dec 20, 2021 - 05:19 UTC