
Client Area outage (2018-07-05): a post-mortem

tl;dr

The client area was down for around 3 hours because of hard-disk issues on its host server. Our team moved the installation to a new cloud server to bring the site back up.

Service situation

The client area, like the main site, is hosted on a server called "intern", which serves both sites using Nginx and PHP-FPM. Intern itself is a container running on "Polaris Master", a bare-metal server that also hosts a few other internal services and the old web hosting server "Polaris". The setup has been this way for many years. The client area database has since been moved to a managed database service for guaranteed uptime and performance.
The client area runs on software called WHMCS. It manages support, billing, orders and much more, so it is considered business critical.

What happened?

16:14 CEST: Our staff and our internal monitoring reported that the client area had become slow.
16:27 CEST: A closer investigation started. No high load or abnormal requests were found, and no database connection issues either.
16:45 CEST: We noticed a high I/O wait time. Since the server was shared with some internal sites, we shut those down to prioritize the client area.
17:16 CEST: It became clear that this was a hardware-related issue. We suspected the hard disk was causing the problems and reported this to the datacenter for further investigation.
17:30 CEST: Since the service is critical, we decided that waiting for the datacenter would take too long, so we started moving the installation to a new cloud server.
18:00 CEST: The client area came up again for a brief moment.
18:06 CEST: The final go was given to migrate the client area.
18:25 CEST: We moved over the DNS records and had a fully installed but non-functional system.
19:06 CEST: The client area became fully functional again.
19:18 CEST: We saw a very high load on the server, caused by requests that had piled up during the downtime. This went away 10 minutes later.
19:23 CEST: We moved over the DNS of the main site as well, since we noticed failed requests.

Around 19:25 CEST everything was reporting OK.
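
The high I/O wait noticed at 16:45 is the kind of signal that can be read straight from the kernel's CPU counters. The following is a minimal sketch, not the tooling used during the incident: it assumes a Linux host that exposes /proc/stat, and the function names are made up for illustration.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate "cpu" line of /proc/stat into a list of counters.

    Field order on Linux: user, nice, system, idle, iowait, irq, softirq, steal.
    The trailing guest fields are skipped because they are already part of user/nice.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:9]]


def iowait_percent(before, after):
    """Share of CPU time spent waiting on I/O between two snapshots, in percent."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as stat_file:
        return stat_file.readline()


def sample_iowait(interval_seconds=1.0):
    """Take two snapshots interval_seconds apart and return the iowait percentage."""
    before = read_cpu_line()
    time.sleep(interval_seconds)
    return iowait_percent(before, read_cpu_line())
```

Calling sample_iowait() on a healthy server would normally return a value close to zero. A sustained high value with otherwise low load points at slow storage rather than at traffic, which matches what we saw here.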

Note: this whole incident was handled over a mobile internet connection, mostly while on public transport. Bad reception delayed some of the steps.

What was the impact?

The client area is critical for us as it manages communication with our clients as well as billing. To be clear: this did not impact any client radio stations at all; they kept streaming throughout. The incident lasted around 3 hours, although some requests still succeeded during this period.
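
The "around 3 hours" figure can be checked against the timeline. This is a small sketch using only the timestamps listed under "What happened?"; the helper names are ours and purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; all timestamps in the timeline are on 2018-07-05.
CEST = timezone(timedelta(hours=2), "CEST")


def at(hhmm):
    """Build a timezone-aware datetime on the incident day from an HH:MM string."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2018, 7, 5, hour, minute, tzinfo=CEST)


def minutes_between(start, end):
    """Whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


first_report = at("16:14")      # client area reported slow
functional_again = at("19:06")  # client area fully functional again
all_ok = at("19:25")            # everything reporting OK

print(minutes_between(first_report, functional_again))  # 172
print(minutes_between(first_report, all_ok))            # 191
```

Depending on which end point is used, the incident lasted 2 hours 52 minutes or 3 hours 11 minutes.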

What will happen?

The current server is seen as an interim solution. We're awaiting a report on the physical state of "Polaris Master" to decide whether to continue using that server or to move to a proper cloud solution for our client area. Due to licensing issues, the client area can currently only run on one physical server. We're getting in touch with WHMCS to discuss options for running it on more physical servers and locations, like we did for ITFrame.
