Live blog: datacenter power outage's effects on Copernica's performance

by Edward Touw

For a software development company, it’s the worst-case scenario: downtime. It can happen because your servers are offline, because of a bug in a release, or because of too much load on your servers. That’s why, at Copernica, we do everything in our power to prevent downtime. Despite all precautions, however, a power outage at our datacenter Leaseweb caused difficulties last night.

We test our software updates thoroughly before they go live. And because it’s better to be safe than sorry, we only release outside our peak hours. That way, if something unforeseen goes wrong, we can revert the update without having to bother our users with an interruption in service.

We also have an impressive number of backup servers, both to prevent an overload on our other servers and to keep our application up and running should the other servers fail for whatever reason.

Weakest link

A chain, however, is only as strong as its weakest link. Despite all precautions, in some cases you’re still dependent on the precision of others. We were reminded of this last night, when technical difficulties at our datacenter Leaseweb cut off the power supply to our servers, shutting off our main servers as well as our backups.

What are the effects of this mistake for Copernica and its users? All times and dates in this live report are CET.

OCTOBER 30

11.45 PM: Just like that, our application and website are inaccessible. The employee who has the support phone is woken up by an automatic alarm and starts his investigation.

OCTOBER 31

0.00 AM: The cause of the interruption is found. None of the hundreds of servers Copernica has are accessible. Other colleagues receive a wake-up call and drive off to Haarlem, where our datacenter Leaseweb is located.

0.30 AM: Arrival at the datacenter. Inquiries reveal that a Leaseweb employee accidentally turned off the power supply to our servers.

0.45 AM: All servers are switched back on. Repairing all the damage, however, is going to take a lot of time. Whenever a database loses its power supply, all tasks that are in its memory are lost and the database falls out of sync. This has to be repaired first, before the databases will be able to perform as usual.

3.45 AM: The technical team is still working on repairing the issues. They have to go through enormous amounts of data before the servers will be operational again.

8.00 AM: All databases are now operational. The Copernica application and website are back online.

8.30 AM: Tasks are being performed again. Emailings are being sent, although in many cases with delays. Repair work is still in full swing.

8.35 AM: Mailings that were scheduled to be sent last night are being checked manually by the technical team. All selections are being reviewed to decide whether the mailing should still be sent.

If a database selection is made for all people that live in London for example, it’s relatively safe to assume that no major changes will have affected this selection in the last few hours and the mailing will be deployed.

But for other selections, like ‘everyone that clicked on hyperlink X two hours ago’, it’s not as easy to determine. Selections like these could be out of date and no longer relevant for their recipients. Users that scheduled mailings like these receive a telephone call from Copernica to confirm whether the mailing should be resumed or cancelled.
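The review described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the tooling our technical team used: the `Selection` and `Mailing` classes, the `time_relative` flag, and the `triage` function are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Selection:
    name: str
    # Hypothetical flag: True when the selection depends on *when* it is
    # evaluated (e.g. "clicked on hyperlink X two hours ago").
    time_relative: bool


@dataclass
class Mailing:
    subject: str
    selection: Selection


def triage(mailing: Mailing) -> str:
    """Decide what to do with a mailing that was delayed by the outage.

    Returns "send" for selections that are unlikely to have changed in a
    few hours, and "confirm" when the user should be called first.
    """
    if mailing.selection.time_relative:
        return "confirm"
    return "send"


# Two delayed mailings, modelled on the examples in the text.
delayed = [
    Mailing("City newsletter", Selection("Lives in London", time_relative=False)),
    Mailing("Follow-up offer", Selection("Clicked on hyperlink X two hours ago", time_relative=True)),
]
decisions = [triage(m) for m in delayed]
```

In this sketch the London mailing is simply sent, while the click-based mailing is held until the user confirms by phone.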

8.45 AM: Our support team receives the first reports from users that haven’t seen the notification about the power outage yet. Our sales and partner management teams are calling customers that are affected by the interruption.

11.45 AM: The support team is getting more reports about delayed emailings. The entire technical team is working on fixing issues.

2.00 PM: Leaseweb’s operation manager sends an email to Copernica, giving a short explanation of what happened:

“On October 15 all [Leaseweb - redacted] customers in scope were informed about the planned maintenance on October 30 2013.

The planning and expectation of the maintenance was that no servers and services would be affected.

Unfortunately due to a technical malfunction in an ATS device connected to the rack in scope, there was a power outage for a period of time.

I regret the unfortunate power outage and will we do our at most best to prevent future outages.”


2.30 PM: The effects of the power outage are still noticeable. Tasks are being performed with delays, and not all databases have been repaired yet. We expect these issues to last at least all day.

5.30 PM: Although some of our database servers are back on schedule, the majority of them still have a lot of catching up to do. We expect them to be able to do so tonight. And even though we're optimistic that this will improve the software's performance tomorrow, we still foresee a lesser performance than usual.

9.00 PM: After a long day during which our technical team worked hard, the repair work has been finished and there seem to be no more delays. The application is running as usual. Our operations team will keep a close eye on the software to make sure everything functions properly.

If you are not sure whether your mailing was sent, please send an email to support@copernica.com.

NOVEMBER 1

9.30 AM: The application worked as usual last night. All emailings were sent as scheduled and tasks were performed as planned.

11.50 AM: What exactly went wrong Wednesday night?

All servers at Leaseweb have two sources of power supply. That way, if for whatever reason one source should become unavailable, an ATS device ensures that the servers will get their power from the second source.

This switch however malfunctioned Wednesday night. So when the first power source got cut off for maintenance, our servers were deprived of electricity.

To prevent this issue from happening again in the future, our servers will now be connected to both power sources directly, so that they no longer depend on the ATS device.
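For the technically inclined, the difference between the two setups can be expressed as a tiny model. This is a deliberately simplified sketch: the function names are made up for this example, and it assumes that an ATS passes the first source through whenever that source is up and only needs to switch when it drops.

```python
def powered_via_ats(source_a_up: bool, source_b_up: bool, ats_working: bool) -> bool:
    """Server with a single feed behind an ATS (simplified assumption).

    The ATS passes source A through while it is up. When A drops, the
    server only keeps power if the ATS works and source B is up.
    """
    if source_a_up:
        return True
    return ats_working and source_b_up


def powered_direct(source_a_up: bool, source_b_up: bool) -> bool:
    """Server connected to both sources directly: either one is enough."""
    return source_a_up or source_b_up


# Wednesday night: source A is cut for maintenance and the ATS malfunctions.
old_setup = powered_via_ats(source_a_up=False, source_b_up=True, ats_working=False)
new_setup = powered_direct(source_a_up=False, source_b_up=True)
```

Under these assumptions `old_setup` is `False` and `new_setup` is `True`: with a direct connection to both sources, the same maintenance would not have cut the power.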

And while we’re on the subject, none of the user databases were ever at risk.

Besides our servers in Haarlem, we also have backups in Amsterdam, where we store replicas of all data that is stored in our (users’) databases.

So even in the extremely unlikely event that some kind of natural disaster strikes Haarlem, there will always be a copy of all user data safely stored in Amsterdam.

NOVEMBER 2

11.00 AM – Although emailings have been sent according to schedule since Thursday evening, some users might still notice that other tasks are being performed slower than usual. Imports are taking longer to complete, for example, and selections are being built with a delay. Our technical team is working on a solution. We expect these issues to last for a few more days.

3.00 PM – Selections and imports are still being performed slower than usual. Our technical team expects that these tasks will be performed as usual again tomorrow.

NOVEMBER 4

1.00 AM – Some of the issues that appeared to have been solved last week seem to be returning (although on a much smaller scale now). Besides the slower task performance, a small number of emails are being sent out with a delay.

9.00 AM – No change yet. A handful of users are experiencing delayed emails. The technical team thinks that they are close to a solution.

11.00 AM – All tasks are being performed as usual. Imports and selections are no longer delayed. All emailings are sent as scheduled.

NOVEMBER 6

12.00 PM - To prevent issues like the above from happening in the future, this Saturday we will connect our servers directly to the power supply of our datacenter. We advise you not to schedule any emailings that evening between 6 PM and midnight (CET).

NOVEMBER 7

1.30 PM - Due to technical difficulties, Copernica was unavailable for a few minutes at around 1 PM this afternoon (CET). This issue had no effect on scheduled mailings and was not related to the recent power outage at our datacenter.

Because of this interruption of service, some emails have been delayed. They are being sent at this moment.

This issue has been solved. No data was lost during this short interruption, and no additional user actions are required.

NOVEMBER 10

12.05 AM - The scheduled maintenance on the power feeds to the Copernica servers went well. All maintenance tasks were completed with virtually zero downtime, and mailings and other tasks were not delayed. Apart from two short connection interruptions, this maintenance had no consequences for our users.
