The big database revision
by Feike Wierda
Copernica is in the process of revising its entire database infrastructure. The revision serves two purposes, the first being a much-needed speed boost, the second - no less important - simplifying database administration. This post will contain a lot of tech talk which may not be so easy to follow for the not-so-technical reader. If you fall into that category, let me summarize the entire piece in a single sentence, so you can go about your - undoubtedly important - business: Stuff will go faster.
All Copernica databases are currently running on a mixed bag of Dell M610 and M620 blades equipped with traditional (spinning) disks. While the blades provide a lot of computing power in a small package, they don't really deliver on storage: the form factor of these servers limits the storage to two disks. Spinning at 15,000 RPM, two disks provide around 250 IOPS, which - with Copernica's database usage - translates to a maximum of around 200 databases running on a single server. Bearing in mind that we are serving several thousand databases and that every database server has at least one slave, that's a lot of servers requiring a lot of maintenance.
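To put a rough number on "a lot of servers", here is a minimal back-of-the-envelope sketch. The 200 databases per server and the single slave per master come from the paragraph above; the total of 3,000 databases is purely an assumed stand-in for "several thousand", and the function name is made up for illustration.

```python
import math


def servers_needed(total_databases, databases_per_server, slaves_per_master=1):
    """Estimate how many machines are needed to host all databases.

    Each master holds at most `databases_per_server` databases and is
    mirrored by `slaves_per_master` slaves.
    """
    masters = math.ceil(total_databases / databases_per_server)
    return masters * (1 + slaves_per_master)


# ASSUMPTION: 3000 is a hypothetical figure for "several thousand databases".
print(servers_needed(3000, 200))  # 15 masters + 15 slaves = 30 servers
```

The count shrinks in proportion to how many databases fit on one machine, which is why storage speed matters so much for administration.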
Responsiveness and rebuilds
There are two issues that surface each time our systems are put under a higher than average load: the responsiveness of the Copernica user interface suffers and rebuilding of views and databases takes more time. To effectively tackle these issues, we have decided to move away from spinning disks and the blades they reside in, and move to the much faster Solid State Disks (SSDs) in full-size servers. This would mean faster databases and open the possibility of putting more (as much as 10 times as many) databases on a single server, thereby easing administration. Acquiring faster database servers only tells half the story, as the problem is more complex than just input/output (IO).
Rebuilds and slave lag
To understand how database server load affects the speed of rebuilds, let me dive a bit deeper into our rebuild system. To spread IO load and to keep the interface as responsive as possible, rebuilds are performed on database server slaves. For this purpose, the slaves need to be continuously up-to-date(ish). There are two notable issues with this setup. First, MySQL (or, in this case, MariaDB) only has a single replication thread, which means that hundreds or thousands of queries that are performed simultaneously on the master will be processed sequentially on the slave. Ouch. Second, rebuild queries can run for a long time, thereby locking the table and keeping the slave thread from what it's supposed to be doing, namely replicating queries. Now because the slaves are lagging, we cannot perform rebuilds as the data on the slave might be out of date, which means we have to wait for it to catch up, which in turn means delays.
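The "wait for the slave to catch up" step can be sketched as a simple check, shown here as an illustration and not as our actual rebuild code. It assumes the output of SHOW SLAVE STATUS has already been fetched into a dictionary; Seconds_Behind_Master is the standard lag column in that output, while the function name and the 5-second threshold are hypothetical.

```python
from typing import Mapping, Optional


def slave_is_fresh(status: Mapping[str, Optional[int]], max_lag_seconds: int = 5) -> bool:
    """Decide whether a slave is up to date enough to run a rebuild on.

    `status` is one row of SHOW SLAVE STATUS as a dict. The threshold is an
    ASSUMPTION for illustration, not a value taken from production.
    """
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # NULL (or a missing column) means replication is not running,
        # so the data cannot be trusted to be current.
        return False
    return lag <= max_lag_seconds


print(slave_is_fresh({"Seconds_Behind_Master": 2}))    # True
print(slave_is_fresh({"Seconds_Behind_Master": 120}))  # False
```

A rebuild scheduler would poll a check like this and postpone the rebuild while it returns False, which is exactly where the delays described above come from.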
More than just new hardware
Increasing the IO capacity of the servers will, of course, make queries run faster, which means that the slave will keep up more easily and tables will be locked a lot less. Still, while the master can process many queries simultaneously, we're pushing all these queries through a single thread. Not optimal. This is where the shiny new MariaDB 10.0 comes in, which can run a replication thread not per server, but per database. We will be using the opportunity of the server switch to also migrate from the current MariaDB 5.5 to the brand spanking new 10.0. Apart from the multithreaded replication, version 10.0 comes with a host of other improvements, including online altering of tables, which makes it easier for us to make changes to the software without hurting anyone.
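Why a thread per database helps can be shown with a toy model. This is a simplification under stated assumptions, not a description of MariaDB internals: it takes the per-database description above at face value, assumes the server has enough IO and CPU to run all threads side by side, and uses invented database names and timings.

```python
def serial_replay_ms(work_ms_by_database):
    """One replication thread: all databases queue up behind each other."""
    return sum(work_ms_by_database.values())


def per_database_replay_ms(work_ms_by_database):
    """One thread per database: the busiest database sets the pace."""
    return max(work_ms_by_database.values(), default=0)


# ASSUMPTION: hypothetical backlog of replication work per database, in ms.
backlog = {"customer_a": 4000, "customer_b": 500, "customer_c": 1500}

print(serial_replay_ms(backlog))        # 6000
print(per_database_replay_ms(backlog))  # 4000
```

In the serial case the small backlog of customer_b waits behind everyone else's work; in the per-database case it is done after 500 ms, regardless of the long-running queries in customer_a.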
Sounds great, but when can we see the results?
At the time of writing this post, the new Dell R720s are being placed into our new cabinet at the EvoSwitch datacenter. The process of setting up the new MariaDB 10 and moving all the data is probably going to take about three to four weeks, meaning that this project should be completed in early May. We will keep you posted on progress.
Keep an eye out for more improvements
While the operations department is working hard on making all these changes to the database infrastructure, the R&D department hasn't been sitting still either. You can expect to see an update on all the latest developments soon. It is even rumoured that work is being done on a new interface...
Image: A nice stack of new DELLs