Improving performance. Where are we now?
by Feike Wierda
From a business perspective, growth is a good thing. For a developer or system architect, it can cause some serious headaches. That is what the R&D and Operations departments at Copernica started experiencing roughly one year ago. With an ever-growing number of users came increasing load on all of our systems and virtually exponential growth of the data we manage. At some point the load on our systems reached a critical point and performance started to suffer. Systems that had previously performed very well became sluggish, and recovering from hardware failure became practically impossible. All of this hurt the speed and availability of our software, and the user experience suffered as a result.
Identifying the bottlenecks
So we realized we had a challenge on our hands, but where do you start? While researching the issues we were facing, it became apparent that most of the pain was caused by high IO load. We also soon realized that the only proper solution was to make lighter, better software that generates less IO load. That, of course, cannot be realized overnight, so we trucked in loads of new servers, switches and other network hardware so that our infrastructure could cope with the software as it stood. We could now fight the war on bad performance on two fronts: the R&D department was tasked with the painstaking process of identifying and fixing bottlenecks in the software, while the Operations department expanded and improved the server park.
MySQL + centralized storage = badness
We have a couple of (supposedly) blazing fast Dell EqualLogic storage units connected to a dedicated 10GbE fiber network. Surely that can't be a bottleneck? It turns out it can be. Dell EqualLogics, and SAN systems in general, are great for general-purpose storage, but handle the read/write mix caused by MySQL databases quite badly. So there we were, disillusioned owners of these wildly expensive SAN units, which, for this workload, had more value as scrap metal than as a storage system. As it turns out, by far the best solution for MySQL databases is fast (and cheap) local storage. We opted to equip our database servers with 300GB 15k RPM SAS drives, which, attached to a proper RAID controller, perform very well. Of course, we did not trash the EqualLogics: they are now used to replicate our databases for backup purposes, which also allows us to recover from hardware failure in a matter of minutes.
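Replicating to the old SAN only helps with fast recovery if the replicas actually keep up with the primary. To give a flavor of what such a check could look like, here is a minimal monitoring sketch, not our actual tooling: it assumes classic MySQL/MariaDB replication and the PyMySQL client, and the host name, credentials and threshold are invented for illustration.

    # Minimal sketch: check how far a backup replica lags behind the primary.
    # Assumes classic MySQL/MariaDB replication and the PyMySQL client;
    # host, credentials and threshold are illustrative only.
    import pymysql

    MAX_LAG_SECONDS = 60  # hypothetical alert threshold

    def replica_lag(host, user, password):
        conn = pymysql.connect(host=host, user=user, password=password,
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cursor:
                cursor.execute("SHOW SLAVE STATUS")
                status = cursor.fetchone()
        finally:
            conn.close()
        if status is None:
            raise RuntimeError("%s is not configured as a replica" % host)
        return status["Seconds_Behind_Master"]

    lag = replica_lag("db-replica.example.local", "monitor", "secret")
    if lag is None or lag > MAX_LAG_SECONDS:
        print("WARNING: replica lag is %s seconds" % lag)
    else:
        print("replica is %d seconds behind the primary" % lag)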
Use more servers, rather than heavier ones
For serving web requests, we moved from a few really heavy servers to several dozen lightweight virtual ones. Session rates have increased dramatically.
The same holds true for our database servers: evenly spreading the load across a large number of lightweight servers easily beats having just a few really big ones. The process of redistributing databases is quite slow, and we are still working on it.
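To give an idea of how databases can be spread evenly over a pool of small servers, here is a generic consistent-hashing sketch: adding or removing a node only relocates a small share of the databases. This illustrates the principle, not the distribution scheme we actually use, and the host names are placeholders.

    # Sketch: map database names onto a pool of lightweight servers with
    # consistent hashing, so adding or removing a node only relocates a
    # small share of the databases. Generic illustration only; the host
    # names are placeholders.
    import bisect
    import hashlib

    class HashRing:
        def __init__(self, nodes, replicas=100):
            self._ring = []  # sorted (hash, node) points around the ring
            for node in nodes:
                for i in range(replicas):
                    bisect.insort(self._ring, (self._hash("%s#%d" % (node, i)), node))
            self._hashes = [h for h, _ in self._ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def node_for(self, database):
            """Return the server that should host the given database."""
            index = bisect.bisect(self._hashes, self._hash(database)) % len(self._hashes)
            return self._ring[index][1]

    ring = HashRing(["db%02d.example.local" % n for n in range(1, 21)])
    print(ring.node_for("customer_4711"))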
Of course, the above conclusions were good news to both our hardware suppliers and data center.
Improve network performance
We now have an impressive number of physical servers and an even more impressive list of virtual ones. We initially virtualized servers using KVM/libvirt, which delivered rather poor network performance: there are many layers between the guest and the actual hardware, making things overly complicated and, above all, slow. We have since moved to Linux containers (LXC), which outperform their KVM counterparts in almost every conceivable way. On top of that, they are also more flexible and easier to manage. Double win!
We decreased internal network latency by replacing all of our existing switches with Juniper EX3300s, interconnected with 10Gbps fiber and stacked into several virtual chassis. With this setup we can aggregate two network links per server, essentially doubling their network capacity to 2Gbps.
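For completeness: the aggregation also has to be configured on the server side. The fragment below is a minimal Debian-style /etc/network/interfaces sketch assuming LACP (802.3ad) bonding of two gigabit interfaces; the interface names and addresses are placeholders and our actual host configuration may differ.

    # Sketch only: LACP (802.3ad) bond of two gigabit NICs, Debian-style
    # /etc/network/interfaces syntax (requires the ifenslave package).
    # Interface names and addresses are placeholders.
    auto bond0
    iface bond0 inet static
        address 192.0.2.10
        netmask 255.255.255.0
        gateway 192.0.2.1
        bond-slaves eth0 eth1
        bond-mode 802.3ad
        bond-miimon 100
        bond-xmit-hash-policy layer3+4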
That was the easy part
Despite all the tech terms and name-dropping above, this was actually the easy part. The really major work went into rewriting several parts of the software to rely less on IO. Some interesting new technologies, such as Couchbase and RabbitMQ, were also introduced, which by their very nature are far less IO-intensive than the MySQL setup they replace. We have also almost entirely replaced MySQL itself with MariaDB. Going into the exact optimizations would get boring pretty quickly, but suffice it to say that the software optimizations took the better part of a year to complete.
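To give a flavor of what "less IO-intensive" means in practice: instead of writing every event to the database during the web request, work can be handed to a durable queue and processed later in batches. The snippet below is a minimal sketch using the pika client for RabbitMQ; the queue name and payload are invented for illustration and do not reflect our actual message format.

    # Sketch: hand work to RabbitMQ instead of writing it straight to the
    # database, so the web request finishes quickly and a consumer can
    # process events in batches later. Uses the pika client; queue name
    # and payload are illustrative only.
    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()

    # A durable queue survives broker restarts.
    channel.queue_declare(queue="profile_updates", durable=True)

    # Publish a persistent message describing the work to be done later.
    channel.basic_publish(
        exchange="",
        routing_key="profile_updates",
        body=json.dumps({"profile_id": 12345, "field": "city", "value": "Amsterdam"}),
        properties=pika.BasicProperties(delivery_mode=2),  # mark message persistent
    )

    connection.close()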
So are we there yet?
Unfortunately not, but then again, if we were, we would all be out of a job. There are still portions of the software that need attention in terms of speed and overall performance, and work continues on the network as well, both to provide better redundancy and to improve speed. Looking back at the past year, the amount of work that has been done is immense. Now that we've got the software back to a usable state, the next step is to make it brilliant, so we still have our work cut out for us.