Performance & Scalability

The piece of code we are about to discuss was used in production software, and I was tasked with reviewing the patch. See if you can spot *any* issues with the code. Later we will look at the exact observations, which are hugely interesting and instructive regarding system behavior.

The Query

This query retains the top n records of a particular type and deletes the other rows of the same type from a Sybase database.

for (int i = 0; i < ListOfTypes.Count; i++)
    ExecuteQuery("delete from table where PrimaryKey NOT IN (select top n PrimaryKey from table where Id = " + i + ") AND Id = " + i);
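
For concreteness, here is a runnable sketch of the same keep-top-n-per-type pattern, using Python's sqlite3 (SQLite's LIMIT standing in for Sybase's TOP; the table and column names are invented for illustration):

```python
import sqlite3

# In-memory stand-in for the production table (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (pk INTEGER PRIMARY KEY, type_id INTEGER)")
conn.executemany("INSERT INTO events (type_id) VALUES (?)",
                 [(t,) for t in range(3) for _ in range(100)])

n = 20  # records to retain per type
for type_id in range(3):
    # One dynamic delete per type, exactly like the loop above.
    conn.execute(
        "DELETE FROM events WHERE pk NOT IN "
        "(SELECT pk FROM events WHERE type_id = ? ORDER BY pk LIMIT ?) "
        "AND type_id = ?",
        (type_id, n, type_id))
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(remaining)  # 60: 20 rows kept for each of the 3 types
```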

Observation – No Data to be Deleted in the database

  1. CPU consumption during query execution is negligible when there are no matching rows to delete. The fact that this code issues multiple dynamic queries that have to be parsed does not seem to be much of an issue for Sybase.
  2. 2000 queries took around 3 seconds overall, at around 12% CPU cost, on a Xeon box.

Observation – delete finds rows for deletion

  1. With data to be deleted, 30 loop iterations took 5 minutes; extrapolated to 2000 iterations, that is over 5 hours of deletes.
  2. The CPU consumption figure remained the same.

Why are Database Deletes So Slow?

Relational databases track everything using logs, also called transaction logs because they record all the transactions, aka changes, that happen to the database. The ACID nature of transactions is guaranteed using these logs. So when you make a change to the database, a log record is written to disk.

The actual change is only made in memory and not written to disk immediately, for efficiency's sake. But logs are always written promptly and without fail, so that if the machine crashes or a rollback is required, the log can be used to reconstruct the state of the database with minimal loss of data. This means that every individual delete makes the disk spin, and spinning disks are horridly slower than anything else you can think of.

How slow can the disks be?

Numbers from a talk by Jeff Dean of Google (who works on both MapReduce and BigTable), as reported here:

Comparative slowness aka latency in terms of CPU cycles

Floating point calculation = 0.25 to 3 cycles
L1 cache access            = 3 cycles
L2 cache access            = 14 cycles
DRAM access                = 200 cycles
Flash access               = 3 million cycles
Disk access                = 15 million cycles

Comparative slowness aka latency in absolute time

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                           100 ns
Main memory reference                       100 ns
Compress 1K bytes with XXX               10,000 ns
Send 2K bytes over 1 Gbps network        20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                            10,000,000 ns
Read 1 MB sequentially from network  10,000,000 ns
Read 1 MB sequentially from disk     30,000,000 ns
Send packet CA->Netherlands->CA     150,000,000 ns
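
A little arithmetic on the figures above makes the gap concrete (a rough sketch using the quoted numbers, not fresh measurements):

```python
# Rough arithmetic on the latency figures quoted above.
disk_seek_ns = 10_000_000
memory_ref_ns = 100
l1_ref_ns = 0.5

seeks_per_second = 1_000_000_000 / disk_seek_ns
print(seeks_per_second)               # 100.0 seeks per second, at best
print(disk_seek_ns / memory_ref_ns)   # 100000.0 memory references per seek
print(disk_seek_ns / l1_ref_ns)       # 20000000.0 L1 hits per seek
```

So one flushed log write per delete buys you, at most, on the order of a hundred deletes per second per spindle.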

Can you speed up the disks?

SCSI disk array on a server


You bet! We tried to speed up this whole experiment by moving the tests from an IDE-based system (with a 7500 rpm drive) to a system with a single 15k RPM SCSI disk plus a SCSI controller with no on-board memory (the picture has six disks – we used just one of them).

The results from moving to the SCSI system –

  1. The delete time improved by a factor of 4-5, and the worst-case average time dropped to around 133 minutes from 5 hours.
  2. However, with the SCSI system in use, CPU consumption during query execution also jumped by the same 4-5 times: the engine blocks on disk less often, and therefore spends correspondingly more time (4-5 times as much) actually executing queries.

As you can see, it would be easy to calculate how many disks we would need to bring the deletes down to a manageable time. I might have to go with a couple of SCSI disks in a RAID configuration to achieve the speeds I want. At the same time, the database engine's CPU consumption would have to be addressed separately, so that it does not have to do as much work finding the rows to be deleted.
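
A back-of-envelope version of that calculation, using the figures from this experiment and assuming delete throughput scales roughly linearly with the number of striped spindles (a big assumption – real RAID scaling is less than linear) and a hypothetical 30-minute target:

```python
import math

# Figures from the experiment above; the target window is hypothetical.
single_scsi_minutes = 133     # worst case measured on one 15k SCSI disk
target_minutes = 30           # acceptable purge window (assumed)

# Naive linear-striping estimate of spindles required.
disks_needed = math.ceil(single_scsi_minutes / target_minutes)
print(disks_needed)  # 5 disks, if throughput really scaled linearly
```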

Is there any way to improve this?

  1. Use Transactions –

    The SQL Server engine flushes log records to disk at commit time, so if you ensure all your changes are made as part of a single transaction, the disks are hit far less than if each change committed individually. I am not sure how this works vis-à-vis Java and the Sybase engine with which this query was tested.

  2. Do not use a database for real-time stuff
    An even better solution would be not to rely on the database at all and instead do all this in a specialized in-memory component – a thread-safe hash or something equivalent. Use the database only as a slow backup store for failure recovery. This would give incomparable throughput compared to any RDBMS / disk-based solution.
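
The effect of point 1 is easy to demonstrate outside Sybase. Here is a sketch using Python's sqlite3 against a file-backed database, where each autocommitted statement forces its own journal write but a single wrapping transaction does not (timings vary wildly by machine, so only the end state is checked):

```python
import os
import sqlite3
import tempfile
import time

# File-backed DB so commits actually touch the disk.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
conn.execute("CREATE TABLE t (pk INTEGER PRIMARY KEY)")
conn.execute("BEGIN")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(2000)])
conn.execute("COMMIT")

start = time.perf_counter()
for i in range(1000):                # one implicit transaction per delete
    conn.execute("DELETE FROM t WHERE pk = ?", (i,))
per_statement = time.perf_counter() - start

start = time.perf_counter()
conn.execute("BEGIN")                # one transaction for the whole batch
for i in range(1000, 2000):
    conn.execute("DELETE FROM t WHERE pk = ?", (i,))
conn.execute("COMMIT")
batched = time.perf_counter() - start

remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(per_statement, batched, remaining)  # batched is usually far smaller
```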

My suggestion to improve CPU (untested)

  1. Move the entire loop into a stored procedure.
  2. Run a select to get the top n rows and extract the ID of the bottom one.
  3. Run a delete to remove all rows with ID greater than the extracted ID. The advantages: you remove the possible query-compilation cost of executing 2000 separate queries, and at the same time avoid the possibly highly inefficient NOT IN operator. This needs to be verified, though.
  4. Remove all foreign key references to the current table – but of course that would be going against the grain of the RDBMS concept itself. For desperate situations, though, this might work.
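
Suggestions 2 and 3 can be sketched concretely. A minimal illustration with Python's sqlite3, assuming (as the suggestion does) that the primary key grows monotonically, and reading "top n" as the n smallest keys; all names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (pk INTEGER PRIMARY KEY, type_id INTEGER)")
conn.executemany("INSERT INTO events (type_id) VALUES (?)",
                 [(1,) for _ in range(100)])

n = 20
# Step 2: extract the bottom (largest) pk among the top n.
boundary = conn.execute(
    "SELECT pk FROM events WHERE type_id = 1 ORDER BY pk LIMIT ?",
    (n,)).fetchall()[-1][0]
# Step 3: a simple range delete – no NOT IN required.
conn.execute("DELETE FROM events WHERE type_id = 1 AND pk > ?", (boundary,))

kept = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(kept)  # 20
```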

What was implemented in the end?

Well, the production software had this delete sequence running every 30 minutes. The number of events the system claims to support is only up to 25 per hour. At that rate there would be no pile-up and no big deletes. So we don't have a problem!
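
The arithmetic behind that conclusion is trivial to check, using the figures quoted in this paragraph:

```python
# Events arriving per purge window, at the claimed maximum rate.
events_per_hour = 25
purge_interval_hours = 0.5   # the delete sequence runs every 30 minutes

events_per_window = events_per_hour * purge_interval_hours
print(events_per_window)  # 12.5 rows at most accumulate between purges
```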

That’s easy – you start by looking at which parts of your code are the most expensive, and keep drilling down into the costly calls until you find which part of the code really is the culprit. Then you optimize it. Easy, isn’t it?

If you do not have ways to do this, your tool-set sucks.

[I use the Intel Thread Checker]

Here is how some of the biggest areas in the industry manage their performance figures:

  1. Databases – Oracle / MySQL
  2. Internet – Linked In / WikiPedia
  3. Hardware – File system / Disk & Memory Subsystems

Anywhere latency is involved, caching seems to make a huge difference in the results obtained. Caching is still state of the art. When deployed along with parallelization, the combination makes for killer applications. Memcached and Wide Finder come to mind.

I did once try to write a couple of articles on creating scalable applications. However, the complexity that scale brings makes each effort unique, and therefore not duplicable.

So I am settling for this single article instead, which will serve as the starting point for any future thoughts around scalability in this space.

Here is a basic checklist of items which I feel ought to be considered for any scale experiment.

1. Dispute those numbers – Despite everything you know, the numbers that bring the scale to your application (1 million records, or 1K events per second, etc.) might be summarily wrong or not required. Questioning the numbers, especially if they end with nice comfortable zeros, like 20,000 devices or 3 million records, might be the wisest thing you can do before embarking on a treacherous journey.

2. Prototype for measurement – Build as many small mock-ups as are required to roughly extrapolate to the core functionality that is required. This will let you measure the basic performance levels the final application might demand. You might find that your application will be constrained by CPU, disk, memory, or network. If your numbers suggest that none of these are a problem to begin with, congratulations – go ahead and build your application. However, if you find that the numbers required for your application are restricted by one or more of these basic resources, that becomes the basis of your "what to scale" thoughts.

3. Know how much data you have to handle – The data arising from banking transactions will not be the same as the data handled on a YouTube-like network. The data scale affects everything else – the network, the disk, memory, and the CPU that has to process and transmit all that data. Of course, the numbers arising from step 2 should also confirm this; if your estimates are too low or too high, either your tests or your assumptions are wrong. This is the time to get them straight.
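
The kind of sanity arithmetic this step calls for can be a one-liner. A sketch with entirely hypothetical numbers – does a claimed event rate even fit through the network link?

```python
# Hypothetical figures: 1000 events/sec of 2 KB each over a 1 Gbps link.
events_per_sec = 1000
bytes_per_event = 2048
link_bytes_per_sec = 1 * 1024**3 / 8   # 1 Gbps in bytes/sec

utilization = events_per_sec * bytes_per_event / link_bytes_per_sec
print(round(utilization * 100, 2))  # ~1.53% of the link: not a bottleneck
```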

4. Decide how to scale – Once it is clear from your mock-up numbers that some of your resources will be constrained (CPU / disk / memory / network), decide how you would like to scale them. Disk can be scaled using a disk array or RAID device. Memory can be scaled by moving to a 64-bit platform and selecting a chassis that lets you access all the memory you want. CPU can be addressed by one of the fabulously multi-cored devices, and so on. If you are lucky, a combination of these approaches will contain your scale limits within a single server.

If individually addressing your scale limits does not contain your full hardware requirements, obviously you are going to have to move beyond a single machine. At this point, revisit points 1 to 3 in the faint hope of avoiding having to architect and build an entirely new distributed system.

This is the time to think out of the box and avoid costly operations or redefine how you need to go about your tasks. For example if your application is a giant spam search engine, redefining the way the core engine identifies the spams might help you beat the scale game before it starts.

5. Decide how to distribute – This is one of the toughest decisions you have to make when creating scalable systems. As the word system connotes, every distributed application is a system in its own right. The combinations and choices that can be made vary enormously depending on the exact nature of your problem. The rest of the points address the nature of your distribution.

6. Decide what has to be broken up – Once you have decided that you are going to have to distribute your application, the next decision is how to break your system into pieces that you can distribute. The strategies that can be used to make this decision are varied:

a. Break by resource consumption – each DB table on its own machine?

b. Break by functionality – order processing on one machine, order validation on another machine.

c. Break by customer requirement – your customer cannot support 20 boxes all across the country and wants them in a single place.

7. Decide the communication requirements – how much, how frequent, and how reliable – Once the component breakup is available, the next step is to decide how these components are going to communicate. The decision will have a bearing on your scale: if you cannot communicate more than 10 pieces of data per second, that's your scale limit for the whole system. The old adage that a chain is only as strong as its weakest link is extremely true at this juncture. The options you have available include:

a. Message passing systems – omne / MSMQ / TIBCO / MQSeries / etc.
   Typically good for financial and banking systems.
b. HTTP with a SOAP payload, etc.
c. RPC mechanisms like DCOM / CORBA
d. Custom

8. Decide how failures / upgrades will be addressed – Once your application is broken into pieces, individual pieces can go down or require upgrades. Knowing these requirements in advance will let you judge whether users will find the system acceptable. If entire transactions have to be stopped every time one of the computers fails, good luck selling your system to banking customers. A backup system going down, however, might not be so catastrophic, as long as the data is good and safe.

9. Decide what your data storage is going to look like – Having data storage as the last piece of your design might be blasphemous to some folks. But my rationale is that this lets you look into all the other aspects of your system without being burdened by DB-spawned, acronym-based designs, or being dictated to by a single design philosophy revolving around relational or object databases and such.

a. Decide what role data storage will play – It is common for small implementations to use a central DB server in a central role, making use of its multitude of facilities for transactional updates, automatic data verification, and scripting support. However, not all designs can afford a central database at their core, simply because scaling this complex piece of software might prove extremely costly and complicated. In one case, we were able to break all our scale constraints by not keeping the DB central to our transactions. Instead the DB served simply as a dump, and in-memory custom-coded systems provided all the speed we ever required. (This was done for equity trading.)
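
A toy version of that "DB as a dump" idea: a thread-safe in-memory store that answers all reads and writes, with a flush hook where the slow, asynchronous database dump would go. Entirely hypothetical – a sketch of the shape, not the trading system described above:

```python
import threading

class InMemoryStore:
    """Thread-safe hash serving all traffic; the DB is only a dump target."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self._dirty = set()   # keys changed since the last flush

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._dirty.add(key)

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def flush(self, dump):
        # 'dump' stands in for the slow database write path.
        with self._lock:
            for key in self._dirty:
                dump(key, self._data[key])
            self._dirty.clear()

store = InMemoryStore()
store.put("order:1", {"qty": 10})
dumped = {}
store.flush(lambda k, v: dumped.__setitem__(k, v))
print(dumped)  # {'order:1': {'qty': 10}}
```

Reads and writes never wait on the disk; only the periodic flush does.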

10. Re-evaluate – Using each of the points listed, re-evaluate and revisit the other characteristics. For example: does file-based storage make things easy on the disk, but how are backups and failures handled? Will the message-passing system handle blobs of data bigger than x KB? Will transactional writes on the DB scale to the number of entries you want to make?

Every decision made will have a bearing on the others. Balancing all of them against each other is what makes the architecture perfect.

Oh !!! and –

11. Keep measuring all the time – The main reason scale problems come up is that we hit choke points, and measurement is the only way one can know where we stand with respect to the scale numbers. One of the major reasons things go drastically bad with such big systems is that the assumptions made at the outset no longer hold true by the time the system is built and ready to go. Perhaps a requirement changed somewhere, or someone decided to go a different way. Keep a watch on your assumptions and keep measuring to ensure the numbers still hold true.

Happy scaling !!!

So you want to create a scalable solution, but confidence seems to be low overall. There are shadows of doubt – can we do it? Will this really take off? Is it too ambitious? So on and so forth.

But really – why is scalability hard? Hasn't all this been done before? I mean seriously, there are huge web sites out there that already do this sort of thing, don't they? Whatever we want to do is not really a path-breaking scientific effort. So why all this hoopla around scalability, and why so many doubts?

Google 2.3?

The answer lies in a simple observation. How many Amazon software versions have you seen on sale? How many shrink-wrapped Facebook or Flickr versions have you come across? You don't see Yahoo 2.0 or Google 2.3 out there, do you?

Yes, it's true that everything we want to do has been done many times before. There have been countless instances of huge scale, achieved by many teams, probably in better or more complex ways than we can even start to imagine. So let's examine why shrink-wrapped scalable applications are not readily available the way other offshore-able work is.

Compare a public media storage application like YouTube against a massively available stock trading network. The challenges each faces are very different. While a YouTube-like service needs to scale terabytes upon terabytes of readily available disk storage, a stock trading network mostly wants to control the network latency for sending updates. The data controlled by Flickr is not so critical, but the stock system needs totally fool-proof accountability for all the data it processes. The requirements are totally different, and the way you would do things is totally different too. A solution created in one domain might not be readily adaptable to another.

Even taking two efforts in the same domain, the approaches to scale can be totally different. If site A chose Java plus app-server-based technology, site B might go for PHP with a MySQL-based approach. The models used, the implementation technologies, and finally the approach used to scale could all differ, based on the design tradeoffs that are made.

Moreover, at the scales that we are talking about, even a single requirement can cause a ripple that needs to be adjusted and accounted for throughout the system.

For example, take the case of a basic content management system, and assume that site A has an additional requirement to send change emails as a tracking mechanism every time something changes. A simple functionality change such as this can have profound implications for the latency of operations, the network bandwidth required, configuration properties, and such.

As you can imagine, no two war stories are going to be the same. Custom scripting, tweaks, configuration headaches – you name it, all of it will differ from implementation to implementation.

do-it-yourself-skyscraper kit?

More importantly, building a system that scales is a totally different effort from building one that scales and can be sold commercially. Usually the economies of scale dictate that not many folks require such a big solution in any domain. Businesses that require such high levels of technology tend to require lots of customization, which makes these huge solutions essentially very different from one another. Then again, technology is often viewed as a competitive advantage, and companies prefer to roll their own rather than buy a ready-made solution. If the Flickr software were commercially available (assuming that's possible), how many buyers would you anticipate? How would one Flickr distinguish itself from another?

Therefore it is the components of such big efforts – the database, the web server, the application server – that are generally available. The glue that ties them all into one single big comprehensible (and usually messy) high-performance solution is extremely specific and totally incompatible from effort to effort.

Further, those components of your architecture are usually tuned to be easy to use for the largest number of customers, because that's probably where the money is. No architect I know would bat an eyelid at a solution that requires a central database of normal size (2-4 GB) taking a fairly decent number of hits. Everyone knows the component will work out of the box, probably aided by some small customizations the installation can do for you.

Building big is the pain, not scalability

The moment you deviate from the norm, things start becoming more complicated and less user-friendly. The assumption is that if a customer's solution requires such work, he is also prepared to bear the extra cost of employing a special administrator, like a DBA, or can pay for a special installation by the solution experts. Often this is also necessitated by the sheer amount of configuration required. Any DBA will tell you that no database of a fairly large size can run properly without custom tweaks (de-normalize table A, put table B on a different server, add an index on table C and remove the one on table D). Creating a self-managing, maintainable big solution is therefore a really big challenge.

A single installable that puts your favorite editor on your laptop is therefore a far cry from these beasts of complexity, with their huge amounts of configuration and the weeks of hands-on learning required just to become fairly conversant with the entire system.

That's right: these big beasts are systems in their own right. They are never referred to as products, for exactly that reason. These efforts are created as solutions for a range of problems that need to be tackled in an ecosystem, and as a result they encompass a vast range of technologies and platforms. This in turn usually translates into tons of custom code and glue, rarely designed as reusable, sellable, generic libraries precisely because they are so custom-made in nature.

Any project that is fairly large in nature attracts a different set of laws, penalties, and processes compared to the smaller efforts people are normally used to.

Sell it too?
Oh, and if you are trying to make a resellable scalable (aka big) solution – well, that's much more of an effort than simply building one that works. You have to make all those custom configurations, scripts, and tweaks customizable and easy to find. This often means the system has to take care of itself, be auto-configurable to an extent, and have built-in self-diagnostics. That's a tall order for anyone or anything in engineering.

This is why creating scalable applications is such a big deal and is still a new problem every time you deal with it. What you are attempting to build is simply too big a project.

It's D-day: the project is functionally ready, everyone's beaming, the code is checked in, the bug count is in single digits, and the project is ready to release. Then, out of the blue, the project gets swamped by swarms of last-minute glitches relating to performance. From memory leaks to database bottlenecks to pure response time, a series of firefights and heated meetings ensues, causing embarrassment and slipped schedules all around. Sounds familiar?

However, as in most such cases, true improvements or fixes are rarely possible, given the effort they represent. Rarely is the architecture flexible enough to accommodate order-of-magnitude improvements without major rewrites. Instead, cover-ups and shove-it-under-the-seat approaches, or spec reductions and inflated minimum system requirements, are what get attempted. Depending on the soft skills and magnanimity of the team involved, the product is sometimes pushed out to face endless cycles of customer support – or, better still, killed off before being released.

Isn't premature optimization evil (a warning attributed to Tony Hoare and restated by Donald Knuth)? So what gives?

It turns out that for most people, 'functionally complete' does not implicitly translate into 'works fast'. I wonder if the same folks would consider a car functionally complete if 10 miles per hour were the top speed it could reach.

In this particular case, the primary culprit was that none of the engineers who had worked on the project had much experience with the particular database being used – yet the architecture was designed around a centrally important database that held and served all the data flowing through the system. Consequently, bad database coding practices threw the worst punch when it came to performance. There was no way the product could be fast without major rewrites. The net output was pretty, but somewhat unusable, and could not be justified for release.

The keyword here is 'premature'. Knuth and Hoare never meant the advice as an abandonment of basic, sane performance levels. Given the nature of the advice, one would expect people not to worry about teeny-tiny optimizations – moving that extra if-check out of a loop, manually unrolling instructions, and so on – unless measurement showed they were warranted. But not considering performance during the initial and intermediate stages is nothing but negligent design.

However, it is hard to think in terms of basic engineering once you start implementing code in terms of alien frameworks, tools, and libraries. The implementation is hidden, and the net effect comes out only towards the end. (A case for open source?) Unless someone with expertise in each of these closed areas is present, it is difficult to get things right the first time a new team attempts these technologies.

It is impractical, though, to expect an expert in each of the scores of tools and frameworks (count your acronyms) that are part of any modern project. Normal IT shops should therefore seriously consider premature planning and architecting for optimization. Getting extremely serious about performance from the very beginning is a good idea; you can wait until the end to get paranoid. Adopting best practices, and making it a point to adhere to them across the entire code base before you start churning out code, is sensible engineering and prudent planning. Anything short of this is plain and simple negligence.

Googling 'premature optimization' surprisingly turned up similar thoughts echoed around the web. Wise advice abounds in the collective wisdom of the web.

Another take on the same idea – CPU cycles might be cheap, but your customer's time spent waiting on that slow product of yours ain't.

Advice from Joel: don't start a new project without at least one architect with several years of solid experience in the language, classes, APIs, and platforms you're building on.

How do you train trainee programmers? Simple: you do as the Shaolin masters do – keep them working and working and working on the same thing till light dawns, i.e. the light of wisdom.

My sister, who joined an IT services company, gets tasked with writing many such tools every week: implement grep, implement ls, implement malloc, and so on.

I get tasked with validating the output, and sometimes I try some myself before advocating an approach and piling advice on good ol' sis.

So how would you write a small and fast grep equivalent? Any thoughts? Think for a moment before you read further.

Obviously I needed lots of clever optimizations, right? The question is which ones. But before I optimized the hell out of the little grep my sis wrote, like any self-respecting hacker I sat down to compare its performance against a suitable competitor. I chose the DOS find program, since at the time I was working on a VC++ project back at the cube. These are the numbers –

myfind – runs in a 600K working set with a VM size of 108K – CPU hits 50%
DOS find – climbs up to a 200 MB working set with a VM size of 1500K – CPU hovers around 20%

myfind – finishes searching in 5-7 secs
DOS find – takes 1 minute!

Excuse me, but did my kid sister just whack the hell out of code that's been shipping for years? Definitely not, I suppose. Perhaps Microsoft never wanted to optimize FIND for huge files. Perhaps it works faster on small files. Perhaps my program is broken. Perhaps DOS FIND does many more super-smart things when it searches for a substring. Maybe it does not want to fry the processor. Maybe my measurement was off.
But the point is, we measured before optimizing – and wow, does it make a difference. The performance profile of my program was exactly the way I liked it.

I don't think all of us will get this lucky every time we write code. But it definitely pays to see the light – "premature optimization is the root of all evil" – Knuth.

GREP code