I once tried to write a couple of articles on creating scalable applications. However, the complexity that scale brings makes each effort unique and therefore hard to replicate.

So I’m settling for this single article instead, which will serve as the starting point for any future thoughts on scalability in this space.

Here is a basic checklist of items which, I feel, ought to be considered for any scale experiment.

1. Dispute those numbers – Despite everything you know, the numbers that bring scale to your application (1 million records, 1K events per sec, etc.) might be simply wrong or not required. Questioning them, especially if they end in nice comfortable zeros, like 20,000 devices or 3 million records, might be the wisest thing you can do before embarking on a treacherous journey.
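One quick way to question a headline number is to turn it into a rate and see whether it still looks scary. A minimal sketch, assuming the 3 million records arrive over one day and that peaks run at 10 times the average (both are illustrative assumptions, and the function names are made up for this example):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(total_per_day: float) -> float:
    """Average events per second implied by a daily total."""
    return total_per_day / SECONDS_PER_DAY


def peak_rate(total_per_day: float, peak_factor: float) -> float:
    """Rough peak rate, assuming peaks are `peak_factor` times the average."""
    return average_rate(total_per_day) * peak_factor


if __name__ == "__main__":
    daily_records = 3_000_000  # assumed to be a per-day figure
    print(f"average: {average_rate(daily_records):.1f} records/sec")
    print(f"peak:    {peak_rate(daily_records, 10):.1f} records/sec")
```

Under these assumptions, "3 million records" works out to a few dozen records per second on average, which is a very different conversation from the one the round number suggests.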

2. Prototype for measurement – Build as many small mock-ups as are required to roughly extrapolate to the core functionality that is needed. This will allow you to measure the basic performance levels that might be required of the final application. You might find that your application will be constrained by CPU, disk, memory or network. If your numbers suggest that none of these are a problem to begin with, congrats !!! Go ahead and build your application. However, if you do find that the numbers required for your application are restricted by one or more of these basic resources, that would serve as the basis of your “what to scale” thoughts.
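A mock-up for measurement does not need to be elaborate. The sketch below times a stand-in operation and reports operations per second. `mock_process_record` is a hypothetical placeholder for whatever your core operation really is, and the result only means something on hardware comparable to your target:

```python
import time


def measure_throughput(operation, iterations: int = 10_000) -> float:
    """Run `operation` repeatedly and return operations per second."""
    start = time.perf_counter()
    for i in range(iterations):
        operation(i)
    elapsed = time.perf_counter() - start
    return iterations / elapsed if elapsed > 0 else float("inf")


def mock_process_record(i: int) -> int:
    # Hypothetical stand-in for the real core operation (CPU-bound here).
    return sum(ord(c) for c in f"record-{i}")


if __name__ == "__main__":
    rate = measure_throughput(mock_process_record)
    print(f"{rate:,.0f} records/sec on this machine")
```

Swapping the mock for one that writes to disk or sends data over the network gives a first rough reading for those resources too.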

3. Know how much data you have to handle – The data arising from banking transactions might not be the same as the data that has to be handled on a YouTube-like network. The data scale will affect everything else, like the network, the disk, the memory and the CPU, if it has to process / transmit all that data. Of course, the numbers arising from step 2 should also confirm this. So if your estimates are far lower or far higher than those measurements, either your tests or your assumptions are wrong. This is the time to get them straight.
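A rough data-volume estimate is simple arithmetic. The sketch below takes the 1K events per sec figure mentioned earlier and assumes 512 bytes per event, which is purely an illustrative guess; plug in your own measured size:

```python
SECONDS_PER_DAY = 86_400


def bytes_per_day(events_per_sec: float, bytes_per_event: int) -> float:
    """Raw data volume generated in one day."""
    return events_per_sec * bytes_per_event * SECONDS_PER_DAY


def required_mbit_per_sec(events_per_sec: float, bytes_per_event: int) -> float:
    """Sustained network bandwidth needed just to move the payload."""
    return events_per_sec * bytes_per_event * 8 / 1_000_000


if __name__ == "__main__":
    events, size = 1_000, 512  # 512 bytes per event is an assumption
    print(f"{bytes_per_day(events, size) / 1e9:.1f} GB/day")
    print(f"{required_mbit_per_sec(events, size):.3f} Mbit/sec")
```

This ignores protocol overhead, indexes and replication, so treat the result as a floor rather than a forecast.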

4. Decide how to scale – Once it is clear from your mock-up numbers that some of your resources will be constrained (CPU / disk / memory / network), decide how you would like to scale them. Disk could be scaled using a disk array or RAID device. Memory can be scaled by moving to a 64-bit platform and selecting a chassis that lets you address all the memory you want. CPU can again be addressed by one of the fabulously multi-cored devices, and so on. If you are lucky, a combination of these approaches should contain your scale limits within a single server.
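To see which resources need attention, you can line up what the mock-ups say you need against what one server offers. A minimal sketch follows; the resource names and all the figures are hypothetical:

```python
def constrained_resources(required: dict, capacity: dict) -> list:
    """Return resources whose requirement exceeds capacity, worst first."""
    ratios = {name: required[name] / capacity[name] for name in required}
    over = [name for name, ratio in ratios.items() if ratio > 1.0]
    return sorted(over, key=lambda name: ratios[name], reverse=True)


if __name__ == "__main__":
    # Hypothetical requirement (from mock-ups) and single-server capacity.
    required = {"cpu_cores": 12, "memory_gb": 96,
                "disk_mb_per_sec": 200, "network_mbit_per_sec": 400}
    capacity = {"cpu_cores": 16, "memory_gb": 64,
                "disk_mb_per_sec": 100, "network_mbit_per_sec": 1000}
    print(constrained_resources(required, capacity))
```

An empty list means the single server might just do; anything else tells you where to start scaling.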

If individually addressing your scale limits does not cover your full hardware requirements, you are obviously going to have to move beyond a single machine. At this point, revisit points 1 to 3 in the faint hope of avoiding having to architect and build an entirely new distributed system.

This is the time to think outside the box and avoid costly operations or redefine how you go about your tasks. For example, if your application is a giant spam search engine, redefining the way the core engine identifies spam might help you beat the scale game before it starts.

4. Decide how to distribute – This is one of the toughest decisions you have to make when creating scalable systems. As the word system connotes, every distributed application is a system in its own right. The combinations and choices available will vary widely depending on the exact nature of your problem. The rest of the points address the nature of your distribution.

5. Decide what has to be broken up – Once you have decided that you are going to have to distribute your application, the next decision you will have to make is how to break up your system into pieces that you can distribute. The strategy used to make this decision can vary:

a. Break by resource consumption – Each DB table on its own machine?

b. Break by functionality – Order processing on one machine, order validation on another machine.

c. Break by customer requirement – Your customer cannot support 20 boxes all across the country and wants them in a single place.

6. Decide the communication requirement – how much, how frequently and how reliably – Once the component breakup is available, the next step is to decide how these components are going to communicate. The decision will have a bearing on your scale. If you cannot communicate more than 10 pieces of data per sec, that’s your scale limit for the whole system. The old adage that a chain is only as strong as its weakest link is extremely true at this juncture. The options you have available could be:

a. Message passing systems – omne / MSMQ / TIBCO / MQSeries, etc. Typically good for financial and banking systems.
b. HTTP with a SOAP payload, etc.
c. RPC mechanisms like DCOM / CORBA
d. Custom
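The weakest-link point can be checked with a few lines once you have a measured rate for each stage. In this sketch the stage names and rates are hypothetical, with the message bus set to the 10 per sec figure used above:

```python
def system_throughput(stage_rates: dict) -> tuple:
    """Return (bottleneck_stage, rate): a chain of stages runs no faster
    than its slowest stage."""
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]


if __name__ == "__main__":
    # Hypothetical measured rates, in items per second.
    rates = {"order_validation": 500, "message_bus": 10, "order_processing": 200}
    stage, rate = system_throughput(rates)
    print(f"limited to {rate}/sec by {stage}")
```

This assumes every item passes through every stage in sequence; stages that run in parallel would need a different model.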

7. Decide how failures / upgrades will be addressed – Once your application is broken up into pieces, individual pieces can go down or might need to be upgraded. Knowing these requirements in advance will let you judge whether users will find the system acceptable. If entire transactions have to be stopped every time one of the computers fails, good luck selling your system to banking customers. However, a backup system going down might not be so catastrophic, as long as the data is good and safe.
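A back-of-the-envelope availability estimate shows why this matters more as the number of pieces grows. The sketch below assumes failures are independent, which real systems rarely honour, and the 99% per-machine figure is only an example:

```python
import math


def serial_availability(availabilities) -> float:
    """Availability when every component must be up for a transaction."""
    return math.prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """Availability when any one of `replicas` identical copies suffices."""
    return 1 - (1 - availability) ** replicas


if __name__ == "__main__":
    # Five machines at an assumed 99% each, all required at once.
    print(f"all five required: {serial_availability([0.99] * 5):.4f}")
    # The same machine with one standby copy.
    print(f"one plus standby:  {redundant_availability(0.99, 2):.4f}")
```

Five machines that must all be up give you roughly 95% overall, noticeably worse than any single one of them, while a standby copy pushes a single piece to about 99.99%.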

8. Decide what your data storage is going to look like – Having data storage as the last piece of your design might be blasphemous for some folks. However, my rationale is that this allows you to look into all other aspects of your system without being burdened by DB-spawned, acronym-based designs or being dictated by a single design philosophy revolving around RDBMS, object databases and such.

a. Decide what role data storage will play – It is common for small implementations to use a central DB server in a central role, making use of its multitude of facilities for transactional updates, automatic data verification and scripting support. However, not all designs can afford to have a central database at their core, simply because scaling this complex piece of software might prove to be extremely costly and complicated. In one case, we were able to break all our scale constraints by not keeping the DB central to our transactions. Instead, the DB served simply as a dump, and in-memory, custom-coded systems provided all the speed we ever required. (This was done for equity trading.)
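To give a feel for the "DB as a dump" idea, here is a minimal write-behind sketch. It is not the system described above, just an illustration: the class name, the `trades` table and the batch size are all hypothetical, and SQLite stands in for whatever database you actually use.

```python
import sqlite3


class WriteBehindStore:
    """Serve reads and writes from memory; dump to the DB in batches."""

    def __init__(self, conn: sqlite3.Connection, batch_size: int = 100):
        self.conn = conn
        self.batch_size = batch_size
        self.data = {}     # authoritative in-memory state
        self.pending = []  # changes not yet written to the database
        conn.execute("CREATE TABLE IF NOT EXISTS trades (key TEXT, value REAL)")

    def put(self, key: str, value: float) -> None:
        self.data[key] = value
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def get(self, key: str):
        # Reads never touch the database.
        return self.data.get(key)

    def flush(self) -> None:
        if not self.pending:
            return
        with self.conn:  # one transaction per batch
            self.conn.executemany("INSERT INTO trades VALUES (?, ?)", self.pending)
        self.pending.clear()


if __name__ == "__main__":
    store = WriteBehindStore(sqlite3.connect(":memory:"), batch_size=2)
    store.put("AAA", 10.5)
    store.put("BBB", 20.0)  # second write triggers a batch flush
    print(store.get("AAA"))
```

The trade-off is plain: anything still sitting in `pending` is lost if the process dies, so a real design would need its own answer for durability.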

9. Re-evaluate – Using each of the points listed, re-evaluate / revisit the other characteristics. E.g. file-based storage might make things easy on the disk, but how are backups and failures handled? Will the message passing system handle blobs of data bigger than x KB? Will transactional writes on the DB scale for the number of entries you want to make?

Every decision made will have a bearing on the others. Balancing all of these with respect to each other is what makes the architecture perfect.

Oh !!! and –

10. Keep measuring all the time – The main reason one has to deal with scale is that we find choke points. Measurements are the only way one can know where we stand with respect to the scale numbers. One of the major reasons things go drastically bad with such big systems is that the assumptions made at the outset no longer hold true by the time the system is built and ready to go. Keep a watch on your assumptions and keep measuring to ensure the numbers hold true. Perhaps a requirement changed somewhere or someone decided to go differently. The only way to be sure is to keep measuring.
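Keeping watch on assumptions can be as simple as writing them down next to fresh measurements and flagging the ones that have drifted. A minimal sketch, where the metric names, the values and the 20% tolerance are all assumptions for illustration:

```python
def check_assumptions(assumed: dict, measured: dict, tolerance: float = 0.2) -> list:
    """Return the names whose measured value is missing or has drifted
    beyond `tolerance` (as a fraction) of the assumed value."""
    drifted = []
    for name, expected in assumed.items():
        actual = measured.get(name)
        if actual is None or abs(actual - expected) > tolerance * abs(expected):
            drifted.append(name)
    return drifted


if __name__ == "__main__":
    assumed = {"events_per_sec": 1000, "bytes_per_event": 512}
    measured = {"events_per_sec": 1100, "bytes_per_event": 2048}
    print(check_assumptions(assumed, measured))
```

Run against every new build or load test, a check like this turns "the assumptions no longer hold" from a late surprise into a routine report.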

Happy scaling !!!