Some time back I encountered a scenario in which a chain of stored procedures was being called recursively, and this was eating into our performance (90% CPU time). Suddenly an obscure micro-issue – whether to use table variables or temp tables – became important.
After ferreting out some good sources of information comparing the two, we managed to reduce the CPU consumption by
  • Making all our stored procedures use temp tables rather than table variables. This made a difference for us because our procedures were being invoked frequently and, in our case, the use of table variables meant the procedures were recompiled on every invocation (a sketch of the change follows below).
  • The extra disk writes that temp tables bring had minimal impact for us, because very little data was actually involved.
Ultimately the biggest improvement came from changes to the algorithm, to make sure the procedures were not needlessly invoked, or at least that parts of them did not run (the recursive algorithm was flattened out).
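To make the first bullet concrete, here is a minimal sketch of the kind of mechanical change involved. The procedure and table names (GetActiveItems, Items) are made up for illustration; only the intermediate storage changes, and the effect on recompilation should be measured rather than assumed.

-- Before (hypothetical): intermediate rows held in a table variable.
CREATE PROCEDURE GetActiveItems_TableVar
AS
BEGIN
    DECLARE @work TABLE (ItemId INT PRIMARY KEY, Score INT);

    INSERT INTO @work (ItemId, Score)
    SELECT ItemId, Score FROM Items WHERE Active = 1;

    SELECT ItemId FROM @work WHERE Score > 10;
END
GO

-- After: the same intermediate rows held in a temp table.
CREATE PROCEDURE GetActiveItems_TempTable
AS
BEGIN
    CREATE TABLE #work (ItemId INT PRIMARY KEY, Score INT);

    INSERT INTO #work (ItemId, Score)
    SELECT ItemId, Score FROM Items WHERE Active = 1;

    SELECT ItemId FROM #work WHERE Score > 10;

    DROP TABLE #work;
END
GO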

Learning: If micro-optimizations start to look important, it is time to look for improvements in the algorithms that are used.
NOTES – Information gleaned from the exercise
  1. In most cases a table variable is all you will need.
  2. Table variables cause FEWER recompilations than temporary tables (source: MSDN).
  3. Table variables appear to be scoped at the stored-procedure level, not the block level. So if you declare a table variable inside a WHILE loop, it retains data from earlier iterations (painfully discovered via a bug – a small repro follows this list).
  4. Indexes cannot be created on table variables, apart from the primary key specified when the variable is declared.
  5. A CREATE INDEX command can cause query recompilation.
  6. Table variables are largely held in memory and involve little disk I/O; transaction rollbacks do not apply to them. On the upside, no locks are acquired when using table variables.
  7. Table variables cannot have statistics created on them, so for huge data sets they could be inefficient.
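A tiny repro of note 3, with made-up names, run as a single batch on SQL Server:

-- The table variable is scoped to the whole batch/procedure, not to the loop
-- body, so rows accumulate across iterations.
DECLARE @i INT;
SET @i = 0;

WHILE @i < 3
BEGIN
    DECLARE @loop_data TABLE (Value INT);    -- declared "inside" the loop

    INSERT INTO @loop_data (Value) VALUES (@i);

    SELECT @i AS Iteration, COUNT(*) AS RowsVisible FROM @loop_data;
    -- RowsVisible comes back as 1, 2, 3 – not 1 every time, as you might expect.

    -- DELETE FROM @loop_data;               -- the fix: clear it on every pass

    SET @i = @i + 1;
END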

Want to know more, but too lazy to read all the white papers?

Read a short story of this in action from none other than Joel Spolsky, trying it out on his own product.

The piece of code we are about to discuss was used in production software, and I was tasked with reviewing the patch. See if you can spot *any* issues with the code. Later we will take a look at the exact observations, which are hugely interesting and instructive regarding system behavior.

The Query

This query keeps the top n records of a particular type and deletes all other rows of the same type from a table in a Sybase database.

for (int i = 0; i < ListOfTypes; i++)
{
    ExecuteQuery("delete from table where PrimaryKey NOT IN " +
                 "(select top n PrimaryKey from table where Id = " + i + ") AND Id = " + i);
}

Observation – No data to be deleted in the database

  1. CPU consumption during query execution is negligible when there are no matching rows to delete. The fact that this code issues multiple dynamic queries that have to be parsed does not seem to be much of an issue for Sybase.
  2. 2000 queries took around 3 seconds overall, at around 12% CPU, on a Xeon box.

Observation – Delete finds rows for deletion

  1. With data to be deleted, 30 iterations of the loop took around 5 minutes; extrapolated to 2000 iterations, that is over 5 hours of deletes.
  2. The CPU consumption figure remained the same.

Why are Database Deletes So Slow?

Relational databases track every change using logs, also called transaction logs because they record all the transactions – that is, the changes – made to the database. The ACID properties of transactions are guaranteed using these logs. So when you make a change to the database, a log record is written to disk.

The actual change is made only in memory and is not written to disk immediately, for efficiency's sake. The log, however, is written promptly and without fail, so that if the machine crashes or a rollback is required, it can be used to reconstruct the state of the database with minimal loss of data. This means that every individual delete you issue makes the disk spin, and a spinning disk is horribly slower than almost anything else you can think of.
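A crude way to see this in action on SQL Server / MSDE (the command below is SQL Server specific and EventTable is a made-up example table; Sybase has its own equivalents):

DBCC SQLPERF(LOGSPACE);               -- note "Log Space Used (%)" for your database

DELETE FROM EventTable WHERE Id = 1;  -- some deletes against the hypothetical table

DBCC SQLPERF(LOGSPACE);               -- the used percentage grows: every deleted row
                                      -- produced log records that went to disk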

How slow can the disks be?

Numbers from a talk by Jeff Dean of Google (who works on both MapReduce and BigTable), as reported here:

Comparative slowness (aka latency) in terms of CPU cycles

Floating point calculation           = 0.25 to 3 cycles
L1 cache access                      = 3 cycles
L2 cache access                      = 14 cycles
DRAM access                          = 200 cycles
Flash                                = 3 million cycles
Disk                                 = 15 million cycles

Comparative slowness (aka latency) in absolute time

L1 cache reference                           0.5 ns
Branch mispredict                              5 ns
L2 cache reference                             7 ns
Mutex lock/unlock                            100 ns
Main memory reference                        100 ns
Compress 1K bytes with XXX                10,000 ns
Send 2K bytes over 1 Gbps network         20,000 ns
Read 1 MB sequentially from memory       250,000 ns
Round trip within same datacenter        500,000 ns
Disk seek                             10,000,000 ns
Read 1 MB sequentially from network   10,000,000 ns
Read 1 MB sequentially from disk      30,000,000 ns
Send packet CA->Netherlands->CA      150,000,000 ns

Can you speed up the disks?

SCSI disk array on a server

You bet!! We tried to speed up the whole experiment by moving the tests from an IDE-based system (with a 7,500 rpm drive) to a system with a single 15k rpm SCSI disk plus a SCSI controller with no onboard memory (the picture shows six disks – no, we used just one of them).

The results from moving to the SCSI system –

  1. The delete time improved by a factor of 4-5, and the worst-case time dropped from over 5 hours to around 133 minutes.
  2. However, with the SCSI system in use, the CPU consumption during query execution also jumped by the same 4-5 times: the engine blocks on the disks far less often, so it gets through correspondingly more query execution work.

As you can see, it would be easy for us to calculate how many disks we would need to bring the deletes down to a manageable time. I might have to go with a couple of SCSI disks in a RAID configuration to achieve the speeds I want. At the same time, the database engine's CPU consumption would have to be addressed separately, so that it does not have to do so much work finding the rows to be deleted.

Is there any way to improve this?

  1. Use Transactions –

    The SQL Server engine has to force the log to disk each time a transaction commits, and under autocommit every single statement is its own transaction. (I am not sure how this plays out with the Java-plus-Sybase combination with which this query was tested.) So if you ensure all your changes are made as part of a single transaction, the disks are hit far less than they would be otherwise (see the sketch after this list).

  2. Do not use a database for real-time stuff
    An even better solution is not to rely on the database at all for this, and instead do it in a specialized in-memory component – a thread-safe hash or something equivalent – using the database only as a slow backing store for failure recovery. This would give throughput incomparably higher than any RDBMS/disk-based solution.
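A minimal sketch of suggestion 1 in T-SQL, against a made-up EventTable with the same PrimaryKey and Id columns as the original query. The point is that the log is forced to disk once at COMMIT instead of once per autocommitted DELETE; the trade-off is that a single large transaction holds its locks for longer.

BEGIN TRANSACTION;

    DELETE FROM EventTable
    WHERE Id = 1
      AND PrimaryKey NOT IN (SELECT TOP 20 PrimaryKey FROM EventTable WHERE Id = 1);

    DELETE FROM EventTable
    WHERE Id = 2
      AND PrimaryKey NOT IN (SELECT TOP 20 PrimaryKey FROM EventTable WHERE Id = 2);

    -- ... and so on for the remaining types, or drive it from a WHILE loop ...

COMMIT TRANSACTION;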

My suggestion to improve CPU (untested)

  1. Move the entire loop into a stored procedure
  2. Run a select to get the top 20 and extract the ID of the bottom one
  3. Run a delete to remove every row whose key lies beyond the one extracted. What are the advantages? You avoid possible query compilation for 2000 separate queries and, at the same time, avoid the possibly highly inefficient NOT IN operator. This needs to be verified, though (a sketch follows this list).
  4. Remove all foreign key references to the current table – though of course that goes against the grain of the RDBMS concept itself. For desperate situations, however, this might work ….
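Here is a rough, untested sketch of steps 1-3 as a single stored procedure. The procedure, table and column names (prune_events, EventTable, PrimaryKey, Id) are invented, the syntax is SQL Server flavoured T-SQL (Sybase would differ slightly), and it assumes the 20 rows to keep are the ones with the smallest keys – flip the ORDER BY and the final comparison if it is the other way round.

CREATE PROCEDURE prune_events @type_count INT
AS
BEGIN
    DECLARE @type_id INT, @boundary_key INT;
    SET @type_id = 0;

    WHILE @type_id < @type_count
    BEGIN
        SET @boundary_key = NULL;

        -- Step 2: take the top 20 keys for this type and extract the bottom one.
        SELECT @boundary_key = MAX(PrimaryKey)
        FROM (SELECT TOP 20 PrimaryKey
              FROM EventTable
              WHERE Id = @type_id
              ORDER BY PrimaryKey) AS keep_set;

        -- Step 3: delete everything beyond the boundary – no NOT IN needed.
        IF @boundary_key IS NOT NULL
            DELETE FROM EventTable
            WHERE Id = @type_id AND PrimaryKey > @boundary_key;

        SET @type_id = @type_id + 1;
    END
END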

What was implemented in the end?

Well, the production software runs this delete sequence every 30 minutes, and the number of events the system claims to support is only up to 25 per hour. At that rate there is no pile-up and no big deletes ever happen. So we don't have a problem!!

That's easy – you start by looking at which parts of your code are the most expensive, and you keep drilling down into the costly calls until you find the piece of code that really is the culprit. Then you optimize it. Easy, isn't it?

If you do not have ways to do this, your tool-set sucks.

[I use the Intel Thread Checker]

The MSDE workload governor kicks in whenever there are more than 8 concurrent queries running against the SQL Server engine. It then proceeds to introduce artificial delays into your query results, so that the performance of the entire system suffers.

We faced the same issue with the Cisco NetManager tool we were building, and in the end, after a long period of trial and error, we managed to cure ourselves of this persistent little “feature” of MSDE.

So how does one go about beating/disabling the Workload Governor?

  1. Optimize all DB operations.
  2. Reduce the number of round trips to the database by fetching data you know will be needed later in one shot – in a single stored procedure – or by performing multiple operations together (a sketch follows this list).
  3. Use a semaphore to limit the number of concurrent accesses to the database – make sure all DB access passes through this guard. The ADO methods one would have to guard are
    1. Execute
    2. Connection Open
    3. Connection Close
    4. Recordset – Open / Update / NextRecordset / MoveFirst
    5. BeginTransaction / CommitTransaction / Rollback
  4. Perform application-level caching of DB connections if the number of connection open/close operations is too high.
  5. Reduce the load on the SQL Server itself – how? By caching the most frequently required DB data in memory. Not all applications can do this, as data integrity becomes an issue, but done properly this is the single most important step and gives the biggest bang for your buck.

    However, please note that this is also the single most dangerous thing you can do to a DB application. Managing a cache and controlling concurrent updates is not an easy task. We managed it by using only volatile reads and interlocked updates; since no explicit locks are involved, this gives us a high level of concurrency.
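As an illustration of step 2 above, here is a hypothetical stored procedure (invented names: GetDeviceOverview, Devices, Interfaces, Events) that returns everything one screen needs as three result sets in a single round trip. On the client, ADO's NextRecordset walks through them, so one Execute call replaces three separate queries.

CREATE PROCEDURE GetDeviceOverview @DeviceId INT
AS
BEGIN
    -- Result set 1: the device itself.
    SELECT DeviceId, Name, Status
    FROM Devices
    WHERE DeviceId = @DeviceId;

    -- Result set 2: its interfaces.
    SELECT InterfaceId, Name, Speed
    FROM Interfaces
    WHERE DeviceId = @DeviceId;

    -- Result set 3: the most recent events raised against it.
    SELECT TOP 20 EventId, Severity, RaisedAt
    FROM Events
    WHERE DeviceId = @DeviceId
    ORDER BY RaisedAt DESC;
END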

That's it – these five small steps are all it takes to transform your slow-moving implementation into a killer app, at least as far as the DB is concerned. Verifying these five steps took six months of skunkworks-style investigation, but now that I have a working application with these features built in, I can confidently vouch for their effectiveness.