When we started off on the Cisco product we were building, there were at least five of us with more than eight years of industry experience each. All of us had *used* databases extensively at some point in our careers too.
Even so, we still had our fair share of surprises once we had the code rolling. We had performance issues arising from doing what we thought was the normal way of using databases. Now why would something like that happen? After all, we were all experienced professionals, with so many computer texts under our belts.
The way I see it, the problem is caused by treating these systems as pure abstractions.
I see the database as something that will write data if I execute INSERT or UPDATE statements, and something I read normalized relational tables from using SELECTs. What could be simpler than that?
Heck, I say we oversimplified things. Nothing is ever as simple as it sounds. Systems always leak their abstractions, and the bigger the systems are, the more they leak. It pays to understand the leaks, the areas you have to step around so to speak, because these are the most frequently encountered gotchas and performance issues.
If these gotchas can be factored in during the design phase, a good deal of rework can be avoided towards the end, depending on how easy your code or logic is to rework (tip: avoid duplication) and how extensively the feature is used.
Here is the list of gotchas you had better learn before starting on your SQL Server development.
- Learn to check for connection leaks. This is the easiest thing to do if you know how to use Perfmon.
- Index your tables – This was a late addition, but people just don't learn. Without indexes on your tables, your queries are going to run forever and ever and ever. Think about how you would find a topic in a book without its index. This is what you force upon your database if you do not have indexes.
- Fields used in the WHERE clause of your queries must have indexes on them. Different databases have different rules on this, based on how they are written. But usually, if you are specifying only a single field in your WHERE clause, it is good to have an index on that field; if you are specifying multiple fields, it is good to have a composite index covering all the fields used in the query. Of course you cannot go on adding indexes willy-nilly, as they really affect write performance. But I would say go ahead and add the indexes, then trim them down to what you need based on your profile data and application behaviour.
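A quick way to see the effect: ask the database for its query plan before and after indexing the WHERE-clause column. The post is about SQL Server, but this sketch uses Python's built-in `sqlite3` (with a made-up `devices` table) so it runs anywhere; the same idea applies to SQL Server's estimated execution plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER, name TEXT, device_type TEXT)")
conn.executemany("INSERT INTO devices VALUES (?, ?, ?)",
                 [(i, f"dev{i}", "router" if i % 2 else "switch")
                  for i in range(1000)])

query = "SELECT * FROM devices WHERE device_type = 'router'"

# Without an index, the planner has to scan the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_before[0][3])   # something like "SCAN devices"

# Index the column used in the WHERE clause, and the plan changes.
conn.execute("CREATE INDEX idx_devices_type ON devices (device_type)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(plan_after[0][3])    # something like "SEARCH devices USING INDEX ..."
```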
- Avoid sub-queries and use joins instead – Joins take advantage of the mathematical basis of relational data sets and give the built-in query optimizer more information about efficient ways to run the query. This might be a more powerful way to optimize operations than anything else you come across.
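For illustration, here is the same question asked both ways: the sub-query form and the equivalent join. The tables (`devices`, `alarms`) are invented for the example, and `sqlite3` stands in for SQL Server; the rewrite itself is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE devices (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE alarms  (id INTEGER PRIMARY KEY, device_id INTEGER, severity INTEGER);
INSERT INTO devices VALUES (1, 'core-1'), (2, 'edge-1'), (3, 'edge-2');
INSERT INTO alarms  VALUES (10, 1, 5), (11, 3, 2);
""")

# Sub-query form: the inner query may be evaluated as a separate step.
sub = conn.execute("""
    SELECT name FROM devices
    WHERE id IN (SELECT device_id FROM alarms WHERE severity >= 2)
""").fetchall()

# Equivalent join: expressed as one relational operation the optimizer
# can reorder and apply indexes to.
join = conn.execute("""
    SELECT DISTINCT d.name
    FROM devices d JOIN alarms a ON a.device_id = d.id
    WHERE a.severity >= 2
""").fetchall()

assert sorted(sub) == sorted(join)  # same answer, different plan
```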
- Application code should acquire connections just before they are required and release them immediately afterwards. (Connection caching can be done by the system or your own app framework, but not by everyday app code.)
- The rule above does not apply if you know you will have to acquire and release connections in a loop or the equivalent, of course.
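The acquire-late, release-early pattern can be sketched like this. `DB_PATH` and the `devices` table are placeholders for your real connection string and schema, and `sqlite3` stands in for your actual driver; the shape is what matters.

```python
import sqlite3
from contextlib import closing

DB_PATH = ":memory:"  # placeholder for a real DSN / connection string

def get_device_count() -> int:
    # Acquire just before use; `closing` guarantees conn.close() runs
    # immediately after, even if the query raises an exception.
    with closing(sqlite3.connect(DB_PATH)) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS devices (id INTEGER)")
        conn.execute("INSERT INTO devices VALUES (1)")
        return conn.execute("SELECT COUNT(*) FROM devices").fetchone()[0]
```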
- Know about stored procedures, or rather understand why dynamic SQL can rarely match the performance of stored procedures. They are the fastest things around when it comes to working with data stored in the RDBMS. They save on communication costs with the database, and save all sorts of CPU costs from parsing and creating execution plans that dynamic queries incur. If you have a lot of queries being fired, this is a definite must.
More importantly, it keeps code (especially C++) from bloating up with ADO wrappers, EOF checks, DB exception handling etc. just to get one piece of data. Further, it helps you keep all DB-related code in one place, nice and clean. However, do keep in mind that if you plan to scale your database, as in one DB being connected to from many clients, piling all that data crunching onto the SQL Server end will be detrimental to the CPU scalability of your DB box. You can still use stored procedures for all your data retrieval and individual update purposes though.
- Always use disconnected recordsets instead of live ones. A disconnected recordset just collects the batch of data you want, instead of maintaining a live connection with the DB and retrieving pieces of data as and when you request them. Live connections cause a lot of DB round trips and crazy amounts of page faults. I once debugged a system whose main performance factor was a crazy number of page faults, finally zeroed in on the loop that fetched data from connected live recordsets.
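In modern APIs the disconnected-recordset idea survives as "pull the whole result set across in one go, then process locally". A minimal sketch, again using `sqlite3` and an invented `readings` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(100)])

cur = conn.cursor()
cur.execute("SELECT id, value FROM readings")

# Disconnected style: fetch everything in one round trip, release the
# cursor, and only then do the per-row processing in application code.
rows = cur.fetchall()
cur.close()

total = sum(value for _, value in rows)
```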
- Learn how to monitor your database. You should be looking at the number of physical and logical page reads and writes caused by your queries; the SQL Profiler tool will show you these. Use them to find the queries that are most expensive in this regard, and optimize those away.
- Know about the famed SQL query plan cache. Basically SQL Server parses your queries, creates an execution plan and keeps these around for later, because this entire process is expensive. But unfortunately for us, SQL Server does not know how to recognize that two queries differing only in their parameter values are the same, unless you EXPLICITLY help it by putting placeholders in your queries and passing the individual parameters separately. These are called parameterized queries. Use these at the very least (ideally you would be using stored procedures) if you do not want SQL Server to keep eating up memory like crazy every time you issue a query. If SQL Server has no free memory to eat up, it will dynamically recreate your execution plans, and in doing so eat up your CPU every time instead. NOTE: Eating up memory just because memory is available is not a free lunch either. The dynamics by which SQL Server frees this memory up are not perfect. In our case we found that the system got there faster, and decided to enlarge the dynamic page cache instead of reporting an acute memory shortage. Moreover, the tighter an application is in terms of the memory it touches, the faster it performs. If you do not know why this is so, please refer to the excellent video presentation by Herb Sutter. Here is a list of commands that can help you with this particular task.
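The contrast between string-built SQL and placeholder-based SQL looks like this. The example uses `sqlite3`'s `?` placeholders and an invented `settings` table; with SQL Server you would get the same effect via parameterized commands (or `sp_executesql`), and every distinct literal-embedded string would otherwise cost you a fresh cached plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (name TEXT, value TEXT)")

# BAD: each distinct literal produces a distinct statement text, so the
# server treats every call as a brand-new query to parse and plan.
for i in range(3):
    conn.execute(f"INSERT INTO settings VALUES ('timeout{i}', '{i}')")

# GOOD: one statement text with placeholders; the statement is compiled
# once and reused, and the values travel separately as parameters.
for i in range(3):
    conn.execute("INSERT INTO settings VALUES (?, ?)", (f"retry{i}", str(i)))

count = conn.execute("SELECT COUNT(*) FROM settings").fetchone()[0]
```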
- Learn to put as MUCH of your writes as possible under one transaction. This prevents log flushes due to individually separate writes. Putting many changes under one transaction also reduces the number of implicit transactions that SQL Server starts for you, which reduces the load on SQL Server. You can watch the effect of your changes from Perfmon – SQL Server: Transactions/sec, Database: Log Flushes/sec and Database: Log Flush Wait Time (time spent waiting for the disk to flush write logs alone). Depending on the nature of your writes, you WILL bring down the number of locks held overall too.
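The batching pattern in code: wrap the whole set of writes in one explicit transaction instead of letting each statement become its own implicit transaction. Sketched with `sqlite3` and a made-up `log` table; the same applies to `BEGIN TRAN ... COMMIT` in T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (seq INTEGER, msg TEXT)")

rows = [(i, f"event {i}") for i in range(500)]

# One explicit transaction around all 500 writes: one log flush at
# COMMIT, instead of 500 implicit single-statement transactions.
with conn:  # BEGIN ... COMMIT (or ROLLBACK if an exception escapes)
    conn.executemany("INSERT INTO log VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM log").fetchone()[0]
```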
- Reduce the scope of your locks – Often folks do not know about isolation levels (such as SERIALIZABLE), which are essentially system-provided global locking schemes. When you have a global locking scheme, everything else stays locked out, which is extremely bad in terms of the amount of work the DB has to do to manage all those locks and deal with contention. Reducing the scope of your locks improves your throughput by increasing the number of concurrent DB operations that are possible. So avoid those methods that start one transaction and touch all those different tables along the way. Instead keep your transactions short and simple. (Google isolation levels, locking hints, implicit transactions etc. with respect to your database to know what these are and how they affect your queries.)
- Get all data in one shot – Sometimes people tend to write clean modular functions that fetch one piece of data at a time from the database, e.g. GetSnmpSetting, GetThresholdValues, GetSnmpTimeouts, GetDeviceType etc. However, it would be easy to write a single stored procedure that returns all these values in one go. This IS much more efficient than writing multiple getters, even though the getters are more modular.
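One way to picture the single-round-trip version (the `config` table and names below are invented for illustration, and `sqlite3` again stands in for a stored-procedure call):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE config (name TEXT PRIMARY KEY, value TEXT);
INSERT INTO config VALUES
  ('snmp_setting', 'v2c'),
  ('threshold', '85'),
  ('snmp_timeout', '5000'),
  ('device_type', 'router');
""")

def get_device_config(conn) -> dict:
    # One query (or one stored-procedure call) replaces four separate
    # getters, four round trips, and four statement executions.
    rows = conn.execute(
        "SELECT name, value FROM config WHERE name IN "
        "('snmp_setting', 'threshold', 'snmp_timeout', 'device_type')"
    ).fetchall()
    return dict(rows)

cfg = get_device_config(conn)
```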
- If you have a high number of DB writes, put the log files on a separate RAID disk. Relational databases use the logs exclusively to maintain data integrity, keeping a running record of ALL modifications to the database, which can then be used to restore the data in case of failures. This means frequent writes to that disk, and the faster you make them, the more throughput you will get from your disk writes.
- WARNING: Desperate optimization – Avoid writing to the DB if you can. Read, compare, and write only the changed rows instead. This may sound obvious, but no one really thinks about it. When a database is available, everyone tends to follow a load/process/save pattern. We never thought about this either, and are adding it in now that we have found it necessary. You could do the same, but do profile your code up front to find out whether you need it. You can do this by writing to the DB from one function and finding how much time is spent in that function as a percentage of the overall time – the Intel profiler is a good tool for this, and you can use it free for 30 days. More than enough time to find where your performance choke points are.
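A sketch of the read-compare-write idea, assuming a simple `settings` table (invented for the example) where most saves change only a value or two:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (name TEXT PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO settings VALUES (?, ?)",
                 [("timeout", "30"), ("retries", "3"), ("mode", "auto")])

def save_changed(conn, new_values: dict) -> int:
    """Read-compare-write: issue UPDATEs only for values that differ."""
    current = dict(conn.execute("SELECT name, value FROM settings"))
    changed = 0
    for name, value in new_values.items():
        if current.get(name) != value:
            conn.execute("UPDATE settings SET value = ? WHERE name = ?",
                         (value, name))
            changed += 1
    conn.commit()
    return changed

# Only "retries" actually differs, so only one UPDATE hits the database
# instead of a blind save of all three values.
writes = save_changed(conn, {"timeout": "30", "retries": "5", "mode": "auto"})
```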
- WARNING: Desperate optimization – This can get you into hot water – You could cache some data yourself if it avoids your going to the database every now and then. Maintaining a cache is really hard, so you could use memcached or TimesTen (unfortunately only available for Oracle now). Disclaimer: this might pay dividends only if your systems are really big, the code is clean, and you have only one entry and exit point for all your data changes. You might not require it if you are not building really big systems. DISCLAIMER: I have not used either memcached or TimesTen – let me know what your mileage was. I doubt you should be using this unless you are building really big distributed systems with lots of data going in and out of the database, as in a big Internet-scale application.
- WARNING: Might not be very useful – Use application-maintained pooled connections to the database. This would be much faster than getting a connection from the DSN every time you require one, even though ODBC pooling might already make things faster for you. However, this applies only to cases where connections are dropped and created extremely fast, say 2000 or more connections churned every minute. (Tip: this could be why the connection pooling design shows the best results in an MTS-like environment, where each object coming up gets its own connection every time it requires one and then drops it again.)
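The essence of an application-side pool is just a thread-safe container of idle connections, handed out and taken back instead of opened and dropped. A deliberately minimal sketch (no health checks, no timeouts, `sqlite3` standing in for the real driver):

```python
import queue
import sqlite3

class ConnectionPool:
    """Minimal application-side pool: hand out idle connections and
    take them back, instead of opening a new one per request."""

    def __init__(self, db_path: str, size: int = 4):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(
                sqlite3.connect(db_path, check_same_thread=False))

    def acquire(self):
        return self._idle.get()   # blocks if the pool is exhausted

    def release(self, conn):
        self._idle.put(conn)      # return the connection for reuse

pool = ConnectionPool(":memory:", size=1)
conn = pool.acquire()
pool.release(conn)
conn2 = pool.acquire()  # the same physical connection is handed back out
```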
Note: Historic data archival – If your application tends to collect a lot of data (tens of thousands of records a day, running into millions) and keep it around, you are dealing with a different sort of application space. Here everything gets turned on its head, and this is deep, deep DB domain, where small tweaks start to matter: the type of indexes and your disk block sizes start to matter, and the process you use to keep your table sizes small will matter too. I'm not qualified enough to advise on such applications, and you are better off consulting a DB expert experienced in the field of data warehousing.
If there is one thing I have learned from all these years of experience, it is that performance is never accidental. You have to painstakingly pull out all those stops, one by one, to extract the level of performance you can live with. Of course, you have to know what those performance stops are before you can pull them out. Knowing them early will help you keep the remaining hairs you have.
PS: All these tips and more are listed on any page from the first Google hit on SQL Server performance tips. However, folks just don't care enough to do a preliminary search before they start typing away. So why write it all again? Because I keep seeing these issues over and over again.