Thursday, February 19, 2009

SQL Server Cleaning Memory Caches

If you’re ever running a trace and you notice the occasional sproc taking unexpectedly longer than it should, such as doing 20 reads but taking 500 ms, you most likely have an IO bottleneck. The frequency determines the severity of the problem. For example, if a sproc does 20 IOs and returns in under 1 ms 999 times out of 1000, with the one outlier taking 500 ms, that’s expected behavior. The reason is that on most systems you’re not keeping all your data in memory, and the call occasionally needs to go down to disk. SQL Server does a good job of read-aheads on scans to avoid this problem, but not such a good job when it comes to seeks with key and RID lookups, so it’s usually on seeks that this is most common. Continuing with the trace analysis: if you’re seeing this problem once in every 10 calls, then you’ve got an IO problem, meaning you’re not able to keep enough data in memory.
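
If you’d rather pull these outliers out of a server-side trace file than eyeball them in Profiler, a query along these lines works. This is just a sketch: the file path is hypothetical, and note that Duration comes back in microseconds when read through fn_trace_gettable on 2005 and later:

-- Find calls with few reads but long elapsed time (likely waiting on disk)
SELECT TextData, Reads, Duration / 1000 AS duration_ms, StartTime
FROM fn_trace_gettable('C:\Traces\MyTrace.trc', DEFAULT)  -- path is hypothetical
WHERE Reads <= 20            -- few logical reads...
  AND Duration >= 500000     -- ...but 500 ms or more elapsed (microseconds)
ORDER BY Duration DESC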

There are three ways to approach this issue. The first is to buy more memory. By increasing the memory footprint, you allow for more data in the buffer pool and less disk access. The second is to speed up your disk IO: with faster drives and/or more spindles in a RAID array, access times become much faster, and maybe the occasional 500 ms call drops to 50 ms. Both of these approaches have a hardware cost associated with them, which often isn’t an option for a budget-strapped shop such as a startup.

So how does the budget-tight shop handle this? By making sure the memory the system is already using is used as efficiently as possible. Over the next few blogs I’ll be discussing different strategies. In this blog I’ll talk about reviewing your memory clerks/cache stores and looking for areas that can be trimmed and maintained.

Getting into the internals of these memory stores is beyond the scope of this article; please reference http://blogs.msdn.com/slavao/archive/2005/02/19/376714.aspx if you want a deeper understanding. When troubleshooting, think of these stores as carved-out pieces of memory where SQL Server stores different types of data. The most popular one is the procedure cache, where plans are kept. To get off subject for a second, the old procedure cache of SQL Server 2000 is now broken up into two cache stores, “CACHESTORE_SQLCP” and “CACHESTORE_OBJCP”: one for ad-hoc queries and one for object code such as stored procedures and functions. SQL Server tracks its different cache stores using costing and pruning algorithms, and it’s these pruning processes that ensure the sizes don’t grow out of control. Unfortunately these algorithms aren’t what I consider optimal, and therefore I regularly clean them manually.
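
As a quick way to see those two plan cache stores side by side, sys.dm_os_memory_cache_counters breaks the caches out by type. A sketch (column names here are the 2005/2008 ones):

-- Compare the ad-hoc plan cache with the object plan cache
SELECT name, type, single_pages_kb, entries_count
FROM sys.dm_os_memory_cache_counters
WHERE type IN ('CACHESTORE_SQLCP', 'CACHESTORE_OBJCP')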

With all the explanation out of the way, let’s talk about getting at these stores and cleaning them up. The easiest way to view this data is to query the DMV sys.dm_os_memory_clerks. BOL has a decent explanation of this, but I want to point out the difference between single_pages_kb and multi_pages_kb. Single pages are those of 8 KB, or standard data pages; these reside within the SQL Server buffer pool. Multi pages are allocations larger than 8 KB, and reside in SQL Server’s reserved memory (under 2000 this was called the Memory To Leave area). Our concern in the query below is the single pages, i.e. the buffer pool, hence the sort.



-- Largest consumers of buffer-pool (single-page) memory first
SELECT *
FROM sys.dm_os_memory_clerks
ORDER BY single_pages_kb DESC
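
If you want to see where each clerk’s memory lives, here’s a variation that splits the buffer-pool (single-page) allocations from the reserved (multi-page) allocations per clerk type; it’s just a convenience sketch over the same DMV:

SELECT type,
       SUM(single_pages_kb) AS total_single_pages_kb,   -- buffer pool
       SUM(multi_pages_kb)  AS total_multi_pages_kb     -- reserved / Memory To Leave
FROM sys.dm_os_memory_clerks
GROUP BY type
ORDER BY total_single_pages_kb DESC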

Once you review the output you need to decide whether any of these cache stores are larger than you’d like. For example, in July 2007 I ran into a memory leak with the TokenAndPermUserStore cache, where this cache was consuming 4 GB of the buffer pool and growing. That issue was fixed in an SP2 cumulative update. What led me down the trail to this blog was that this morning the Lock Manager cache was 800 MB. Although the latter is not considered a memory leak by Microsoft, my personal feeling is it shouldn’t be that large, and I’d rather give the memory back to the buffer pool for better use.
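
If you want to keep an eye on one suspect cache over time, a simple filter on the clerk name does the job. A sketch for the TokenAndPermUserStore example above; run it on a schedule and watch the trend:

-- Track one suspect cache; swap in whichever clerk you're watching
SELECT name, single_pages_kb, multi_pages_kb
FROM sys.dm_os_memory_clerks
WHERE name = 'TokenAndPermUserStore'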

In order to clear a particular cache you want to leverage the DBCC FREESYSTEMCACHE command, parameterized with the name of the cache. The name is the value returned in the “name” column in the query above. For example, if I wanted to clear the Lock Manager cache my command would look like this:

DBCC FREESYSTEMCACHE ('Lock Manager : Node 0')
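
A quick way to confirm the cache actually released its memory is to re-run the clerk query filtered to that store; the size should drop to near zero and then slowly climb again as SQL rebuilds it:

-- Verify the freed store shrank (one Lock Manager clerk per node)
SELECT name, single_pages_kb
FROM sys.dm_os_memory_clerks
WHERE name LIKE 'Lock Manager%'
ORDER BY single_pages_kb DESC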


Once your store is cleared, SQL will start rebuilding that store as if the service was restarted. Before leaving the subject, a couple of notes. First, the command above works on Node 0, which for all intents and purposes maps to a CPU node on a NUMA system. There are some variations in how NUMA works, and I’d suggest reading the following blog:
http://blogs.msdn.com/psssql/archive/2008/01/24/how-it-works-sql-server-2005-numa-basics.aspx
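
On NUMA hardware you’ll see one Lock Manager clerk per memory node, so you may need to issue a FREESYSTEMCACHE per node. A sketch to list what’s there:

-- List the Lock Manager clerks across memory nodes
SELECT name, memory_node_id, single_pages_kb
FROM sys.dm_os_memory_clerks
WHERE name LIKE 'Lock Manager : Node%'
ORDER BY memory_node_id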

Second, the two largest caches you’ll see are those that represent the procedure cache, “CACHESTORE_SQLCP” and “CACHESTORE_OBJCP”. If your system is architected properly then all calls should be through an object, but you’ll often see that your ad-hoc cache (CACHESTORE_SQLCP) is larger. This is because the ad-hoc cache is often full of system statements, primarily run in msdb. It makes sense to only clear the ad-hoc cache, for three reasons. First, the msdb calls very rarely leverage the cache, since they’re dynamic based on timestamps. Second, if you clear the object cache all your objects will require recompilation, which hurts your system both in resources used and in the chance of parameter sniffing occurring. Finally, your object store has a ceiling based on how many stored procedures you have on your server.
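
Clearing just the ad-hoc side is a one-liner; ‘SQL Plans’ is the clerk name that corresponds to CACHESTORE_SQLCP, so this leaves your compiled object plans in CACHESTORE_OBJCP untouched:

-- Free only the ad-hoc plan cache, not the object (sproc/function) plans
DBCC FREESYSTEMCACHE ('SQL Plans')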
