Tuesday, April 26, 2016

SQL Server Index index intersection with the "OR" clause



The two most hated things a DBA hates is a request for a backup from three years ago, and coming across a query with an OR clause in it.  Actually that’s not true as DBA’s we’re just a grumpy group, but I don’t have a month to list out all my dislikes, so let me focus on the “OR” clause within a query.  Unfortunately many devs think that covering your predicates with a single index works, and on small data sets in test it does, but as the index grows the duration times grow linear since a scan is used in an OR clause if not indexed correctly.  This article will talk about how to index for an OR clause.  Rather than trying to verbally explain this through writing, I’ll go through with example. 
Below I’ve created a table, index, and populated enough data to make my point.
create table testtable
(FOne int, ftwo int, FillerField char(1000))

create index aaa on testtable (fone, ftwo)



Declare @interval int
set @interval = 1

while @interval < 10000
Begin
insert into testtable
values (@interval, @interval + 50000, 'Hello')
set @interval = @interval + 1
End

As you can see the table has a covering index on Fields Fone and Ftwo.  I created FillerField so there is enough data to make my point.  The covering index works great if there is an “AND” clause.  Note when I say covering, I’m talking about the predicates and not the entire query.


set statistics io on

select * from testtable where fone = 100 and ftwo = 50100


If we look at the IO’s in the messages (Why we use Set Statistics IO on) you’ll see it did 3 reads, using a seek with a Rid Lookup.  



The Seek is shown in the Query Plan above.  The Seek is used because FOne is unique and the predicate requires both Fone and FTwo meet the value so SQL simply needs to find all the values on the far left of the index its using, in this case aaa(fone, ftwo).  So it will look for where fOne = 100, once all the records with this value are found (acutally as they’re searched) it will examine the second value in FTwo, to see which ones match in this case  50100.


However using “AND” differs from the “OR” clause.   it doesn’t work this way, it needs to find all the data from both fields that match fone or data matching ftwo.  So if we change the query to an “OR” clause it will move from a seek to a scan.


select * from testtable where fone = 100 or ftwo = 50100 





The reads have gone from 3 to 31.  Here the predicates are covered but since it can be either or it can’t do a seek it has to examine each row and column.  In fact in many situations SQL might find it cheaper to scan the underlying table in this case a RID. 

In order to get this efficient, and remove the scan, we need to do a seek against each field in the OR clause,  and leverage an intersection index lookup.   By adding the below index, in addition to the one already created you’ll see each field now has a seek, and the results are concatenated into temporary storage where dupes thrown out.  Once this distinct intermediate data is determined, it is used to lookup against the RID.



create index bbb on testtable(ftwo)


select * from testtable where fone = 100 or ftwo = 50100





The following example mimics something I recently saw in a production, the question is how something like this gets out into the wild. Usually this is a result of two circumstances.  First most testing is functional testing, and the difference in few milliseconds isn’t noticeable.  The second is even if you’re testing performance the index scan against a small set doesn’t stand out, and because it is covered the scan itself is small, its only noticeable after the dataset grows, or the volume of calls increase that it becomes noticeable.  





No comments: