The "Don't Repeat Yourself" principle suggests that you should reduce repetition. This week I came across a case where DRY should be thrown out the window. There are other cases as well (for example, scalar functions), but this one was an interesting one involving Bitwise logic.
Let's imagine the following table:
CREATE TABLE dbo.CarOrders
(
OrderID INT PRIMARY KEY,
WheelFlag TINYINT,
OrderDate DATE
--, ... other columns ...
);
CREATE INDEX IX_WheelFlag ON dbo.CarOrders(WheelFlag);
The "WheelFlag" bits represent the following options:
0 = stock wheels
1 = 17" wheels
2 = 18" wheels
4 = upgraded tires
So possible combinations are:
0 = no upgrade
1 = upgrade to 17" wheels only
2 = upgrade to 18" wheels only
4 = upgrade tires only
5 = 1 + 4 = upgrade to 17" wheels and better tires
6 = 2 + 4 = upgrade to 18" wheels and better tires
Let's set aside arguments, at least for now, about whether this should be packed into a single TINYINT in the first place, or stored as separate columns, or use an EAV model… fixing the design is a separate issue. This is about working with what you have.
To make the examples useful, let's fill this table up with a bunch of random data. (And we'll assume, for simplicity, that this table contains only orders that haven't yet shipped.) This will insert 50,000 rows of roughly equal distribution between the six option combinations:
;WITH n AS
(
SELECT n,Flag FROM (VALUES(1,0),(2,1),(3,2),(4,4),(5,5),(6,6)) AS n(n,Flag)
)
INSERT dbo.CarOrders
(
OrderID,
WheelFlag,
OrderDate
)
SELECT x.rn, n.Flag, DATEADD(DAY, x.rn/100, '20100101')
FROM n
INNER JOIN
(
SELECT TOP (50000)
n = (ABS(s1.[object_id]) % 6) + 1,
rn = ROW_NUMBER() OVER (ORDER BY s2.[object_id])
FROM sys.all_objects AS s1
CROSS JOIN sys.all_objects AS s2
) AS x
ON n.n = x.n;
If we look at the breakdown, we can see this distribution. Note that your results may differ slightly than mine depending on the objects in your system:
SELECT WheelFlag, [Count] = COUNT(*)
FROM dbo.CarOrders
GROUP BY WheelFlag;
Results:
WheelFlag Count
--------- -----
0 7654
1 8061
2 8757
4 8682
5 8305
6 8541
Now let's say it's Tuesday, and we just got a shipment of 18" wheels, which were previously out of stock. This means we are able to satisfy all of the orders that require 18" wheels – both those that upgraded tires (6), and those that did not (2). So we *could* write a query like the following:
SELECT OrderID
FROM dbo.CarOrders
WHERE WheelFlag IN (2,6);
In real life, of course, you can't really do that; what if more options are added later, like wheel locks, lifetime wheel warranty, or multiple tire options? You don't want to have to write a series of IN() values for every possible combination. Instead we can write a BITWISE AND operation, to find all the rows where the 2nd bit is set, such as:
DECLARE @Flag TINYINT = 2;
SELECT OrderID
FROM dbo.CarOrders
WHERE WheelFlag & @Flag = @Flag;
This gets me the same results as the IN() query, but if I compare them using SQL Sentry Plan Explorer, the performance is quite different:
It's easy to see why. The first uses an index seek to isolate the rows that satisfy the query, with a filter on the WheelFlag column:
The second uses a scan, coupled with an implicit convert, and terribly inaccurate statistics. All due to the BITWISE AND operator:
So what does this mean? At the heart of it, this tells us that the BITWISE AND operation is not sargable.
But all hope is not lost.
If we ignore the DRY principle for a moment, we can write a slightly more efficient query by being a bit redundant in order to take advantage of the index on the WheelFlag column. Assuming that we're after any WheelFlag option above 0 (no upgrade at all), we can re-write the query this way, telling SQL Server that the WheelFlag value must be at least the same value as flag (which eliminates 0 and 1), and then adding the supplemental information that it also must contain that flag (thus eliminating 5).
SELECT OrderID
FROM dbo.CarOrders
WHERE WheelFlag >= @Flag
AND WheelFlag & @Flag = @Flag;
The >= portion of this clause is obviously covered by the BITWISE portion, so this is where we violate DRY. But because this clause we've added is sargable, relegating the BITWISE AND operation to a secondary search condition still yields the same result, and the overall query yields better performance. We see a similar index seek to the hard-coded version of the query above, and while the estimates are even further off (something that may be addressed as a separate issue), reads are still lower than with the BITWISE AND operation alone:
We can also see that a filter is used against the index, which we didn't see when using the BITWISE AND operation alone:
Conclusion
Don't be afraid to repeat yourself. There are times when this information can help the optimizer; even though it may not be entirely intuitive to *add* criteria in order to improve performance, it's important to understand when additional clauses help whittle the data down for the end result rather than making it "easy" for the optimizer to find the exact rows on its own.
Good Post Aaron.
I think that since the extra condition is forcing a Range Scan, the performance benefit may vary based on the @Flag value passed where larger values will perform comparable to the IN list whereas the smaller values will have performance similiar to the scan.
A possible alternative which can give performance similiar to the in list is using a temp table as follows.
Thanks Roji,
Point taken about the range scan, however I don't think it could ever be *worse* than the bitwise operator alone, even though that is the variation most people will write (and it will always perform well when the table is still small).
And yes the @table variable can be a workaround, but it feels kind of "dirty" to me. :-)
Aaron,
I am in complete agreement with you about the point you are trying to make about providing redundant conditions to the optimizer. I also agree that the RANGE scan will not be *practically* worse than an index scan. But the RANGE (FORWARD) SCAN with only a START predicate introduce a unique performance problem. Users will experience extremely different performance for diifferent values of @Flag which returns the same number of rows.
"dirty" is subjective :). But from a consistent, predictable performance point of view, the temp table approach beats the other approaches discussed.
PS: Plan Explorer is pretty impressive!
Thanks Roji. And I guess there's always another layer of the onion. :-) And thanks again for your kind words the other day about aspfaq.com. It's amazing how often my content from that site still comes up today…
Hey,
great and interesting post!
I'm a bit sad about your interpretation of the DRY principle as you are not actually violating it at all.
The DRY principle states that "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system." Violating DRY, for example, would be interpreting the WheelFlag in the service code and in the client as well. In which case, both the TSQL code, the .NET code (service, for example) and Javascrip (Client, for example) hold the same "version of the truth" about how a WheelFlag should be interpret. This could then be solved by (for example) generating the TSQL, .NET and Javascript code based on the same XML file when building. THIS (and only this) is considered DRY.
Thanks
Jan
Thanks Jan. I agree with your explicit interpretation of DRY, but I also find that it is often extended in casual conversations to cover things where the spirit of the principle is involved, like the one I've described here (where you're providing the optimizer essentially redundant information) or cases where you put formatting into a user-defined function so you don't have to repeat it in two different queries. Sometimes these are good things, sometimes they are not.
True, however isn't this just proper use of OOP like incapsulating logic in small reusable components (classes/functions/…)?
Let's avoid discussing forever, I feel based on your reply to my comment that we both agree on the explicit interpretation of DRY and the spirit of the principle, I just found it slightly less obvious from your blog post. Which I still consider a great piece of writing, by the way!
Take care,
Jan
Thank You Aaron. asfaq.com and Vyas's site was two resources that immensely helped me in the early stages of my career. No wonder they still come up in disucussions.
maybe a little late to this party and just interested in bitwise programming I just played around with the code in this article and found that this wil only work for the given wheelflags and combinations 2,6.
how come the WheelFlag 6 is found as a result when only wheelflag 2 is given and when I want to find only the Wheelflag 4 it also comes up with the 5 and 6.
DECLARE @Flag TINYINT = 2;
SELECT OrderID
FROM dbo.CarOrders
WHERE WheelFlag & @Flag = @Flag;
this means 6 & 2 = 2
can anybody explain this?
in short: what does the & do ?
thanks for any reply
The documentation (e.g. here) will probably do a better job of explaining bitwise than I ever could.
Generally, my recommendation is to avoid encoding multiple pieces of information in a single value, for multiple reasons – including the driver behind this post.
This looks like an old post but if I may add this: with deceiving sql to do a seek on the bitwise column you get a wrong estimate which affects the other paths if you have this as part of a complex query.