[See an index of all bad habits / best practices posts]
One of the slides in my recurring Bad Habits & Best Practices presentation is entitled "Abusing
COUNT(*)." I see this abuse quite a bit out in the wild, and it takes several forms.
How many rows in the table?
I usually see this:
SELECT @count = COUNT(*) FROM dbo.tablename;
SQL Server has to run a blocking scan against the entire table in order to derive this count. That is expensive. This information is stored in the catalog views and DMVs, and you can obtain it without all of that I/O or blocking:
SELECT @count = SUM(p.rows) FROM sys.partitions AS p INNER JOIN sys.tables AS t ON p.[object_id] = t.[object_id] INNER JOIN sys.schemas AS s ON t.[schema_id] = s.[schema_id] WHERE p.index_id IN (0,1) -- heap or clustered index AND t.name = N'tablename' AND s.name = N'dbo';
(You can get the same information from
sys.dm_db_partition_stats, but in that case change
p.row_count (yay consistency!). In fact, this is the same view that
sp_spaceused uses to derive the count – and while it is much easier to type than the above query, I recommend against using it just to derive a count because of all of the extra calculations it does – unless you want that information, too. Also note that it uses metadata functions that do not obey your outer isolation level, so you could end up waiting on blocking when you call this procedure.)
Now, it's true that these views are not 100%, to-the-microsecond accurate. Unless you're using a heap, a more reliable result can be obtained from the
record_count (yay consistency again!), however this function can have a performance impact, can still block, and may be even more expensive than a
SELECT COUNT(*) – it has to do the same physical operations, but has to calculate additional information depending on the
mode (such as fragmentation, which you don't care about in this case). The warning in the documentation tells part of the story, relevant if you are using Availability Groups (and likely affects Database Mirroring in a similar way):
The documentation also explains why this number might not be reliable for a heap (and also gives them a quasi-pass for the rows vs. records inconsistency):
So I would lean toward
sys.partitions as the way to optimize this, sacrificing some marginal bit of accuracy.
- Sch-S locks still need to be taken even under RCSI.
- Snapshot isolation levels use row versioning in tempdb, so you really need to test the impact there.
- RCSI can't use efficient allocation order scans; you will see range scans instead.
- Paul White (@SQL_Kiwi) has some great posts you should read about these isolation levels:
"But I can't use the DMVs; my count needs to be super accurate!"
A "super accurate" count is actually pretty meaningless. Let's consider that your only option for a "super accurate" count is to lock the entire table and prohibit anyone from adding or deleting any rows (but without preventing shared reads), e.g.:
SELECT @count = COUNT(*) FROM dbo.table_name WITH (TABLOCK); -- not TABLOCKX!
So, your query is humming along, scanning all of the data, working toward that "perfect" count. Meanwhile, write requests are getting blocked, and waiting. Suddenly, when your accurate count is returned, your locks on the table are released, and all those write requests that were queued up and waiting, start firing off all kinds of inserts, updates and deletes against your table. How "super accurate" is your count now? Was it worth getting an "accurate" count that is already horribly obsolete? If the system isn't busy, then this isn't so much of an issue – but if the system isn't busy, I'd argue pretty strongly that the DMVs will be pretty darned accurate.
You could have used
NOLOCK instead, but that just means writers can change the data while you're reading it, and leads to other problems, too (I talked about this recently). It's okay for a lot of ballparks, but not if your goal is accuracy. The DMVs will be right on (or at least much closer) in a lot of scenarios, and further away in very few (in fact none that I can think of).
Finally, you could use Read Committed Snapshot Isolation. Kendra Little has a fantastic post about the snapshot isolation levels, but I'll repeat the list of caveats I mentioned in my
In addition, even with RCSI, getting the "accurate" count takes time (and additional resources in tempdb). By the time the operation is finished, is the count still accurate? Only if nobody has touched the table in the meantime. So one of the benefits of RCSI (readers don't block writers) is wasted.
How many rows match a WHERE clause?
This is a slightly different scenario – you need to know how many rows exist for a certain subset of the table. You can't use the DMVs for this, unless the
WHERE clause matches a filtered index or completely covers an exact partition (or multiple).
WHERE clause is dynamic, you could use RCSI, as described above.
WHERE clause isn't dynamic, you could use RCSI as well, but you could also consider one of these options:
- Filtered index – for example if you have a simple filter like
is_active = 1or
status < 5, then you could build an index like this:
CREATE INDEX ix_f ON dbo.table_name(leading_pk_column) WHERE is_active = 1;
Now, you can get pretty accurate counts from the DMVs, since there will be entries representing this index (you just have to identify the index_id instead of relying on heap(0)/clustered index(1)). You do need to consider some of the weaknesses of filtered indexes, however.
- Indexed view - for example if you are often counting orders by customer, an indexed view could help (though please don't take this as a generic endorsement that "indexed views improve all queries!"):
CREATE VIEW dbo.view_name WITH SCHEMABINDING AS SELECT customer_id, customer_count = COUNT_BIG(*) FROM dbo.table_name GROUP BY customer_id; GO CREATE UNIQUE CLUSTERED INDEX ix_v ON dbo.view_name(customer_id);
Now, the data in the view will be materialized, and the count is guaranteed to be synchronized with the table data (there are a couple of obscure bugs where this is not true, such as this one with
MERGE, but generally this is reliable). So now you can get your counts per customer (or for a set of customers) by querying the view, at a much lower query cost (1 or 2 reads):
SELECT customer_count FROM dbo.view_name WHERE customer_id = <x>;
There is no free lunch, though. You need to consider the overhead of maintaining an indexed view and the impact it will have on the write portion of your workload. If you don't run this type of query very often, it's unlikely to be worth the trouble.
Does at least one row match a WHERE clause?
This, too, is a slightly different question. But I often see this:
IF (SELECT COUNT(*) FROM dbo.table_name WHERE <some clause>) > 0 -- or = 0 for not exists
Since you obviously don't care about the actual count, you only care if at least one row exists, I really think you should change it to the following:
IF EXISTS (SELECT 1 FROM dbo.table_name WHERE <some clause>)
This at least has a chance of short-circuiting before the end of the table is reached, and will almost always out-perform the
COUNT variation (though there are some cases where SQL Server is smart enough to convert
IF (SELECT COUNT...) > 0 to a simpler
IF EXISTS()). In the absolute worst case scenario, where no row is found (or the first row is found on the very last page in the scan), the performance is going to be the same.
10 thoughts on “Bad habits : Counting rows the hard way”
You could create an indexed view that groups by a dummy column. Either use a synthetic column (like `0`) or use an existing low-cardinality column.
The view matching engine in Enterprise Edition can use such a view to match against a global counting operation. It will aggregate the already aggregated view contents by summing the counts.
This technique can cause a big deal of blocking because writes to different rows can not conflict as part of the view update.
This is my query
FROM sys.partitions AS p
INNER JOIN sys.tables AS t
ON p.[object_id] = t.[object_id]
INNER JOIN sys.schemas AS s
ON t.[schema_id] = s.[schema_id]
WHERE p.index_id IN ( 0, 1 ) — heap or clustered index
GROUP BY t.name
Will this bring the count of all the tables in my database or this query will consider the tables that don't have any index
Every table has at least one row in sys.partitions – if there is no clustered index, it will have index_id = 0, if there is a clustered index, it will have index_id = 1. The query you have will consider all tables, whether or not they have any non-clustered indexes.
It seems the other joins are unnecessary. The simplest query is:
FROM sys.partitions p
WHERE p.object_id = OBJECT_ID('MyTable')
AND p.index_id IN (0,1); — heap or clustered index
Anon, your query ignores schema (see this post) and I avoid many of the built-in metadata functions because they do not obey isolation semantics.
Gary, not sure what you're getting at, but I typically shy away from most of the metadata functions: https://blogs.sentryone.com/aaronbertrand/bad-habits-metadata-helper-functions/
Comments are closed.