Aaron Bertrand

Bad habits : Counting rows the hard way

October 30, 2014 by in T-SQL Queries | 10 Comments
SentryOne eBooks

In these books, you will find useful, hand-picked articles that will help give insight into some of your most vexing performance problems. These articles were written by several of the SQL Server industry’s leading experts, including Paul White, Paul Randal, Jonathan Kehayias, Erin Stellato, Glenn Berry, Aaron Bertrand, and Joe Sack.

Free Download

Featured Author

Paul Randal, CEO of SQLskills, writes about knee-jerk performance tuning, DBCC, and SQL Server internals.

Paul’s Posts

[See an index of all bad habits / best practices posts]

One of the slides in my recurring Bad Habits & Best Practices presentation is entitled "Abusing COUNT(*)." I see this abuse quite a bit out in the wild, and it takes several forms.

How many rows in the table?

I usually see this:

SELECT @count = COUNT(*) FROM dbo.tablename;

SQL Server has to run a blocking scan against the entire table in order to derive this count. That is expensive. This information is stored in the catalog views and DMVs, and you can obtain it without all of that I/O or blocking:

SELECT @count = SUM(p.rows)
  FROM sys.partitions AS p
  INNER JOIN sys.tables AS t
  ON p.[object_id] = t.[object_id]
  INNER JOIN sys.schemas AS s
  ON t.[schema_id] = s.[schema_id]
  WHERE p.index_id IN (0,1) -- heap or clustered index
  AND t.name = N'tablename'
  AND s.name = N'dbo';

(You can get the same information from sys.dm_db_partition_stats, but in that case change p.rows to p.row_count (yay consistency!). In fact, this is the same view that sp_spaceused uses to derive the count – and while it is much easier to type than the above query, I recommend against using it just to derive a count because of all of the extra calculations it does – unless you want that information, too. Also note that it uses metadata functions that do not obey your outer isolation level, so you could end up waiting on blocking when you call this procedure.)

Now, it's true that these views are not 100%, to-the-microsecond accurate. Unless you're using a heap, a more reliable result can be obtained from the sys.dm_db_index_physical_stats() column record_count (yay consistency again!), however this function can have a performance impact, can still block, and may be even more expensive than a SELECT COUNT(*) – it has to do the same physical operations, but has to calculate additional information depending on the mode (such as fragmentation, which you don't care about in this case). The warning in the documentation tells part of the story, relevant if you are using Availability Groups (and likely affects Database Mirroring in a similar way):

If you query sys.dm_db_index_physical_stats on a server instance that is hosting an AlwaysOn readable secondary replica, you might encounter a REDO blocking issue. This is because this dynamic management view acquires an IS lock on the specified user table or view that can block requests by a REDO thread for an X lock on that user table or view.

The documentation also explains why this number might not be reliable for a heap (and also gives them a quasi-pass for the rows vs. records inconsistency):

For a heap, the number of records returned from this function might not match the number of rows that are returned by running a SELECT COUNT(*) against the heap. This is because a row may contain multiple records. For example, under some update situations, a single heap row may have a forwarding record and a forwarded record as a result of the update operation. Also, most large LOB rows are split into multiple records in LOB_DATA storage.

So I would lean toward sys.partitions as the way to optimize this, sacrificing some marginal bit of accuracy.


    "But I can't use the DMVs; my count needs to be super accurate!"

    A "super accurate" count is actually pretty meaningless. Let's consider that your only option for a "super accurate" count is to lock the entire table and prohibit anyone from adding or deleting any rows (but without preventing shared reads), e.g.:

    SELECT @count = COUNT(*) FROM dbo.table_name WITH (TABLOCK); -- not TABLOCKX!

    So, your query is humming along, scanning all of the data, working toward that "perfect" count. Meanwhile, write requests are getting blocked, and waiting. Suddenly, when your accurate count is returned, your locks on the table are released, and all those write requests that were queued up and waiting, start firing off all kinds of inserts, updates and deletes against your table. How "super accurate" is your count now? Was it worth getting an "accurate" count that is already horribly obsolete? If the system isn't busy, then this isn't so much of an issue – but if the system isn't busy, I'd argue pretty strongly that the DMVs will be pretty darned accurate.

    You could have used NOLOCK instead, but that just means writers can change the data while you're reading it, and leads to other problems, too (I talked about this recently). It's okay for a lot of ballparks, but not if your goal is accuracy. The DMVs will be right on (or at least much closer) in a lot of scenarios, and further away in very few (in fact none that I can think of).

    Finally, you could use Read Committed Snapshot Isolation. Kendra Little has a fantastic post about the snapshot isolation levels, but I'll repeat the list of caveats I mentioned in my NOLOCK article:

    In addition, even with RCSI, getting the "accurate" count takes time (and additional resources in tempdb). By the time the operation is finished, is the count still accurate? Only if nobody has touched the table in the meantime. So one of the benefits of RCSI (readers don't block writers) is wasted.

How many rows match a WHERE clause?

This is a slightly different scenario – you need to know how many rows exist for a certain subset of the table. You can't use the DMVs for this, unless the WHERE clause matches a filtered index or completely covers an exact partition (or multiple).

If your WHERE clause is dynamic, you could use RCSI, as described above.

If your WHERE clause isn't dynamic, you could use RCSI as well, but you could also consider one of these options:

  • Filtered index – for example if you have a simple filter like is_active = 1 or status < 5, then you could build an index like this:
    CREATE INDEX ix_f ON dbo.table_name(leading_pk_column) WHERE is_active = 1;

    Now, you can get pretty accurate counts from the DMVs, since there will be entries representing this index (you just have to identify the index_id instead of relying on heap(0)/clustered index(1)). You do need to consider some of the weaknesses of filtered indexes, however.

  • Indexed view - for example if you are often counting orders by customer, an indexed view could help (though please don't take this as a generic endorsement that "indexed views improve all queries!"):
     
    CREATE VIEW dbo.view_name
    WITH SCHEMABINDING
    AS
      SELECT 
        customer_id, 
        customer_count = COUNT_BIG(*)
      FROM dbo.table_name
      GROUP BY customer_id;
    GO
    
    CREATE UNIQUE CLUSTERED INDEX ix_v ON dbo.view_name(customer_id);

    Now, the data in the view will be materialized, and the count is guaranteed to be synchronized with the table data (there are a couple of obscure bugs where this is not true, such as this one with MERGE, but generally this is reliable). So now you can get your counts per customer (or for a set of customers) by querying the view, at a much lower query cost (1 or 2 reads):

    SELECT customer_count FROM dbo.view_name WHERE customer_id = ;

    There is no free lunch, though. You need to consider the overhead of maintaining an indexed view and the impact it will have on the write portion of your workload. If you don't run this type of query very often, it's unlikely to be worth the trouble.

Does at least one row match a WHERE clause?

This, too, is a slightly different question. But I often see this:

IF (SELECT COUNT(*) FROM dbo.table_name WHERE ) > 0 -- or = 0 for not exists

Since you obviously don't care about the actual count, you only care if at least one row exists, I really think you should change it to the following:

IF EXISTS (SELECT 1 FROM dbo.table_name WHERE )

This at least has a chance of short-circuiting before the end of the table is reached, and will almost always out-perform the COUNT variation (though there are some cases where SQL Server is smart enough to convert IF (SELECT COUNT...) > 0 to a simpler IF EXISTS()). In the absolute worst case scenario, where no row is found (or the first row is found on the very last page in the scan), the performance is going to be the same.
 

[See an index of all bad habits / best practices posts]