Aaron Bertrand

Performance Myths : Oversizing string columns

SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Erin Stellato is a Principal Consultant with SQLskills and a Microsoft Data Platform MVP.

Erin’s Posts

Performance Myths : Oversizing varchar/nvarchar columns is okay

Back in March, I started a series on pervasive performance myths in SQL Server. One belief I encounter from time to time is that you can oversize varchar or nvarchar columns without any penalty.

Let's assume you are storing e-mail addresses. In a previous life, I dealt with this quite a bit – at the time, RFC 3696 stated that an e-mail address could be 320 characters (64chars@255chars). A newer RFC, #5321, now acknowledges that 254 characters is the longest an e-mail address could be. And if any of you have an address that long, well, maybe we should chat. :-)

Now, whether you go by the old standard or the new one, you do have to support the possibility that someone will use all the characters allowed. Which means you have to use 254 or 320 characters. But what I've seen people do is not bother researching the standard at all, and just assume that they need to support 1,000 characters, 4,000 characters, or even beyond.

So let's take a look at what happens when we have tables with an e-mail address column of varying size, but storing the exact same data:

CREATE TABLE dbo.Email_V320
(
  id int IDENTITY PRIMARY KEY, email varchar(320)
);

CREATE TABLE dbo.Email_V1000
( 
  id int IDENTITY PRIMARY KEY, email varchar(1000)
);

CREATE TABLE dbo.Email_V4000
(
  id int IDENTITY PRIMARY KEY, email varchar(4000)
);

CREATE TABLE dbo.Email_Vmax
(
  id int IDENTITY PRIMARY KEY, email varchar(max)
);

Now, let's generate 10,000 fictitious e-mail address from system metadata, and populate all four tables with the same data:

INSERT dbo.Email_V320(email) SELECT TOP (10000) 
  REPLACE(LEFT(LEFT(c.name, 64) + '@' + LEFT(o.name, 128) + '.com', 254), ' ', '')
  FROM sys.all_columns AS c
  INNER JOIN sys.all_objects AS o
  ON c.[object_id] = o.[object_id]
  INNER JOIN sys.all_columns AS c2
  ON c.[object_id] = c2.[object_id]
  ORDER BY NEWID();

INSERT dbo.Email_V1000(email) SELECT email FROM dbo.Email_V320;
INSERT dbo.Email_V4000(email) SELECT email FROM dbo.Email_V320;
INSERT dbo.Email_Vmax (email) SELECT email FROM dbo.Email_V320;

-- let's rebuild
ALTER INDEX ALL ON dbo.Email_V320  REBUILD;
ALTER INDEX ALL ON dbo.Email_V1000 REBUILD;
ALTER INDEX ALL ON dbo.Email_V4000 REBUILD;
ALTER INDEX ALL ON dbo.Email_Vmax  REBUILD;

To validate that each table contains exactly the same data:

SELECT AVG(LEN(email)), MAX(LEN(email)) FROM dbo.Email_;

All four of those yield 35 and 77 for me; your mileage may vary. Let's also make sure that all four tables occupy the same number of pages on disk:

SELECT o.name, COUNT(p.[object_id])
  FROM sys.objects AS o
  CROSS APPLY sys.dm_db_database_page_allocations
    (DB_ID(), o.object_id, 1, NULL, 'LIMITED') AS p
  WHERE o.name LIKE N'Email[_]V[^2]%'
  GROUP BY o.name;

All four of those queries yield 89 pages (again, your mileage may vary).

Now, let's take a typical query that results in a clustered index scan:

SELECT id, email FROM dbo.Email_;

If we look at things like duration, reads, and estimated costs, they all seem the same:

This may lull people into a false assumption that there is no performance impact at all. But if we look just a little bit closer, on the tooltip for the clustered index scan in each plan, we see a difference that may come into play in other, more elaborate queries:

From here we see that, the bigger the column definition, the higher the estimated row and data size. In this simple query, the I/O cost (0.0512731) is the same across all of the queries, regardless of definition, because the clustered index scan has to read all of the data anyway.

But there are other scenarios where this estimated row and total data size will have an impact: operations that require additional resources, such as sorts. Let's take this ridiculous query that doesn't serve any real purpose, other than to require multiple sort operations:

SELECT /* V  */ ROW_NUMBER() OVER (PARTITION BY email ORDER BY email DESC), 
    email, REVERSE(email), SUBSTRING(email, 1, CHARINDEX('@', email)) 
  FROM dbo.Email_V  
  GROUP BY REVERSE(email), email, SUBSTRING(email, 1, CHARINDEX('@', email))  
  ORDER BY REVERSE(email), email;

We run these four queries and we see the plans all look like this:

However that warning icon on the SELECT operator only appears on the 4000/max tables. What is the warning? It's an excessive memory grant warning, introduced in SQL Server 2016. Here is the warning for varchar(4000):

And for varchar(max):

Let's look a little closer and see what is going on, at least according to sys.dm_exec_query_stats:

SELECT 
  [table] = SUBSTRING(t.[text], 1, CHARINDEX(N'*/', t.[text])),
  s.last_elapsed_time, 
  s.last_grant_kb, 
  s.max_ideal_grant_kb
FROM sys.dm_exec_query_stats AS s 
CROSS APPLY sys.dm_exec_sql_text(s.sql_handle) AS t
WHERE t.[text] LIKE N'%/*%dbo.'+N'Email_V%' 
ORDER BY s.last_grant_kb;

Results:

In my scenario, the duration was not impacted by the memory grant differences (except for the max case), but you can clearly see the linear progression that coincides with the declared size of the column. Which you can use to extrapolate what would happen on a system with insufficient memory. Or a more elaborate query against a much larger data set. Or significant concurrency. Any of those scenarios could require spills in order to process the sort operations, and duration would almost certainly be affected as a result.

But where do these larger memory grants come from? Remember, it's the same query, against the exact same data. The problem is that, for certain operations, SQL Server has to take into account how much data *might* be in a column. It doesn't do this based on actually profiling the data, and it can't make any assumptions based on the <= 201 histogram step values. Instead, it has to estimate that every row holds a value half of the declared column size. So for a varchar(4000), it assumes every e-mail address is 2,000 characters long.

When it's not possible to have an e-mail address longer than 254 or 320 characters, there is nothing to gain by over-sizing, and there is plenty to potentially lose. Increasing the size of a variable-width column later is much easier than dealing with all the downside now.

Of course, oversizing char or nchar columns can have much more obvious penalties.