Aaron Bertrand

Please help with STRING_SPLIT improvements

March 20, 2020 by in T-SQL Queries | 8 Comments
SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Paul Randal, CEO of SQLskills, writes about knee-jerk performance tuning, DBCC, and SQL Server internals.

Paul’s Posts

We're mid-cycle between releases, where we're not yet hearing about any of the features planned for SQL Server vNext. This is probably the best time to press Microsoft for improvements, as long as we can back our requests with legitimate business cases. In SQL Server 2016, STRING_SPLIT solved a long-missing gap in a language that, admittedly, was not intended for complicated string processing. And that is what I want to bring up today.

For years before SQL Server 2016 (and for years since), we've written our own versions, improved them over time, and even argued about whose was fastest. We blogged about every microsecond we could gain and I, for one, have stated multiple times, "this is my last post about splitting strings!" Yet here we are.

I'll always argue that table-valued parameters are the right way to break apart strings. But while believe these comma-separated blobs of text should never be exposed to the database in that form, splitting strings continues to be a prevalent use case — a couple of my blog posts here are in the top 5 in views every single day.

So, why are people still trying to split strings with table-valued functions when a superior replacement exists? Some, I'm sure, because they're still on older versions, stuck in an older compatibility level, or can't get away from splitting strings at all because TVPs aren't supported by their language or ORM. For the rest, while STRING_SPLIT is both convenient and efficient, it isn't perfect. It has restrictions that carry some friction and that make replacing existing function calls with a native call either cumbersome or impossible.

Here is my list.

These limitations are not exhaustive, but I've listed the important ones in my priority order (and Andy Mallon blogged about this today too):

  • Single-character delimiter
    It seems the function was created with only the dead-simple use case in mind: CSV. People have more complex strings than 1,2,3 or A|B|C, and they're often fed to their databases from systems outside of their control. As I describe in this answer and this tip, there are ways to work around this (really inefficient replace operations), but they're really ugly and, quite frankly, undo all the performance benefits offered by the native implementation. Also, some of the friction with this one specifically comes down to: "Well, PostgreSQL's string_to_array handles multiple character delimiters, so why can't SQL Server?"
    Implementation: Increase the max size of separator.
  • No indication of input order
    The output of the function is a set and, inherently, sets have no order. And while in most cases you will see an input string like bob,ted,frank come out in that order (bob ted frank), there is no guarantee (with or without a sloppy (ORDER BY (SELECT NULL)) hack). Many home-built functions include an output column to indicate the ordinal position in the string, which can be important if the list is arranged in a defined order or exact ordinal position has some significance.
    Implementation: Add an option to include ordinal position column in the output.
  • Output type is based only on input
    The output column of the function is fixed to either varchar or nvarchar, and is determined precisely by the length of the entire input string, not the length of the longest element. So, you have a list of 25 letters, the output type is at least varchar(51). For longer strings, this can trickle down to issues with memory grants, depending on usage, and can introduce problems if the consumer relies on another data type being output (say, int, which functions sometimes specify to avoid implicit conversions later). As a workaround, users sometimes create their own temp tables or table variables, and dump the output of the function there before interacting with it, which can lead to performance issues.
    Implementation: Add an option to specify the output type of value.
  • Can't ignore empty elements or trailing delimiters
    When you have a string like a,,,b,, you might expect only two elements to be output, since the other three are empty. Most custom TVFs I've seen trim off trailing delimiters and/or filter out zero-length strings, but STRING_SPLIT returns all 5 rows. This makes it difficult to swap in the native function because you also have to add wrapping logic to eliminate these entities.
    Implementation: Add an option to ignore empty elements.
  • Can't filter duplicates
    This is probably a less common request, and easy to solve by using DISTINCT or GROUP BY, but a lot of functions do this automatically for you. No real difference in performance in these cases, but there is if it's something you forget to add yourself (think a large list, with a lot of duplicates, joining to a large table).

    Implementation: Add an option to filter out duplicates.

Here is the business case.

Those all sound theoretical, but here is the business case, which I can assure you is very real. At Wayfair, we have a substantial SQL Server estate, and we have literally dozens of different teams who have created their own table-valued functions over the years. Some are better than others, but they're all called from thousands and thousands of lines of code. We recently started a project where we are trying to replace them with calls to STRING_SPLIT, but we ran into blocking cases involving several of the above limitations.

Some are easy to work around, using a wrapper function. But the single character delimiter limitation forced us to evaluate the awful workaround using REPLACE, and this proved to eliminate the performance benefit we expected, making us pump the brakes. And in those cases, we lost a key bargaining chip in pushing for upgrades to compatibility level (not all databases are on 130, never mind 140). On those instances, we are losing out not just on STRING_SPLIT improvements, but also on other 130+ performance improvements we would be enjoying if STRING_SPLIT had been compelling enough on its own to push for the compat level upgrade.

So, I'm asking for your help.

Please visit this feedback item:

Vote it up! More importantly, leave a comment describing real use cases you have that make STRING_SPLIT a pain or a non-starter for you. Votes alone are not enough but, with enough tangible and qualitative feedback, there's a chance they might start taking these gaps seriously.

I feel like supporting multi-character delimiters (even, say, expanding from [n]varchar(1) to [n]varchar(5)) is an unintrusive improvement that will unblock many people who share my scenario. Other enhancements may be harder to implement, some requiring overloads and/or language enhancements, so I don't expect all of these fixes in vNext. But even one minor improvement would reiterate that STRING_SPLIT was a worthwhile investment, and that it is not going to be abandoned (like, say, contained databases, one of the more famous drive-by features).

Thanks for listening!