Please help with STRING_SPLIT improvements

We're mid-cycle between releases, where we're not yet hearing about any of the features planned for SQL Server vNext. This is probably the best time to press Microsoft for improvements, as long as we can back our requests with legitimate business cases. In SQL Server 2016, STRING_SPLIT solved a long-missing gap in a language that, admittedly, was not intended for complicated string processing. And that is what I want to bring up today.

For years before SQL Server 2016 (and for years since), we've written our own versions, improved them over time, and even argued about whose was fastest. We blogged about every microsecond we could gain and I, for one, have stated multiple times, "this is my last post about splitting strings!" Yet here we are.

I'll always argue that table-valued parameters are the right way to break apart strings. But while I believe these comma-separated blobs of text should never be exposed to the database in that form, splitting strings continues to be a prevalent use case — a couple of my blog posts here are in the top 5 in views every single day.

So, why are people still trying to split strings with table-valued functions when a superior replacement exists? Some, I'm sure, because they're still on older versions, stuck in an older compatibility level, or can't get away from splitting strings at all because TVPs aren't supported by their language or ORM. For the rest, while STRING_SPLIT is both convenient and efficient, it isn't perfect. It has restrictions that carry some friction and that make replacing existing function calls with a native call either cumbersome or impossible.

Here is my list.

These limitations are not exhaustive, but I've listed the important ones in my priority order (and Andy Mallon blogged about this today too):

Single-character delimiter
It seems the function was created with only the dead-simple use case in mind: CSV. People have more complex strings than 1,2,3 or A|B|C, and they're often fed to their databases from systems outside of their control. As I describe in this answer and this tip, there are ways to work around this (really inefficient replace operations), but they're really ugly and, quite frankly, undo all the performance benefits offered by the native implementation. Also, some of the friction with this one specifically comes down to: "Well, PostgreSQL's string_to_array handles multiple character delimiters, so why can't SQL Server?"
Implementation: Increase the max size of separator.
No indication of input order
The output of the function is a set and, inherently, sets have no order. And while in most cases you will see an input string like bob,ted,frank come out in that order (bob ted frank), there is no guarantee (with or without a sloppy (ORDER BY (SELECT NULL)) hack). Many home-built functions include an output column to indicate the ordinal position in the string, which can be important if the list is arranged in a defined order or exact ordinal position has some significance.
Implementation: Add an option to include ordinal position column in the output.
Output type is based only on input
The output column of the function is fixed to either varchar or nvarchar, and is determined precisely by the length of the entire input string, not the length of the longest element. So, you have a list of 25 letters, the output type is at least varchar(51). For longer strings, this can trickle down to issues with memory grants, depending on usage, and can introduce problems if the consumer relies on another data type being output (say, int, which functions sometimes specify to avoid implicit conversions later). As a workaround, users sometimes create their own temp tables or table variables, and dump the output of the function there before interacting with it, which can lead to performance issues.
Implementation: Add an option to specify the output type of value.
Can't ignore empty elements or trailing delimiters
When you have a string like a,,,b,, you might expect only two elements to be output, since the other three are empty. Most custom TVFs I've seen trim off trailing delimiters and/or filter out zero-length strings, but STRING_SPLIT returns all 5 rows. This makes it difficult to swap in the native function because you also have to add wrapping logic to eliminate these entities.
Implementation: Add an option to ignore empty elements.
Can't filter duplicates
This is probably a less common request, and easy to solve by using DISTINCT or GROUP BY, but a lot of functions do this automatically for you. No real difference in performance in these cases, but there is if it's something you forget to add yourself (think a large list, with a lot of duplicates, joining to a large table).

Implementation: Add an option to filter out duplicates.

Here is the business case.

Those all sound theoretical, but here is the business case, which I can assure you is very real. At Wayfair, we have a substantial SQL Server estate, and we have literally dozens of different teams who have created their own table-valued functions over the years. Some are better than others, but they're all called from thousands and thousands of lines of code. We recently started a project where we are trying to replace them with calls to STRING_SPLIT, but we ran into blocking cases involving several of the above limitations.

Some are easy to work around, using a wrapper function. But the single character delimiter limitation forced us to evaluate the awful workaround using REPLACE, and this proved to eliminate the performance benefit we expected, making us pump the brakes. And in those cases, we lost a key bargaining chip in pushing for upgrades to compatibility level (not all databases are on 130, never mind 140). On those instances, we are losing out not just on STRING_SPLIT improvements, but also on other 130+ performance improvements we would be enjoying if STRING_SPLIT had been compelling enough on its own to push for the compat level upgrade.

So, I'm asking for your help.

Please visit this feedback item:

STRING_SPLIT is not feature complete

Vote it up! More importantly, leave a comment describing real use cases you have that make STRING_SPLIT a pain or a non-starter for you. Votes alone are not enough but, with enough tangible and qualitative feedback, there's a chance they might start taking these gaps seriously.

I feel like supporting multi-character delimiters (even, say, expanding from [n]varchar(1) to [n]varchar(5)) is an unintrusive improvement that will unblock many people who share my scenario. Other enhancements may be harder to implement, some requiring overloads and/or language enhancements, so I don't expect all of these fixes in vNext. But even one minor improvement would reiterate that STRING_SPLIT was a worthwhile investment, and that it is not going to be abandoned (like, say, contained databases, one of the more famous drive-by features).

Thanks for listening!

8 thoughts on “Please help with STRING_SPLIT improvements”

Andy Mallon says:

March 20, 2020 at 2:16 PM

I've written about my own scenario where STRING_SPLIT() falls short over on my blog:
https://am2.co/2015/04/converting-delimited-values-to-a-table-efficiently/
Jeff Moden says:

March 23, 2020 at 5:40 PM

For a lot of what I do, adjacent delimiters in the middle or delimiters at the leading and trailing edges of the string DO actually mean something. I don't mind an option to ignore them but removing them in a non-optional manner would continue to make it a mostly useless function to me and, I think, many others.

Optionally filtering duplicates would be fine so long as it doesn't slow thing down much for those of us that either don't want to eliminate duplicates or those that have more duplicates than they bargained for.

Whatever the case, it must not slow down just because someone uses a MAX datatype like so many other splitters do even when the string to be split is under 8K bytes.

Automatically changing the output type based on the data elements contained would be nice provided that it doesn't slow things down.

Multi-character delimiters would be a great addition provided, again, that it doesn't slow things down. I can't speak for anyone else but 99% of what I do is single character delimited.

The big thing for a lot of us (based on the feedback site) is to return (either optionally or permanently) the ordinal position of the elements from the original string even if the option to skip empty elements comes to pass (no pun intended).
Aaron Bertrand says:

March 23, 2020 at 5:45 PM

Yes. I'm asking for these things to be options, not default behavior (all but the max size of the delimiter explicitly say add an option). In fact all but delimiter length must be added as optional only, otherwise this will disrupt backward compatibility.
Tim Lehner says:

March 24, 2020 at 8:24 PM

create or alter function StringSplit (
@String nvarchar(max)
, @Delimiter nvarchar(4000)
)
returns table with schemabinding
as
return
select [key] + 1 as Ordinal, value as String
from openjson(replace(json_modify(N'[]', N'append $', @String), replace(@Delimiter, '"', '\"'), N'","'))
GO
Jason J Horner says:

April 1, 2020 at 1:57 PM

Another point and perhaps the larger issue here is that with the removal of CLR support for Azure SQL DB, Users no longer have the means to custom implement this type of functionality via CLR routines.

The SQL Server feature surface area is sadly lacking several features common in other platforms. by not providing developers viable options to extend functionality within Azure SQL DB:

Advanced JSON Parsing and manipulation
POSIX Regular expresion support
Advanced spatial operations

Databases like PostgreSQL and Oracle are light years ahead of SQL Server in these areas and to be honest the other engines have had these features for several years. It's time for the SQL Server team to get serious about improving the developer experience with effcient operators for these types of workloads.
Aaron Bertrand says:

April 1, 2020 at 5:14 PM

Hi Tim, yes, there are ways we can add ordinal output to our own UDFs. The purpose of a built-in function, though, was to prevent us from having to use our own UDFs at all. The function needs to be fully functional, though (no pun intended).
Tim Lehner says:

April 1, 2020 at 6:41 PM

Totally agreed, Aaron. It's very odd to me that this function was designed without an ordinal column of some sort, practically relegating it to where-clause usage only. I realize I didn't give much of a context from which to comment, but I've surprisingly found the above solution to be just as fast as any other that you or Jeff Moden have tested (at least in my environment which may be atypical), even though the replace calls would presumably push this toward the slower end of that spectrum.

To answer your direct call to action, I've voted-up the feedback item (and others like it in the past) and would like to add that ordinal position in the original string is the most important improvement to be made, by far. In it's current form, this otherwise promising function is effectively limited to replacing one ugly LIKE anti-pattern, and it's not even any faster from my tests. Variable delimiter width (including 0) would be nice-to-have, but any distinct/empty-element/trailing-delimiter options seem to have equivalent SQL language features or already require custom business logic to deal with a custom format.

I'm curious as to what you're thinking, though. Given the options of ignoring duplicates and/or empty strings, specifying trailing delimiters, and given the following set:

[2,3,,5,8,8,10,]

What is the ordinal of 8? How many elements are there? Does it depend on the optional arguments?

In short, I'm very much with you on wanting a feature-complete built-in function (which we don't yet have) but not going so far as to gold-plate this thing by duplicating DISTINCT, NULLIF, WHERE, or COUNT.

SEND_MAIL_TO could be helpful, though…
(joking!)
Aaron Bertrand says:

April 1, 2020 at 6:50 PM

For me I've found the performance advantages to be huge. But the big blockers I see commonly are multi-character delimiters (a blocker to use the native function) and the others I mentioned that people implement using CLR, as they were long recommended to do (a non-starter for migrating to Azure).

To answer your question, 8 is the 4th element if you ignore empty strings, 5th if you do not. If you use the option to keep duplicates, then you could argue it is _also_ the 5th/6th element, but I'm not sure there has to be _one_ answer in that case. There is an 8 at the 4th ordinal position and there is also an 8 at the 5th ordinal position.

Comments are closed.