SQL Server 2008 introduced sparse columns as a method to reduce storage for null values and provide more extensible schemas. The trade-off is that there is additional overhead when you store and retrieve non-NULL values. I was interested in understanding the cost for storing non-NULL values, after talking to a customer who was using this data type in a staging environment. They're looking to optimize write performance, and I wondered whether the use of sparse columns had any effect, as their method required inserting a row into the table, then updating it. I created a contrived example for this demo, explained below, to determine if this was a good methodology for them to use.
As a quick review, remember that when you create a column for a table which allows NULL values, if it is a fixed-length column (e.g. an INT), it will always consume the whole column width on the page even when the column is NULL. If it is a variable-length column (e.g. VARCHAR), it will consume at least two bytes in the column offset array when NULL, unless the columns are after the last populated column (see Kimberly's blog post Column order doesn't matter…generally, but – IT DEPENDS). A sparse column does not require any space on the page for NULL values, whether it's a fixed-length or variable-length column, and regardless of what other columns are populated in the table. The trade off is that when a sparse column is populated, it takes four (4) more bytes of storage than a non-sparse column. For example:
|Column type||Storage requirement|
|BIGINT column, non-sparse, with no value||8 bytes|
|BIGINT column, non-sparse, with a value||8 bytes|
|BIGINT column, sparse, with no value||0 bytes|
|BIGINT column, sparse, with a value||12 bytes|
Therefore, it's essential to confirm that the storage benefit outweighs the potential performance hit of retrieval – which may be negligible based on the balance of reads and writes against the data. The estimated space savings for different data types is documented in the Books Online link provided above.
I set up four different scenarios for testing, described below, and every table had an ID column (INT), a Name column (VARCHAR(100)), and a Type column (INT), and then 997 NULLABLE columns.
|Test ID||Table Description||DML Operations|
|1||997 columns of INT data type, NULLABLE, non-sparse||Insert one row at a time, populating ID, Name, Type, and ten (10) random NULLABLE columns|
|2||997 columns of INT data type, NULLABLE, sparse||Insert one row at a time, populating ID, Name, Type, and ten (10) random NULLABLE columns|
|3||997 columns of INT data type, NULLABLE, non-sparse||Insert one row at a time, populating ID, Name, Type only, then update the row, adding values for ten (10) random NULLABLE columns|
|4||997 columns of INT data type, NULLABLE, sparse||Insert one row at a time, populating ID, Name, Type only, then update the row, adding values for ten (10) random NULLABLE columns|
|5||997 columns of VARCHAR data type, NULLABLE, non-sparse||Insert one row at a time, populating ID, Name, Type, and ten (10) random NULLABLE columns|
|6||997 columns of VARCHAR data type, NULLABLE, sparse||Insert one row at a time, populating ID, Name, Type, and ten (10) random NULLABLE columns|
|7||997 columns of VARCHAR data type, NULLABLE, non-sparse||Insert one row at a time, populating ID, Name, Type only, then update the row, adding values for ten (10) random NULLABLE columns|
|8||997 columns of VARCHAR data type, NULLABLE, sparse|
Each test was run twice with a data set of 10 million rows. The attached scripts can be used to replicate testing, and the steps were as follows for each test:
- Create a new database with pre-sized data and log files
- Create the appropriate table
- Snapshot wait statistics and file statistics
- Note the start time
- Execute the DML (one insert, or one insert and one update) for 10 million rows
- Note the stop time
- Snapshot wait statistics and file statistics and write to a logging table in a separate database on separate storage
- Snapshot dm_db_index_physical_stats
- Drop the database
Testing was done on a Dell PowerEdge R720 with 64GB of memory and 12GB allocated to the SQL Server 2014 SP1 CU4 instance. Fusion-IO SSDs were used for data storage for the database files.
Test results are presented below for each test scenario.
In all cases, it took less time (average 11.6 minutes) to populate the table when sparse columns were used, even when the row was first inserted, then updated. When the row was first inserted, then updated, the test took almost twice as long to run compared to when the row was inserted, as twice as many data modifications were executed.
Average duration for each test scenario
|Test ID||Average Percentage||Average Wait (seconds)|
The wait statistics were consistent for all tests and no conclusions can be made based on this data. The hardware sufficiently met the resource demands in all test cases.
Average IO (read and write) per database file
In all cases, the tests with sparse columns generated less IO (notably writes) compared to non-sparse columns.
Index Physical Stats
|Test case||Row count||Total page count (clustered index)||Total space (GB)||Average Space Used for leaf pages in CI (%)||Average Record Size (bytes)|
Significant differences exist in space use between the non-sparse and sparse tables. This is most notable when looking at test cases 1 and 3, where a fixed-length data type was used (INT), compared to test cases 5 and 7, where a variable length data type was used (VARCHAR(255)). The integer columns consumes disk space even when NULL. The variable length columns consume less disk space, as only two bytes are used in the offset array for NULL columns, and no bytes for those NULL columns that are after the last populated column in the row.
Further, the process of inserting a row and then updating it causes fragmentation for the variable-length column test (case 7), compared to simply inserting the row (case 5). The table size almost doubles when the insert is followed by the update, due to page splits which occur when updating the rows, which leaves the pages half full (versus 90% full).
In conclusion, we see a significant reduction in disk space and IO when sparse columns are used, and they perform slightly better than non-sparse columns in our simple data modification tests (note that retrieval performance should also be considered; perhaps the subject of another post).
Sparse columns have a very specific usage scenario and it's important to examine the amount of disk space saved, based on the data type for the column and the number of columns that will typically be populated in the table. In our example, we had 997 sparse columns, and we only populated 10 of them. At most, in the case where the data type used was integer, a row at the leaf level of the clustered index would consume 188 bytes (4 bytes for the ID, 100 bytes max for the Name, 4 bytes for the type, and then 80 bytes for 10 columns). When 997 columns were non-sparse, then 4 bytes were allocated for every column, even when NULL, so each row was at least 4,000 bytes at the leaf level. In our scenario, sparse columns are absolutely acceptable. But if we populated 500 or more sparse columns with values for an INT column, then the space savings is lost, and modification performance may no longer be better.
Depending on the data type for your columns, and the expected number of columns to be populated of the total, you may want to perform similar testing to ensure that, when using sparse columns, insert performance and storage are comparable or better than when using non-sparse columns. For cases when not all columns are populated, sparse columns are definitely worth considering.