The Price of Not Purging

By Erin Stellato

I’m in the process of de-cluttering my house (too late in the summer to try and pass it off as spring cleaning). You know, cleaning out closets, going through the kids’ toys, and organizing the basement. It’s a painful process. When we moved into our house 10 years ago we had SO much room. Now I feel like there’s stuff everywhere, and it makes it harder to find what I’m really looking for, and it takes longer and longer to clean up and organize.

Does this sound like any database you manage?

Many clients that I’ve worked with deal with purging data as an afterthought. At the time of the implementation, everyone wants to save everything. “We never know when we might need it.” After a year or two someone realizes there’s a lot of extra stuff in the database, but now people are afraid to get rid of it. “We need to check with Legal to see if we can delete it.” But no one checks with Legal, or if someone does, Legal goes back to the business owners to ask what to keep, and then the project grinds to a halt. “We cannot come to a consensus about what can be deleted.” The project is forgotten, and then two or four years down the road, the database is suddenly a terabyte, difficult to manage, and people blame all performance issues on database size. You hear the words “partitioning” and “archive database” thrown around, and sometimes you just get to delete a bunch of data, which has its own issues.

Ideally you should decide on your purge strategy before implementation, or within the first six to twelve months of go-live. But since we’re past that stage, let’s look at what impact this extra data can have.

Test Methodology

To set the stage, I took a copy of the Credit database and restored it to my SQL Server 2012 instance. I dropped the three existing nonclustered indexes and added two of my own:

USE [master];
GO

RESTORE DATABASE [Credit]
FROM DISK = N'C:\SQLskills\SampleDatabases\Credit\CreditBackup100.bak'
WITH FILE = 1,
MOVE N'CreditData' TO N'D:\Databases\SQL2012\CreditData.mdf',
MOVE N'CreditLog' TO N'D:\Databases\SQL2012\CreditLog.ldf',
STATS = 5;
GO

ALTER DATABASE [Credit] MODIFY FILE ( NAME = N'CreditData', SIZE = 14680064KB, FILEGROWTH = 524288KB );
GO
ALTER DATABASE [Credit] MODIFY FILE ( NAME = N'CreditLog', SIZE = 2097152KB, FILEGROWTH = 524288KB );
GO

USE [Credit];
GO

DROP INDEX [charge_category_link] ON [dbo].[charge];
DROP INDEX [charge_provider_link] ON [dbo].[charge];
DROP INDEX [charge_statement_link] ON [dbo].[charge];

CREATE NONCLUSTERED INDEX [charge_chargedate] ON [dbo].[charge] ([charge_dt]);
CREATE NONCLUSTERED INDEX [charge_provider] ON [dbo].[charge] ([provider_no]);

I then increased the number of rows in the table to 14.4 million, by re-inserting the original set of rows multiple times, modifying the dates slightly:

-- Re-insert the original rows, shifting charge_dt back 175 days
INSERT INTO [dbo].[charge]
(
 [member_no],
 [provider_no],
 [category_no],
 [charge_dt],
 [charge_amt],
 [statement_no],
 [charge_code]
)
SELECT
 [member_no],
 [provider_no],
 [category_no],
 [charge_dt] - 175,
 [charge_amt],
 [statement_no],
 [charge_code]
FROM [dbo].[charge]
 WHERE [charge_no] BETWEEN 1 AND 2000000;
GO 3

-- Re-insert the original rows, keeping charge_dt unchanged
INSERT INTO [dbo].[charge]
(
 [member_no],
 [provider_no],
 [category_no],
 [charge_dt],
 [charge_amt],
 [statement_no],
 [charge_code]
)
SELECT
 [member_no],
 [provider_no],
 [category_no],
 [charge_dt],
 [charge_amt],
 [statement_no],
 [charge_code]
FROM [dbo].[charge]
 WHERE [charge_no] BETWEEN 1 AND 2000000;
GO 2

-- Re-insert the original rows, shifting charge_dt forward 79 days
INSERT INTO [dbo].[charge]
(
 [member_no],
 [provider_no],
 [category_no],
 [charge_dt],
 [charge_amt],
 [statement_no],
 [charge_code]
)
SELECT
 [member_no],
 [provider_no],
 [category_no],
 [charge_dt] + 79,
 [charge_amt],
 [statement_no],
 [charge_code]
FROM [dbo].[charge]
 WHERE [charge_no] BETWEEN 1 AND 2000000;
GO 3
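
As a sanity check after these inserts (my addition, not part of the original methodology), the row count and date distribution can be confirmed quickly:

-- Verify the row count and the spread of charge_dt values after the data load
SELECT
 COUNT_BIG(*) AS [TotalRows],
 MIN([charge_dt]) AS [EarliestCharge],
 MAX([charge_dt]) AS [LatestCharge]
FROM [dbo].[charge];

-- Row counts by year
SELECT
 YEAR([charge_dt]) AS [ChargeYear],
 COUNT_BIG(*) AS [RowsPerYear]
FROM [dbo].[charge]
GROUP BY YEAR([charge_dt])
ORDER BY [ChargeYear];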

Finally, I set up a test harness to execute a series of statements against the database four times each. The statements are below:

ALTER INDEX ALL ON [dbo].[charge] REBUILD;

DBCC CHECKDB (Credit) WITH ALL_ERRORMSGS, NO_INFOMSGS;

BACKUP DATABASE [Credit]
TO DISK = N'D:\Backups\SQL2012\Credit.bak'
WITH NOFORMAT,
 INIT,
 NAME = N'Credit-Full Database Backup',
 STATS = 10;

SELECT [charge_no], [member_no], [charge_dt], [charge_amt]
 FROM [dbo].[charge]
 WHERE [charge_no] = 841345;

DECLARE @StartDate DATETIME = '1999-07-01';
DECLARE @EndDate DATETIME = '1999-07-31';

SELECT [charge_dt], COUNT([charge_dt])
 FROM [dbo].[charge]
 WHERE [charge_dt] BETWEEN @StartDate AND @EndDate
 GROUP BY [charge_dt];

SELECT [provider_no], COUNT([provider_no])
 FROM [dbo].[charge]
 WHERE [provider_no] = 475
 GROUP BY [provider_no];

SELECT [provider_no], COUNT([provider_no])
 FROM [dbo].[charge]
 WHERE [provider_no] = 140
 GROUP BY [provider_no];

Before each statement I executed

DBCC DROPCLEANBUFFERS;
GO

to clear the buffer pool. Obviously this is not something to execute against a production environment. I did it here to provide a consistent starting point for each test.
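
The harness itself isn’t shown here; as one possible approach (an assumption on my part, not necessarily how the original harness worked), per-statement duration and IO can be captured by snapshotting sys.dm_io_virtual_file_stats, the same DMV used for the IO numbers later in this post, before and after each statement:

-- Snapshot file-level IO stats for the Credit database before the statement
SELECT [file_id], [num_of_reads], [num_of_bytes_read]
INTO #IOBefore
FROM sys.dm_io_virtual_file_stats(DB_ID(N'Credit'), NULL);

DECLARE @StartTime DATETIME2(7) = SYSDATETIME();

-- <statement under test goes here>

DECLARE @ElapsedMS INT = DATEDIFF(MILLISECOND, @StartTime, SYSDATETIME());

-- Diff the snapshot to get reads and bytes read attributable to the statement
SELECT
 [a].[file_id],
 [a].[num_of_reads] - [b].[num_of_reads] AS [reads],
 [a].[num_of_bytes_read] - [b].[num_of_bytes_read] AS [bytes_read],
 @ElapsedMS AS [elapsed_ms]
FROM sys.dm_io_virtual_file_stats(DB_ID(N'Credit'), NULL) AS [a]
INNER JOIN #IOBefore AS [b]
 ON [a].[file_id] = [b].[file_id];

DROP TABLE #IOBefore;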

After each run of the test harness, I increased the size of the dbo.charge table by re-inserting the 14.4 million rows I started with, incrementing charge_dt by one additional year each time. For example:

-- Add another "year" of data: shift the existing rows' charge_dt forward 365 days
INSERT INTO [dbo].[charge]
(
 [member_no],
 [provider_no],
 [category_no],
 [charge_dt],
 [charge_amt],
 [statement_no],
 [charge_code]
)
SELECT
 [member_no],
 [provider_no],
 [category_no],
 [charge_dt] + 365,
 [charge_amt],
 [statement_no],
 [charge_code]
FROM [dbo].[charge]
WHERE [charge_no] BETWEEN 1 AND 14800000;
GO

After the addition of 14.4 million rows, I re-ran the test harness. I repeated this six times, essentially adding six “years” of data. The dbo.charge table started with data from 1999, and after the repeated inserts contained data through 2005.
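
It’s also worth tracking how much the table and database grow between runs; the article refers to the overall database size, and a simple way to check it (my addition, not something shown in the original methodology) is sp_spaceused along with sys.database_files:

-- Table-level size: rows, reserved, data, and index space for dbo.charge
EXEC sp_spaceused N'dbo.charge';

-- Database-level size and unallocated space
EXEC sp_spaceused;

-- Data and log file sizes in MB (size is stored in 8KB pages)
SELECT [name], [size] / 128 AS [SizeMB]
FROM sys.database_files;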

Results

The results from the executions can be seen here:

[Chart: Duration for Maintenance Tasks]

[Chart: Duration for Queries]

The individual statements executed reflect typical database activity. Index rebuilds, integrity checks and backups are part of regular database maintenance. The queries against the charge table represent a singleton lookup as well as three variations of range scans specific to the data in the table.

Index Rebuilds, CHECKDB, and Backups

As expected for the maintenance tasks, duration and IO values increased as more rows were added to the database. The database size increased by a factor of 10, and while the durations did not increase at the same rate, they climbed consistently with each data load. Each maintenance task initially took less than 20 seconds to complete, but as more rows were added, the tasks took almost 1 minute and 20 seconds at 100 million rows (and the index rebuild took over 2 minutes). This reflects the additional time SQL Server needed to complete each task because of the additional data.
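
On your own systems, one quick way to see the same trend for backups (not part of these tests, just a general technique) is the backup history that SQL Server already records in msdb, which includes duration and size:

-- Full backup duration and size history for the Credit database
SELECT
 [database_name],
 [backup_start_date],
 DATEDIFF(SECOND, [backup_start_date], [backup_finish_date]) AS [duration_sec],
 [backup_size] / 1024 / 1024 AS [backup_size_mb]
FROM msdb.dbo.backupset
WHERE [database_name] = N'Credit'
 AND [type] = 'D' -- full database backups only
ORDER BY [backup_start_date];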

Singleton Lookup

The query against dbo.charge for a specific charge_no always produced one row, and would have produced one row regardless of the value used, because charge_no is a unique identity. There is minimal variation for this lookup. As rows are continually added to the table, the index may increase in depth by a level or two (more as the table gets wider), adding a couple of IOs, but this is a singleton lookup with very few IOs.
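
To verify whether an index on dbo.charge has actually gained a level as rows are added, the depth can be checked with INDEXPROPERTY; this is a verification query I’m adding here, not part of the original tests:

-- Number of B-tree levels for each index on dbo.charge
SELECT
 [i].[name] AS [IndexName],
 INDEXPROPERTY([i].[object_id], [i].[name], 'IndexDepth') AS [IndexDepth]
FROM sys.indexes AS [i]
WHERE [i].[object_id] = OBJECT_ID(N'dbo.charge')
 AND [i].[index_id] > 0;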

Range Scans

The query for a date range (charge_dt) was modified after each insert to search the most recent year’s data for July (e.g. '2005-07-01' to '2005-07-31' for the last set of tests), but returned just over 1.2 million rows each time. In a real-world scenario, we wouldn’t expect the same number of rows to be returned for the same month year over year, nor would we expect the same number of rows to be returned for every month in a year. But row counts could stay within the same range between months, with slight increases over time. There are fluctuations in duration for this query, but a review of the IO data captured from sys.dm_io_virtual_file_stats shows consistency in the number of reads.

[Chart: Query IO]

The final two queries, for two different provider_no values, show the true effect of keeping data. In the initial dbo.charge table, provider_no 475 had over 126,000 rows and provider_no 140 had over 1,700 rows. For each 14.4 million rows added, approximately the same number of rows for each provider_no was added. In a production environment, this type of data distribution is not uncommon, and queries for this data may perform well in the first years of the solution, but degrade over time as more rows are added. Query duration increases by a factor of five (from 31 ms to 153 ms) between the initial and final executions for provider_no 475. While that impact may not seem significant, note the parallel increase in IO (above). If this were a query that executed with high frequency, or if there were similar queries that executed regularly, the additional load can add up and affect overall resource usage.

Further, consider the impact when you’re working with tables that have billions of rows and are used in queries with complex joins, and the impact on your regular (and extremely critical) maintenance tasks. Finally, take recoverability time into account. Your disaster recovery plan should be based on restore times, and as database size grows, the database will take longer to restore in its entirety. If you’re not regularly testing and timing your restores, recovering from a disaster could take longer than you thought.
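
One way to keep an eye on that is to regularly restore the most recent full backup to a test instance and note how long it takes; the sketch below reuses the backup taken by the test harness above, but the target database name and file paths are placeholders to adjust for your environment:

-- Time a restore of the latest full backup to a test instance
-- (target database name and paths below are placeholders)
RESTORE DATABASE [Credit_RestoreTest]
FROM DISK = N'D:\Backups\SQL2012\Credit.bak'
WITH FILE = 1,
 MOVE N'CreditData' TO N'D:\Databases\SQL2012\Credit_RestoreTest.mdf',
 MOVE N'CreditLog' TO N'D:\Databases\SQL2012\Credit_RestoreTest.ldf',
 STATS = 10;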

Summary

The examples shown here are simple illustrations of what can happen when a data archiving strategy is not determined during database implementation, and there are many other scenarios to explore and test. Old data that is rarely, if ever, accessed affects more than just space on disk. It can affect query performance and the duration of maintenance tasks. If you manage multiple databases on an instance, one database that holds historical data can affect the performance and maintenance tasks of the others. Further, if reports execute against historical data, they can wreak havoc on an already-busy OLTP environment.

From the beginning, it’s critical to determine the lifespan of the data in a database and put a plan of action in place. Some solutions require keeping all data forever. In that case, employ strategies to keep database size manageable, such as regularly archiving data to a separate table or a separate database. If the data does not need to be stored for years and years, implement a purging strategy that removes it on a regular basis. In this manner, you can throw out the toys that are no longer played with, the clothes that no longer fit, and the random junk you just don’t use every three months…rather than once every 10 years.
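
For a table like dbo.charge, a scheduled purge can be as simple as a batched delete by date; the sketch below is illustrative only, with an arbitrary three-year retention cutoff and batch size:

-- Purge charges older than the retention cutoff in small batches,
-- so a single huge delete doesn't bloat the log or hold locks for too long
DECLARE @Cutoff DATETIME = DATEADD(YEAR, -3, GETDATE());

WHILE 1 = 1
BEGIN
 DELETE TOP (5000)
 FROM [dbo].[charge]
 WHERE [charge_dt] < @Cutoff;

 IF @@ROWCOUNT = 0
  BREAK;

 -- optional: brief pause so concurrent activity isn't starved
 WAITFOR DELAY '00:00:01';
END;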