Aaron Bertrand

Removing the default trace – Part 3

SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Paul Randal, CEO of SQLskills, writes about knee-jerk performance tuning, DBCC, and SQL Server internals.

Paul’s Posts

[ Part 1 | Part 2 | Part 3 ]

In part 1 of this series, I explained how I arrived at the conclusion that we should disable the default trace. In part 2, I showed the Extended Events session that I deployed to capture all file size change events. In this post, I want to show the views I created to make consumption of the event data easier for people, and also to explain some caveats.

Digestible views

First, I created a view that would expose the important bits from the Extended Events session data, and put it in the utility database that exists on every instance:

CREATE VIEW dbo.vwFileSizeChanges
AS
  WITH FileInfo(XEPath) AS
  (
    SELECT LEFT(BasePath,COALESCE(NULLIF(CHARINDEX(SessName,BasePath)-1,-1),0)) + SessName + N'*.xel' 
    FROM
    (
      SELECT xmlsrc.data.value(N'(@name)[1]', N'nvarchar(max)'), SessName
      FROM 
      (
        SELECT CONVERT(xml,target_data), s.[name]
          FROM sys.dm_xe_session_targets AS t
          INNER JOIN sys.dm_xe_sessions AS s
          ON s.[address] = t.event_session_address
          WHERE s.[name] = N'FileSizeChanges'
      ) AS xefile (TargetData, SessName)
      CROSS APPLY TargetData.nodes(N'//EventFileTarget/File') AS xmlsrc(data)
    ) AS InnerData(BasePath, SessName)
  ),
  SessionData([EventData]) AS 
  (
    SELECT CONVERT(xml, TargetData.event_data) FROM FileInfo CROSS APPLY 
      sys.fn_xe_file_target_read_file(FileInfo.XEPath, NULL, NULL, NULL) AS TargetData
  ), 
  src AS
  (
    SELECT 
      EndTimeUTC   = x.d.value(N'(@timestamp)[1]',                                  N'datetime2'),
      DatabaseID   = x.d.value(N'(data  [@name="database_id"]/value)[1]',           N'int'),
      [FileName]   = x.d.value(N'(data  [@name="file_name"]/value)[1]',             N'sysname'),
      Duration     = x.d.value(N'(data  [@name="duration"]/value)[1]',              N'int'),
      FileType     = x.d.value(N'(data  [@name="file_type"]/text)[1]',              N'varchar(4)'),
      Culprit      = x.d.value(N'(action[@name="sql_text"]/value)[1]',              N'nvarchar(max)'),
      IsAutomatic  = x.d.value(N'(data  [@name="is_automatic"]/value)[1]',          N'varchar(5)'),
      ChangeKB     = x.d.value(N'(data  [@name="size_change_kb"]/value)[1]',        N'bigint'),
      Principal    = x.d.value(N'(action[@name="server_principal_name"]/value)[1]', N'sysname'),
      username     = x.d.value(N'(action[@name="username"]/value)[1]',              N'sysname'),
      AppName      = x.d.value(N'(action[@name="client_app_name"]/value)[1]',       N'sysname'),
      HostName     = x.d.value(N'(action[@name="client_hostname"]/value)[1]',       N'sysname')
      --, [EventData] -- raw XML to troubleshoot specific events
    FROM SessionData CROSS APPLY EventData.nodes('/event') AS x(d)
  )
  SELECT 
    DatabaseName    = DB_NAME(DatabaseID), 
    [FileName], 
    DurationSeconds = CONVERT(decimal(18,3),Duration/1000000.0),
    StartTimeUTC    = CONVERT(datetime2(3), DATEADD(MICROSECOND, -Duration, EndTimeUTC)), 
    EndTimeUTC      = CONVERT(datetime2(3), EndTimeUTC),
    FileType, 
    Culprit = CASE WHEN Culprit IS NULL AND AppName LIKE N'Repl%' 
                   THEN AppName ELSE Culprit END, 
    IsAutomatic, 
    ChangeMB  = CONVERT(decimal(18,3), ChangeKB / 1024.0), 
    Principal = COALESCE([Principal], COALESCE(NULLIF(username,N''),N'?')),
    HostName, 
    App = CASE WHEN AppName LIKE N'%Management Studio%Query%' 
                    THEN N'SSMS - Query Window'
               WHEN AppName LIKE N'%Management Studio%'       
                    THEN N'SSMS - GUI!'
               ELSE AppName END--, [EventData] -- raw XML to troubleshoot specific events
  FROM src;

Now, when someone wants to review recent file size change events on any server, they run:

SELECT 
  FROM UtilityDatabase.dbo.vwFileSizeChanges
  ORDER BY StartTimeUTC DESC;

When you disable the default trace, the trace files don't get deleted, so any events from before that change are still reviewable. I can borrow from the fact that the default trace is hard-coded to the same path as SERVERPROPERTY(N'ErrorLogFileName'), and create a second view that unions the above data with more data from the default trace:

CREATE VIEW dbo.vwFileSizeChanges_IncludingTrace
AS
  WITH dst AS
  (
    SELECT s,e,d 
      FROM (VALUES ('20190310','20191103',240),('20191103','20200308',300),
                   ('20200308','20201101',240),('20201101','20210314',300),
                   ('20210314','20211107',240)) AS dst(s,e,d)
      -- arbitrary date range to support DST conversions going a year+ each way
      -- will add 2022, 2023, etc. later (if DST is still a thing then)
  ),vars(TracePath) AS
  (
    SELECT REVERSE(SUBSTRING(p, CHARINDEX(N'\', p), 260)) + N'log.trc' FROM 
      (SELECT REVERSE((CONVERT(nvarchar(max), SERVERPROPERTY(N'ErrorLogFileName'))))) AS s(p)
  ), 
  trc AS
  (
    SELECT 
      t.DatabaseName, 
      t.[FileName], 
      DurationSeconds = CONVERT(decimal(18,3), t.Duration/1000000.0),
      StartTimeUTC = CONVERT(datetime2(3), DATEADD(MINUTE, COALESCE(st1.d,0), t.StartTime)),
      EndTimeUTC   = CONVERT(datetime2(3), DATEADD(MINUTE, COALESCE(st2.d,0), t.EndTime)),
      FileType = CASE WHEN t.EventClass IN (92, 94) THEN 'Data' 
	                  WHEN t.EventClass IN (93, 95) THEN 'Log' END,
      Culprit  = CASE WHEN t.TextData IS NULL AND t. ApplicationName LIKE N'Repl%' 
                      THEN t.ApplicationName ELSE t.TextData END, 
      IsAutomatic = 'true', 
      ChangeMB = CONVERT(bigint, t.IntegerData)*8/1024, 
      Principal = t.LoginName, 
      t.HostName, 
      App = CASE WHEN t.ApplicationName LIKE N'%Management Studio%Query%' 
                      THEN N'SSMS - Query Window'
                 WHEN t.ApplicationName LIKE N'%Management Studio%'
                      THEN N'SSMS - GUI!'
                 ELSE t.ApplicationName END --, [EventData] = CONVERT(xml, NULL)
    FROM vars CROSS APPLY sys.fn_trace_gettable(vars.TracePath, DEFAULT) AS t
    LEFT OUTER JOIN dst AS st1 ON  t.StartTime >= DATEADD(HOUR,2,st1.s) 
                               AND t.StartTime <  DATEADD(HOUR,2,st1.e)
    LEFT OUTER JOIN dst AS st2 ON  t.EndTime   >= DATEADD(HOUR,2,st2.s) 
                               AND t.EndTime   <  DATEADD(HOUR,2,st2.e)
    WHERE t.EventClass IN (92,93)
  )
  SELECT src='trace', * FROM trc
  UNION ALL
  SELECT src='xe',    * FROM dbo.vwFileSizeChanges;

This view adjusts the trace data (captured in Eastern time on all of our servers) to UTC, and handles DST where appropriate as well. If the data falls outside the range of the CTE called dst, it will be expressed in Eastern time instead (and you can easily fix this by adding more DST ranges). There is an additional column called src so you can query the old trace data using:

SELECT 
  FROM UtilityDatabase.dbo.vwFileSizeChanges_IncludingTrace
  WHERE src = 'trace'
  ORDER BY StartTimeUTC DESC;

Caveats

There's no such thing as a free lunch! While I am confident that removing the default trace will have zero or, much more likely, a positive impact on our workloads, there are some things to keep in mind for your environment if you choose to follow my path:

  1. Databases are not permanent

    In my Extended Events session definition, I chose not to implement collect_database_name, and in the view above you can see that I resolve this at runtime using DB_NAME(database_id). There is a risk here, in that someone could create a database, perform a bunch of activity that creates file churn and disk thrashing, then drop the database. The database_id exposed in the XML is no longer meaningful in this case, and DB_NAME() will return NULL.

    I chose this outcome over relying solely on database name, because the above chain of events is far less likely than a database rename (where a database_id will stay the same). In that case, you might be looking for events that happened to a database, but searching using the wrong (current) name, depending on when the events happened.

    If you want to be able to use one or the other, you could use the following session definition instead:

    ...
    ADD EVENT sqlserver.database_file_size_change
    (
      SET collect_database_name = (1)  
      ACTION
      (
        sqlserver.sql_text,
    ...

    This can't be free either, or it would happen by default, but I will confess I've never tested the impact of adding that to the collection.

  2. SSMS Reports will be slightly less reliable

    A side effect of disabling the default trace is disrupting some of Management Studio's "Standard Reports." I polled our teams before I did this, and you'll want to do the same to make sure your users don't rely on any of these. You'll also want to remind them that the reports currently can't be relied upon anyway, for the same reason I can't rely on the default trace directly - they can only pull data that's still in the trace. An empty report doesn't necessarily mean that no events happened; it could smply mean that the information is no longer available. If this is really where a team wanted to consume this information, I could make sure it is reliable by shipping them custom reports that use a more trustworthy source.

    The following reports are the ones that I could see derive at least some of their information from the default trace, and why we don't need the reports even if they could be trusted:

    Schema Changes History We already have source control, a rigorous review / deploy process, and DDL triggers in place that capture information about schema changes.
    Configuration Changes History
    and
    Memory Consumption
    Our monitoring tool tells us about instance-level configuration changes, so this report in SSMS is redundant.
    Login Failures These are in the error log (and event viewer) because, as a standard, we enable "Failed logins only" auditing for all SQL Server instances. Some servers have additional formal auditing for compliance reasons as well.
    Disk Usage Among other information, this lists the autogrowth and autoshrink events from the default trace, which we now capture using Extended Events.
    Backup and Restore Events This information is readily available in msdb.dbo.backupset if we ever need it, but this is also rolled into our automation around backup and restore (we are never looking at the default trace for this information).
    Database Consistency History As with backups, we have automation built around DBCC CHECKDB; if someone went outside of this and ran something manually, it will still show up in the error log. And we have much more control over how long we keep error logs, and how often we recycle them. We recycle nightly so that it is easier to find an event that we suspect happened on a given day in the past.

Conclusion

This was a fun but complicated project, and I am satisfied with the outcome so far. Thanks for riding along!

[ Part 1 | Part 2 | Part 3 ]