Bucketizing date and time data

August 11, 2021, by Itzik Ben-Gan, in T-SQL Queries

Bucketizing date and time data involves organizing data in groups representing fixed intervals of time for analytical purposes. Often the input is time series data stored in a table where the rows represent measurements taken at regular time intervals. For example, the measurements could be temperature and humidity readings taken every 5 minutes, and you want to group the data using hourly buckets and compute aggregates like average per hour. Even though time series data is a common source for bucket-based analysis, the concept is just as relevant to any data that involves date and time attributes and associated measures. For example, you might want to organize sales data in fiscal year buckets and compute aggregates like the total sales value per fiscal year. In this article, I cover two methods for bucketizing date and time data. One is using a function called DATE_BUCKET, which at the time of writing is only available in Azure SQL Edge. Another is using a custom calculation that emulates the DATE_BUCKET function, which you can use in any version, edition, and flavor of SQL Server and Azure SQL Database.
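To make that concrete, here's a minimal preview sketch of such an hourly aggregation, assuming a hypothetical dbo.TemperatureReadings table with dt, temperature, and humidity columns, and using the classic DATEADD/DATEDIFF flooring trick anchored at the base date (both techniques are developed in detail later in the article):

SELECT
  -- floor dt to the start of its hour (0 stands for the base date, January 1st, 1900)
  DATEADD(hour, DATEDIFF(hour, 0, dt), 0) AS hourstart,
  AVG(temperature) AS avgtemperature,
  AVG(humidity) AS avghumidity
FROM dbo.TemperatureReadings -- hypothetical table
GROUP BY DATEADD(hour, DATEDIFF(hour, 0, dt), 0)
ORDER BY hourstart;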

In my examples, I’ll use the sample database TSQLV5. You can find the script that creates and populates TSQLV5 here and its ER diagram here.

DATE_BUCKET

As mentioned, the DATE_BUCKET function is currently available only in Azure SQL Edge. SQL Server Management Studio already has IntelliSense support, as shown in Figure 1:

Figure 1: IntelliSense support for DATE_BUCKET in SSMS

The function’s syntax is as follows:

DATE_BUCKET ( <date part>, <bucket width>, <timestamp>[, <origin>] )

The input origin represents an anchor point on the arrow of time. It can be of any of the supported date and time data types. If unspecified, the default is January 1st, 1900, at midnight. You can then imagine the timeline as being divided into discrete intervals starting with the origin point, where the length of each interval is based on the inputs bucket width and date part. The former is the quantity and the latter is the unit. For example, to organize the timeline in 2-month units, you would specify 2 as the bucket width input and month as the date part input.
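For example, on Azure SQL Edge the following call buckets the current time into such 2-month units anchored at the default origin, returning the start of the current 2-month bucket:

SELECT DATE_BUCKET(month, 2, SYSDATETIME());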

The input timestamp is an arbitrary point in time that needs to be associated with its containing bucket. Its data type needs to match the data type of the input origin. The input timestamp is the date and time value associated with the measures you’re capturing.

The output of the function is then the starting point of the containing bucket. The data type of the output is that of the input timestamp.

If it wasn’t obvious already, usually you would use the DATE_BUCKET function as a grouping set element in the query’s GROUP BY clause and naturally return it in the SELECT list as well, along with aggregated measures.

Still a bit confused about the function, its inputs, and its output? Maybe a specific example with a visual depiction of the function’s logic would help. I’ll start with an example that uses input variables and later in the article demonstrate the more typical way you would use it as part of a query against an input table.

Consider the following example:

DECLARE 
  @timestamp   AS DATETIME2 = '20210510 06:30:00',
  @bucketwidth AS INT = 2,
  @origin      AS DATETIME2 = '20210115 00:00:00';

SELECT DATE_BUCKET(month, @bucketwidth, @timestamp, @origin);

You can find a visual depiction of the function’s logic in Figure 2.

Figure 2: Visual depiction of the DATE_BUCKET function’s logic

As you can see in Figure 2, the origin point is the DATETIME2 value January 15th, 2021, midnight. If this origin point seems a bit odd, you’d be right in intuitively sensing that normally you would use a more natural one like the beginning of some year, or the beginning of some day. In fact, you’d often be satisfied with the default, which as you recall is January 1st, 1900 at midnight. I intentionally wanted to use a less trivial origin point to be able to discuss certain complexities that might not be relevant when using a more natural one. More on this shortly.

The timeline is then divided into discrete 2-month intervals starting with the origin point. The input timestamp is the DATETIME2 value May 10th, 2021, 6:30 AM.

Observe that the input timestamp is part of the bucket that starts on March 15th, 2021, midnight. Indeed, the function returns this value as a DATETIME2-typed value:

---------------------------
2021-03-15 00:00:00.0000000

Emulating DATE_BUCKET

Unless you’re using Azure SQL Edge, if you want to bucketize date and time data, for the time being you would need to create your own custom solution to emulate what the DATE_BUCKET function does. Doing so is not overly complex, but it’s not too simple either. Dealing with date and time data often involves tricky logic and pitfalls that you need to be careful about.

I’ll build the calculation in steps and use the same inputs I used with the DATE_BUCKET example I showed earlier:

DECLARE 
  @timestamp   AS DATETIME2 = '20210510 06:30:00',
  @bucketwidth AS INT = 2,
  @origin      AS DATETIME2 = '20210115 00:00:00';

Make sure you include this part before each of the code samples I’ll show if you actually want to run the code.

In Step 1, you use the DATEDIFF function to compute the difference in date part units between origin and timestamp. I’ll refer to this difference as diff1. This is done with the following code:

SELECT DATEDIFF(month, @origin, @timestamp) AS diff1;

With our sample inputs, this expression returns 4.

The tricky part here is that you need to compute how many whole units of date part exist between origin and timestamp. With our sample inputs, there are 3 whole months between the two and not 4. The reason the DATEDIFF function reports 4 is that, when it computes the difference, it only looks at the requested part of the inputs and higher parts but not lower parts. So, when you ask for the difference in months, the function only cares about the year and month parts of the inputs and not about the parts below the month (day, hour, minute, second, etc.). Indeed, there are 4 months between January 2021 and May 2021, yet only 3 whole months between the full inputs.
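You can observe this boundary-counting behavior in isolation with a quick test; the first call spans only one day yet crosses a month boundary, while the second spans 30 days without crossing one:

SELECT
  DATEDIFF(month, '20210131', '20210201') AS cross1day,   -- returns 1
  DATEDIFF(month, '20210101', '20210131') AS cross30days; -- returns 0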

The purpose of Step 2 then is to compute how many whole units of date part exist between origin and timestamp. I’ll refer to this difference as diff2. To achieve this, you can add diff1 units of date part to origin. If the result is greater than timestamp, you subtract 1 from diff1 to compute diff2, otherwise subtract 0 and therefore use diff1 as diff2. This can be done using a CASE expression, like so:

SELECT
  DATEDIFF(month, @origin, @timestamp)
    - CASE
        WHEN DATEADD(
               month, 
               DATEDIFF(month, @origin, @timestamp), 
               @origin) > @timestamp 
          THEN 1
        ELSE 0
      END AS diff2;

This expression returns 3, which is the number of whole months between the two inputs.

Recall that earlier I mentioned I intentionally used an origin point that is not a natural one, like the round beginning of a period, so that I could discuss certain complexities that arise in such cases. For instance, if you use month as the date part, and the exact beginning of some month (the 1st of some month at midnight) as the origin, you can safely skip Step 2 and use diff1 as diff2. That's because origin + diff1 can never be > timestamp in such a case. However, my goal is to provide a logically equivalent alternative to the DATE_BUCKET function that works correctly for any origin point, common or not. So I'll include the logic for Step 2 in my examples; just remember that when you identify cases where this step isn't relevant, you can safely remove the part where you subtract the output of the CASE expression.
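To see this for yourself, here's a quick variation on the Step 2 code that uses the natural origin January 1st, 2021; both diff1 and diff2 come out as 4:

DECLARE 
  @timestamp AS DATETIME2 = '20210510 06:30:00',
  @origin    AS DATETIME2 = '20210101 00:00:00';

SELECT
  DATEDIFF(month, @origin, @timestamp) AS diff1,
  DATEDIFF(month, @origin, @timestamp)
    - CASE
        WHEN DATEADD(
               month, 
               DATEDIFF(month, @origin, @timestamp), 
               @origin) > @timestamp 
          THEN 1
        ELSE 0
      END AS diff2;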

In Step 3 you identify how many units of date part there are in whole buckets that exist between origin and timestamp. I’ll refer to this value as diff3. This can be done with the following formula:

diff3 = diff2 / <bucket width> * <bucket width>

The trick here is that when using the division operator / in T-SQL with integer operands, you get integer division. For instance, 3 / 2 in T-SQL is 1 and not 1.5. The expression diff2 / <bucket width> gives you the number of whole buckets that exist between origin and timestamp (1 in our case). Multiplying the result by bucket width then gives you the number of units of date part that exist within those whole buckets (2 in our case). You can verify the integer division behavior with a quick test:
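SELECT 3 / 2 AS intdivision, 3.0 / 2 AS decimaldivision; -- returns 1 and 1.500000

This formula translates to the following expression in T-SQL: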

SELECT
  ( DATEDIFF(month, @origin, @timestamp)
      - CASE
          WHEN DATEADD(
                 month, 
                 DATEDIFF(month, @origin, @timestamp), 
                 @origin) > @timestamp 
            THEN 1
          ELSE 0
        END )  / @bucketwidth * @bucketwidth AS diff3;

This expression returns 2, which is the number of months in the whole 2-month buckets that exist between the two inputs.

In Step 4, which is the final step, you add diff3 units of date part to origin to compute the start of the containing bucket. Here’s the code to achieve this:

SELECT DATEADD(
         month, 
         ( DATEDIFF(month, @origin, @timestamp)
             - CASE
                 WHEN DATEADD(
                        month, 
                        DATEDIFF(month, @origin, @timestamp), 
                        @origin) > @timestamp 
                   THEN 1
                 ELSE 0
               END ) / @bucketwidth * @bucketwidth,
         @origin);

This code generates the following output:

---------------------------
2021-03-15 00:00:00.0000000

As you recall, this is the same output produced by the DATE_BUCKET function for the same inputs.

I suggest that you try this expression with various inputs and parts. I’ll show a few examples here, but feel free to try your own.

Here’s an example where origin falls slightly later within the month than timestamp does (the 10th at 06:30:01 versus the 10th at 06:30:00):

DECLARE 
  @timestamp   AS DATETIME2 = '20210510 06:30:00',
  @bucketwidth AS INT = 2,
  @origin      AS DATETIME2 = '20210110 06:30:01';

-- SELECT DATE_BUCKET(month, @bucketwidth, @timestamp, @origin);

SELECT DATEADD(
         month, 
         ( DATEDIFF(month, @origin, @timestamp)
             - CASE
                 WHEN DATEADD(
                        month, 
                        DATEDIFF(month, @origin, @timestamp), 
                        @origin) > @timestamp 
                   THEN 1
                 ELSE 0
               END ) / @bucketwidth * @bucketwidth,
         @origin);

This code generates the following output:

---------------------------
2021-03-10 06:30:01.0000000

Notice that the start of the containing bucket is in March. Because origin + 4 months (May 10th, 06:30:01) overshoots timestamp by one second, there are only 3 whole months between the inputs, which make up just 1 whole 2-month bucket, so the bucket starts 2 months after origin.

Here’s an example where origin is at the same point within the month as timestamp:

DECLARE 
  @timestamp   AS DATETIME2 = '20210510 06:30:00',
  @bucketwidth AS INT = 2,
  @origin      AS DATETIME2 = '20210110 06:30:00';

-- SELECT DATE_BUCKET(month, @bucketwidth, @timestamp, @origin);

SELECT DATEADD(
         month, 
         ( DATEDIFF(month, @origin, @timestamp)
             - CASE
                 WHEN DATEADD(
                        month, 
                        DATEDIFF(month, @origin, @timestamp), 
                        @origin) > @timestamp 
                   THEN 1
                 ELSE 0
               END ) / @bucketwidth * @bucketwidth,
         @origin);

This code generates the following output:

---------------------------
2021-05-10 06:30:00.0000000

Notice that this time the start of the containing bucket is in May. Here origin + 4 months lands exactly on timestamp, so all 4 months count as whole months, making up 2 whole 2-month buckets, and the bucket starts 4 months after origin.

Here’s an example with 4-week buckets:

DECLARE 
  @timestamp   AS DATETIME2 = '20210303 21:22:11',
  @bucketwidth AS INT = 4,
  @origin      AS DATETIME2 = '20210115';

-- SELECT DATE_BUCKET(week, @bucketwidth, @timestamp, @origin);

SELECT DATEADD(
         week, 
         ( DATEDIFF(week, @origin, @timestamp)
             - CASE
                 WHEN DATEADD(
                        week, 
                        DATEDIFF(week, @origin, @timestamp), 
                        @origin) > @timestamp 
                   THEN 1
                 ELSE 0
               END ) / @bucketwidth * @bucketwidth,
         @origin);

Notice that the code uses the week part this time.

This code generates the following output:

---------------------------
2021-02-12 00:00:00.0000000

Here’s an example with 15-minute buckets:

DECLARE 
  @timestamp   AS DATETIME2 = '20210203 21:22:11',
  @bucketwidth AS INT = 15,
  @origin      AS DATETIME2 = '19000101';
  
-- SELECT DATE_BUCKET(minute, @bucketwidth, @timestamp);

SELECT DATEADD(
         minute, 
         ( DATEDIFF(minute, @origin, @timestamp)
             - CASE
                 WHEN DATEADD(
                        minute, 
                        DATEDIFF(minute, @origin, @timestamp), 
                        @origin) > @timestamp 
                   THEN 1
                 ELSE 0
               END ) / @bucketwidth * @bucketwidth,
         @origin);

This code generates the following output:

---------------------------
2021-02-03 21:15:00.0000000

Notice that the part is minute. In this example, you want 15-minute buckets starting at the top of the hour, so an origin point at the top of any hour would work. In fact, any origin point with a minute unit of 00, 15, 30, or 45 and zeros in the lower parts would work, with any date and hour. So the default that the DATE_BUCKET function uses for the input origin would work. Of course, when using the custom expression, you have to be explicit about the origin point. So, to mirror the DATE_BUCKET function's default, you can use the base date at midnight like I do in the above example.

Incidentally, can you see why this is a good example where it is perfectly safe to skip Step 2 in the solution? Since the origin has zeros in all parts below the minute, adding diff1 minutes to it can never overshoot timestamp. If you indeed choose to skip Step 2, you get the following code:

DECLARE 
  @timestamp   AS DATETIME2 = '20210203 21:22:11',
  @bucketwidth AS INT = 15,
  @origin      AS DATETIME2 = '19000101';

-- SELECT DATE_BUCKET(minute, @bucketwidth, @timestamp);

SELECT DATEADD( 
    minute, 
    ( DATEDIFF( minute, @origin, @timestamp ) ) / @bucketwidth * @bucketwidth, 
    @origin 
);

Clearly, the code becomes significantly simpler when Step 2 is not needed.
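If you plan to reuse the emulation in many queries, one option (my suggestion here, not something from the code shown so far) is to wrap it in an inline table-valued function. Since the date part argument of DATEADD and DATEDIFF can't be parameterized, you need one function per date part; here's a hypothetical month-based wrapper named dbo.DateBucketMonth:

CREATE OR ALTER FUNCTION dbo.DateBucketMonth
  (@bucketwidth AS INT, @timestamp AS DATETIME2, @origin AS DATETIME2)
RETURNS TABLE
AS
RETURN
  -- same Step 1-4 logic as above, with the month part hardcoded
  SELECT
    DATEADD(
      month, 
      ( DATEDIFF(month, @origin, @timestamp)
          - CASE
              WHEN DATEADD(
                     month, 
                     DATEDIFF(month, @origin, @timestamp), 
                     @origin) > @timestamp 
                THEN 1
              ELSE 0
            END ) / @bucketwidth * @bucketwidth,
      @origin) AS startofbucket;
GO

-- usage with the earlier sample inputs; returns 2021-03-15 00:00:00.0000000
SELECT startofbucket
FROM dbo.DateBucketMonth(2, '20210510 06:30:00', '20210115 00:00:00');

Being an inline table-valued function, it gets expanded into the calling query's plan, so unlike a scalar UDF it doesn't inhibit optimization, and you can apply it to a table column with CROSS APPLY.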

Grouping and aggregating data by date and time buckets

In some cases, bucketizing date and time data doesn't require sophisticated functions or unwieldy expressions. For example, suppose you want to query the Sales.OrderValues view in the TSQLV5 database, group the data yearly, and compute total order counts and values per year. Clearly, it's enough to use the YEAR(orderdate) function as the grouping set element, like so:

USE TSQLV5;

SELECT
  YEAR(orderdate) AS orderyear,
  COUNT(*) AS numorders,
  SUM(val) AS totalvalue
FROM Sales.OrderValues
GROUP BY YEAR(orderdate)
ORDER BY orderyear;

This code generates the following output:

orderyear   numorders   totalvalue
----------- ----------- -----------
2017        152         208083.99
2018        408         617085.30
2019        270         440623.93

But what if you wanted to bucketize the data by your organization's fiscal year? Some organizations use a fiscal year for accounting, budget, and financial reporting purposes that is not aligned with the calendar year. Say, for instance, that your organization's fiscal year runs from October to September, and is denoted by the calendar year in which the fiscal year ends. So an event that took place on October 3rd, 2018 belongs to the fiscal year that started on October 1st, 2018, ended on September 30th, 2019, and is denoted by the year 2019.

This is quite easy to achieve with the DATE_BUCKET function, like so:

DECLARE 
  @bucketwidth AS INT = 1,
  @origin      AS DATE = '19001001'; -- this is Oct 1 of some year

SELECT 
  YEAR(startofbucket) + 1 AS fiscalyear,
  COUNT(*) AS numorders,
  SUM(val) AS totalvalue
FROM Sales.OrderValues
  CROSS APPLY ( VALUES( DATE_BUCKET( year, @bucketwidth, orderdate, @origin ) ) ) AS A(startofbucket)
GROUP BY startofbucket
ORDER BY startofbucket;

And here’s the code using the custom logical equivalent of the DATE_BUCKET function:

DECLARE 
  @bucketwidth AS INT = 1,
  @origin      AS DATE = '19001001'; -- this is Oct 1 of some year

SELECT
  YEAR(startofbucket) + 1 AS fiscalyear,
  COUNT(*) AS numorders,
  SUM(val) AS totalvalue
FROM Sales.OrderValues
  CROSS APPLY ( VALUES( 
    DATEADD(
      year, 
      ( DATEDIFF(year, @origin, orderdate)
          - CASE
              WHEN DATEADD(
                     year, 
                     DATEDIFF(year, @origin, orderdate), 
                     @origin) > orderdate 
                THEN 1
              ELSE 0
            END ) / @bucketwidth * @bucketwidth,
      @origin) ) ) AS A(startofbucket)
GROUP BY startofbucket
ORDER BY startofbucket;

This code generates the following output:

fiscalyear  numorders   totalvalue
----------- ----------- -----------
2017        70          79728.58
2018        370         563759.24
2019        390         622305.40

I used variables here for the bucket width and the origin point to make the code more generalized, but you can replace those with constants if you’re always using the same ones, and then simplify the computation as appropriate.
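For instance, since the bucket width here is 1, the division and multiplication by @bucketwidth are no-ops and can be dropped; a sketch of the simplified query with the origin inlined as a constant could look like this:

SELECT
  YEAR(startofbucket) + 1 AS fiscalyear,
  COUNT(*) AS numorders,
  SUM(val) AS totalvalue
FROM Sales.OrderValues
  CROSS APPLY ( VALUES( 
    DATEADD(
      year, 
      DATEDIFF(year, CAST('19001001' AS DATE), orderdate)
        - CASE
            WHEN DATEADD(
                   year, 
                   DATEDIFF(year, CAST('19001001' AS DATE), orderdate), 
                   CAST('19001001' AS DATE)) > orderdate 
              THEN 1
            ELSE 0
          END,
      CAST('19001001' AS DATE)) ) ) AS A(startofbucket)
GROUP BY startofbucket
ORDER BY startofbucket;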

As a slight variation of the above, suppose your fiscal year runs from July 15th of one calendar year to July 14th of the next calendar year, and is denoted by the calendar year that the beginning of the fiscal year belongs to. So an event that took place on July 18th, 2018 belongs to fiscal year 2018. An event that took place on July 14th, 2018 belongs to fiscal year 2017. Using the DATE_BUCKET function, you would achieve this like so:

DECLARE 
  @bucketwidth AS INT = 1,
  @origin      AS DATE = '19000715'; -- July 15 marks start of fiscal year

SELECT
  YEAR(startofbucket) AS fiscalyear, -- no need to add 1 here
  COUNT(*) AS numorders,
  SUM(val) AS totalvalue
FROM Sales.OrderValues
  CROSS APPLY ( VALUES( DATE_BUCKET( year, @bucketwidth, orderdate, @origin ) ) ) AS A(startofbucket)
GROUP BY startofbucket
ORDER BY startofbucket;

You can see the changes compared to the previous example in the comments.

And here’s the code using the custom logical equivalent to the DATE_BUCKET function:

DECLARE 
  @bucketwidth AS INT = 1,
  @origin      AS DATE = '19000715';

SELECT
  YEAR(startofbucket) AS fiscalyear,
  COUNT(*) AS numorders,
  SUM(val) AS totalvalue
FROM Sales.OrderValues
  CROSS APPLY ( VALUES( 
    DATEADD(
      year, 
      ( DATEDIFF(year, @origin, orderdate)
          - CASE
              WHEN DATEADD(
                     year, 
                     DATEDIFF(year, @origin, orderdate), 
                     @origin) > orderdate 
                THEN 1
              ELSE 0
            END ) / @bucketwidth * @bucketwidth,
      @origin) ) ) AS A(startofbucket)
GROUP BY startofbucket
ORDER BY startofbucket;

This code generates the following output:

fiscalyear  numorders   totalvalue
----------- ----------- -----------
2016        8           12599.88
2017        343         495118.14
2018        479         758075.20

Obviously, there are alternative methods you could use in specific cases. Take the next-to-last example, where the fiscal year runs from October to September and is denoted by the calendar year in which the fiscal year ends. In such a case, you could use the following, much simpler, expression:

YEAR(orderdate) + CASE WHEN MONTH(orderdate) BETWEEN 10 AND 12 THEN 1 ELSE 0 END

And then your query would look like this:

SELECT
  fiscalyear,
  COUNT(*) AS numorders,
  SUM(val) AS totalvalue
FROM Sales.OrderValues
  CROSS APPLY ( VALUES( 
    YEAR(orderdate)
      + CASE 
          WHEN MONTH(orderdate) BETWEEN 10 AND 12 THEN 1 
          ELSE 0 
        END ) ) AS A(fiscalyear)
GROUP BY fiscalyear
ORDER BY fiscalyear;

However, if you want a generalized solution that would work in many more cases, and that you could parameterize, you would naturally want to use the more general form. If you have access to the DATE_BUCKET function, that’s great. If you don’t, you can use the custom logical equivalent.
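As a sketch, if you created a year-based variant of the hypothetical wrapper function shown earlier (say, dbo.DateBucketYear, with the same body but the year part), the parameterized fiscal year query from before could be written as:

DECLARE 
  @bucketwidth AS INT = 1,
  @origin      AS DATETIME2 = '19001001'; -- Oct 1 of some year

SELECT
  YEAR(startofbucket) + 1 AS fiscalyear,
  COUNT(*) AS numorders,
  SUM(val) AS totalvalue
FROM Sales.OrderValues
  CROSS APPLY dbo.DateBucketYear(@bucketwidth, orderdate, @origin) AS A
GROUP BY startofbucket
ORDER BY startofbucket;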

Conclusion

The DATE_BUCKET function is quite a handy function that enables you to bucketize date and time data. It’s useful for handling time series data, but also for bucketizing any data that involves date and time attributes. In this article I explained how the DATE_BUCKET function works and provided a custom logical equivalent in case the platform you’re using doesn’t support it.