November 28, 2017
The guys from the SQL Server Radio podcast posted a blog post from their episode on calculating running totals in SQL Server. Their script (link in the post) covers several different ways to calculate a running total in a query (the value in the current row added to the values in all the rows that came before it). The best method (in my view) is using the SUM function with the OVER clause; most of the other methods involve a cursor or some other messy approach.
I imagine most developers familiar with T-SQL have used the OVER clause with ROW_NUMBER or some other window function, but the OVER clause can also be used with aggregate functions like COUNT, SUM, or AVG.
create table [dbo].[RunningTotal](
    RecordId int identity(1,1) NOT NULL,
    GroupId tinyint NOT NULL,
    Amount decimal(6, 2) NOT NULL
);
insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 12.34);
insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 56.78);
insert into [dbo].[RunningTotal](GroupId, Amount) values (1, 55.66);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 33.33);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 22.10);
insert into [dbo].[RunningTotal](GroupId, Amount) values (2, 12.34);
insert into [dbo].[RunningTotal](GroupId, Amount) values (3, 98.76);
select GroupId, Amount,
    sum(Amount) over(order by RecordId) as RunningSum,
    sum(Amount) over(partition by GroupId order by RecordId) as RunningSumByGroup
from [dbo].[RunningTotal];
This example gives a running total over the whole dataset, plus a running total by Group (GroupId).
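One detail worth noting: with only an ORDER BY, the window frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which groups ties in the ordering column together. Spelling out a ROWS frame makes the running-total intent explicit, and in my experience it usually performs better as well:

```sql
-- Same running total, but with an explicit ROWS frame
select GroupId, Amount,
    sum(Amount) over(order by RecordId
        rows between unbounded preceding and current row) as RunningSum
from [dbo].[RunningTotal];
```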
June 25, 2017
I ran across a programming challenge question on Reddit, asking for a program that will take a number and convert it to its text representation, so 11 becomes Eleven, 103 becomes One Hundred Three, etc.
I’ve posted a script on Github to accomplish this in T-SQL.
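To give a flavor of the approach, here is a minimal sketch that handles only 0 through 99 (this is not the GitHub script, which covers larger numbers; the function name is mine):

```sql
create function dbo.SmallNumberToWords(@n int)
returns varchar(50)
as
begin
    -- Teens are irregular, so handle them first
    if @n between 10 and 19
        return choose(@n - 9, 'Ten', 'Eleven', 'Twelve', 'Thirteen', 'Fourteen',
            'Fifteen', 'Sixteen', 'Seventeen', 'Eighteen', 'Nineteen');

    declare @tens varchar(20) = choose((@n / 10) + 1, '', '', 'Twenty', 'Thirty',
        'Forty', 'Fifty', 'Sixty', 'Seventy', 'Eighty', 'Ninety');
    declare @ones varchar(20) = choose((@n % 10) + 1, 'Zero', 'One', 'Two', 'Three',
        'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine');

    if @n >= 20
        return case when @n % 10 = 0 then @tens else @tens + ' ' + @ones end;

    return @ones;  -- 0 through 9
end
```

The full version is the same idea applied recursively: break the number into groups of three digits, convert each group, and append the scale word (Thousand, Million, and so on).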
Previously, I had created a similar script to convert a Roman numeral to a number.
March 15, 2017
A lot of Data Marts or reporting databases will include a Calendar or Dates table listing all days within a certain range, along with attributes of those days (day of the week, quarter, etc.).
I’ve posted my script on GitHub that will generate these records.
At a previous job, we wanted to track days the office was closed so that we could measure how many business days certain functions took. Determining weekends was easy, but calculating holidays took a little more effort. You could just manually enter the holidays for each year, but it is possible to calculate the holidays for each year in one pass.
Some holidays are the same date every year, like Christmas.
UPDATE dbo.Calendar SET IsHoliday = 1 WHERE [Month] = 12 AND [Day] = 25;
Others are on a specific day of the week, like Labor Day being the first Monday in September (This script is marking US holidays).
UPDATE c
SET IsHoliday = 1
FROM dbo.Calendar as c
INNER JOIN (
    SELECT DateKey,
        ROW_NUMBER() OVER (PARTITION BY [Year] ORDER BY DateKey) AS RowNumber
    FROM dbo.Calendar
    WHERE [Month] = 9
        AND [WeekDay] = 2
) AS z
    ON z.DateKey = c.DateKey
    AND z.RowNumber = 1;
This statement will find the first Monday of each September for the range of dates in the table, and mark it as a holiday.
Another method I saw used was to look for a Monday in September where the day was between 1 and 7, since the first Monday of the month will always be in that range.
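That alternative can be written without the ROW_NUMBER subquery. This assumes the Calendar table has a [Day] (day-of-month) column, as used in the Christmas update above:

```sql
-- Labor Day: the first Monday in September always falls on the 1st through the 7th
UPDATE dbo.Calendar
SET IsHoliday = 1
WHERE [Month] = 9
    AND [WeekDay] = 2
    AND [Day] BETWEEN 1 AND 7;
```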
Of course, the hardest holiday to calculate is Easter. Luckily, I found an algorithm on the US Naval Observatory site to determine the date. I’ve included a function in the Calendar script to calculate the date.
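For reference, here is my own T-SQL transcription of that USNO algorithm (the function in the Calendar script may differ in details). It relies on integer division throughout:

```sql
-- Date of Easter for a given year, per the US Naval Observatory algorithm
create function dbo.EasterDate(@year int)
returns date
as
begin
    declare @c int = @year / 100;
    declare @n int = @year - 19 * (@year / 19);
    declare @k int = (@c - 17) / 25;
    declare @i int = @c - @c / 4 - (@c - @k) / 3 + 19 * @n + 15;
    set @i = @i - 30 * (@i / 30);
    set @i = @i - (@i / 28) * (1 - (@i / 28) * (29 / (@i + 1)) * ((21 - @n) / 11));
    declare @j int = @year + @year / 4 + @i + 2 - @c + @c / 4;
    set @j = @j - 7 * (@j / 7);
    declare @l int = @i - @j;
    declare @m int = 3 + (@l + 40) / 44;       -- month (3 = March, 4 = April)
    declare @d int = @l + 28 - 31 * (@m / 4);  -- day of month
    return datefromparts(@year, @m, @d);
end
```

As a spot check, working the arithmetic through for 2017 yields April 16, which was indeed Easter that year.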
Another thing to consider (which I didn’t include in this script) is that if a holiday falls on a weekend, a business may choose to observe it on the nearest weekday.
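A sketch of that adjustment, assuming the Calendar table has a [Date] column and adding a hypothetical IsObservedHoliday flag (Saturday holidays shift to Friday, Sunday holidays to Monday):

```sql
update c
set IsObservedHoliday = 1
from dbo.Calendar as c
inner join dbo.Calendar as h
    on (datename(weekday, h.[Date]) = 'Saturday'
        and c.[Date] = dateadd(day, -1, h.[Date]))
    or (datename(weekday, h.[Date]) = 'Sunday'
        and c.[Date] = dateadd(day, 1, h.[Date]))
where h.IsHoliday = 1;
```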
March 6, 2017
While writing a query recently, I made use of the BETWEEN operator, which will match all values within a specified range. Without thinking, I put the greater value first:
WHERE RecordId BETWEEN 100 and 90
I was somewhat surprised that no records were returned, since I knew there should be matches. Reversing the order gave me the results I expected:
WHERE RecordID BETWEEN 90 and 100
Running the query and examining the execution plan, I saw that the WHERE clause was transformed to:
WHERE RecordID >= 90 and RecordID <= 100
Seeing this, it becomes obvious why the order would matter.
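If the bounds can arrive in either order (say, as parameters), one defensive pattern is to normalize them before the comparison. @low and @high here are hypothetical parameters:

```sql
WHERE RecordId BETWEEN
    (case when @low <= @high then @low else @high end)
    and (case when @low <= @high then @high else @low end)
```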
MSDN – BETWEEN
January 31, 2017
One recurring argument that comes up with databases is how to choose a primary key for a table. Should a natural key (which may be a string value) be used, or should an integer surrogate key be generated? In terms of query performance, the common wisdom is that integer keys will perform better. I’ve always been doubtful about that claim, although I can see how keys of different sizes would perform differently.
I ran across this post that compared string vs integer key values for query performance.
I wanted to perform my own test to compare the performance. I’ve posted a script on Github to build out some test tables. We end up with a Customer table with a little over 2 million records. We’ll use two different queries to join to a State table, one joining on an integer StateId value, the 2nd joining on a string StateCode. The StateId will be a smallint, and the StateCode a char(2) value, so that both columns are 2 bytes. The Customer table has a clustered index on the Customer ID, with nonclustered indexes on both the StateId and StateCode columns. The State table is a heap (it only has 62 records), since ordering by either StateId or StateCode may give an advantage to that type.
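The two comparison queries look roughly like this (the column names below are my assumption, not necessarily those in the GitHub script):

```sql
-- Join on the smallint surrogate key
select c.CustomerId, s.StateName
from dbo.Customer as c
inner join dbo.State as s
    on s.StateId = c.StateId;

-- Join on the char(2) natural key
select c.CustomerId, s.StateName
from dbo.Customer as c
inner join dbo.State as s
    on s.StateCode = c.StateCode;
```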
Running the two queries, the first (joining on StateId) accounted for 46% of the total run time, with the StateCode string join taking the other 54%, so the integer key was faster in this case. However, examining the query plan showed that Hash joins were used, so we can certainly get a more efficient plan.
I went back and added INCLUDE columns to the two Customer indexes, so that all of the returned attributes are in the index leaf level. Once I made this change, the query plan showed that the queries used Merge joins, and the query cost is the same for both queries.
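The covering-index change looks something like this (the index and column names are illustrative, not taken from the script):

```sql
-- Rebuild the StateId index so the returned column lives in the leaf level,
-- letting the optimizer use a Merge join instead of a Hash join
create nonclustered index IX_Customer_StateId
    on dbo.Customer (StateId)
    include (CustomerName)
    with (drop_existing = on);
```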
So my conclusion is that there is no gain in query performance when joining on strings versus integers, assuming the values are the same size. I need to do more research on Hash joins, but it appears that the hash of a string value may be larger than the hash of an integer value, which would account for the difference in query performance.
October 10, 2016
A discussion came up at my job on the most efficient way to return subtotals and a grand total in one row: a column for each subtotal, plus a Total column summing them.
I had assumed that PIVOT would be the most straightforward method, but others preferred using CASE statements. So which method is more efficient?
I posted a script on GitHub to build a table with 6 million records to run both kinds of queries against.
The PIVOT query:
select [Red], [Blue], [White], ([Red] + [Blue] + [White]) as Total
from (
    select RecordColor from dbo.TestPivotCase
) as a
pivot (
    count(RecordColor) for RecordColor in ([Red], [White], [Blue])
) as p;
The CASE query:
select
    sum(case when RecordColor = 'Red' then 1 else 0 end) as [Red],
    sum(case when RecordColor = 'Blue' then 1 else 0 end) as [Blue],
    sum(case when RecordColor = 'White' then 1 else 0 end) as [White],
    count(*) as [Total]
from dbo.TestPivotCase;
The plans were almost identical in cost, with the PIVOT version slightly cheaper (49.7% vs. 50.3%), and the execution plans themselves were very similar.
So both methods have a similar execution plan and run time, with a slight edge to the PIVOT method.
August 29, 2016
At work, a colleague and I discussed the performance of inner joins versus outer joins, particularly in the case where both types of join return the same number of rows. If it turned out that you always had a match for each inner record, would you pay a penalty for using a left join? In most cases an outer join returns more records, since you would usually have unmatched records.
I posted a script on GitHub to create two tables that would be joined on RecordId.
create table dbo.Test1(
    RecordId int not null identity(1,1) primary key clustered,
    CreatedDate datetime not null default getdate()
);

create table dbo.Test2(
    RecordId int null,
    ItemId tinyint null,
    CreatedDate datetime not null default getdate()
);
I created two million records in Test1, and then 5 Test2 records for each one in Test1. The Test2.RecordId column was left nullable, so we wouldn’t give the optimizer any hints about unmatched records.
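One way to load rows like that (a sketch; the GitHub script may do it differently):

```sql
-- 2 million parent rows; cross joining a large catalog view to itself
-- generates more than enough rows to feed TOP
insert into dbo.Test1 (CreatedDate)
select top (2000000) getdate()
from sys.all_objects as a
cross join sys.all_objects as b;

-- 5 child rows per parent
insert into dbo.Test2 (RecordId, ItemId)
select t.RecordId, v.ItemId
from dbo.Test1 as t
cross join (values (1), (2), (3), (4), (5)) as v(ItemId);
```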
So next we return all records in two queries, one with an inner join and one with an outer join.
select a.RecordId, b.ItemId
from dbo.Test1 as a
join dbo.Test2 as b
    on b.RecordId = a.RecordId;

select a.RecordId, b.ItemId
from dbo.Test1 as a
left join dbo.Test2 as b
    on b.RecordId = a.RecordId;
The two queries run in the same amount of time with virtually the same execution plan. Since Test1 has a clustered index, the key column is already sorted, so a Merge Join is used. In this case we’re running down the clustered index and finding a match for each record in the Test2 table, so it makes sense that the same amount of work would be done either way.
Dropping the clustered primary key on Test1 results in a different execution plan, where a Hash join is used instead of the Merge join, but the run time is still the same for both queries.
Deleting every 4th row from the detail table resulted in the inner join running slightly faster, but it was a 52-49 split in the execution plan.
I was a little surprised at first that there wouldn’t be a difference in query performance between the two join types, but it makes sense that the optimizer arrives at the same plan for both in this case. Inner versus outer join makes a big difference logically in which records are returned, but if both joins return the same records, then the plans turn out to be the same, and the execution time is the same as well.