One recurring argument that comes up with databases is how to choose a primary key for a table. Should a natural key (which may be a string value) be used, or should an integer surrogate key be generated? In terms of query performance, the common wisdom is that integer keys will perform better. I’ve always been doubtful about that claim, although I can see how performance would differ if the key values were of different sizes.
I ran across this post that compared string vs integer key values for query performance.
I wanted to perform my own test to compare the performance. I’ve posted a script on GitHub to build out some test tables. We end up with a Customer table with a little over 2 million records. We’ll use two different queries to join to a State table, one joining on an integer StateId value, the second joining on a string StateCode. The StateId will be a smallint, and the StateCode a char(2) value, so that both columns are 2 bytes. The Customer table has a clustered index on the Customer ID, with nonclustered indexes on both the StateId and StateCode columns. The State table is a heap (it only has 62 records), since ordering it by either StateId or StateCode could give that type an advantage.
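To make the setup concrete, here is a minimal T-SQL sketch of the schema and the two queries as described above. It is not the script from GitHub: the table, column, index, and constraint names, and the StateName and CustomerName columns, are my assumptions for illustration.

```sql
-- Hypothetical names throughout; the actual test script may differ.

-- State lookup table: no primary key or clustered index, so it is a heap.
CREATE TABLE dbo.[State]
(
    StateId   smallint    NOT NULL,  -- 2 bytes
    StateCode char(2)     NOT NULL,  -- 2 bytes
    StateName varchar(50) NOT NULL   -- assumed descriptive column
);

-- Customer table: clustered on the customer ID.
CREATE TABLE dbo.Customer
(
    CustomerId   int IDENTITY(1,1) NOT NULL,
    CustomerName varchar(100)      NOT NULL,  -- assumed attribute
    StateId      smallint          NOT NULL,
    StateCode    char(2)           NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (CustomerId)
);

-- One nonclustered index per join column.
CREATE NONCLUSTERED INDEX IX_Customer_StateId   ON dbo.Customer (StateId);
CREATE NONCLUSTERED INDEX IX_Customer_StateCode ON dbo.Customer (StateCode);

-- Query 1: join on the integer key.
SELECT c.CustomerId, c.CustomerName, s.StateName
FROM dbo.Customer AS c
INNER JOIN dbo.[State] AS s ON s.StateId = c.StateId;

-- Query 2: join on the string key.
SELECT c.CustomerId, c.CustomerName, s.StateName
FROM dbo.Customer AS c
INNER JOIN dbo.[State] AS s ON s.StateCode = c.StateCode;
```

Running both queries in one batch with the actual execution plan enabled lets you compare them side by side.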
Running the two queries together, the first (joining on StateId) accounts for 46% of the run time, and the StateCode string join accounts for the other 54%. So the integer column key is faster in this case. However, the query plan shows that Hash joins were used, so we can certainly get a more efficient plan.
I went back and added INCLUDE columns to the two Customer indexes, so that all of the returned attributes are in the index leaf level. Once I made this change, the query plan showed that both queries used Merge joins, and the query cost was the same for both.
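As a rough sketch of that change, the covering indexes could look like the following. The index names and the CustomerName column are hypothetical, and the statements assume that indexes with these names already exist on the table.

```sql
-- Hypothetical names; assumes both indexes already exist on dbo.Customer.
-- DROP_EXISTING = ON rebuilds an existing index with the new definition
-- and raises an error if no index with that name exists.

CREATE NONCLUSTERED INDEX IX_Customer_StateId
    ON dbo.Customer (StateId)
    INCLUDE (CustomerName)          -- assumed returned attribute
    WITH (DROP_EXISTING = ON);

CREATE NONCLUSTERED INDEX IX_Customer_StateCode
    ON dbo.Customer (StateCode)
    INCLUDE (CustomerName)          -- assumed returned attribute
    WITH (DROP_EXISTING = ON);
```

The clustered key (CustomerId here) is carried in every nonclustered index row automatically, so it does not need to be listed in INCLUDE. A likely reason the plan changes is that each index now covers its query: the Customer rows can be read in join-key order without lookups into the clustered index, which makes a Merge join attractive to the optimizer.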
So my conclusion is that there is no difference in query performance between joining on strings and joining on integers, assuming that the values are the same size. I need to do more research on Hash joins, but it appears that a string value hash may be larger than an integer value hash, which would account for the difference in query performance in the first test.
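One way to dig into the Hash join question would be to force the join type with a query hint and compare the two key types under the same physical join. This is only a sketch of that follow-up, not something I ran for this post, and the table and column names are again hypothetical.

```sql
-- Hypothetical names; a sketch of a follow-up experiment.
SET STATISTICS TIME ON;   -- report CPU and elapsed time per statement

-- Force a Hash join for both key types.
SELECT c.CustomerId, c.CustomerName, s.StateName
FROM dbo.Customer AS c
INNER JOIN dbo.[State] AS s ON s.StateId = c.StateId
OPTION (HASH JOIN);

SELECT c.CustomerId, c.CustomerName, s.StateName
FROM dbo.Customer AS c
INNER JOIN dbo.[State] AS s ON s.StateCode = c.StateCode
OPTION (HASH JOIN);

-- Force a Merge join for both key types.
SELECT c.CustomerId, c.CustomerName, s.StateName
FROM dbo.Customer AS c
INNER JOIN dbo.[State] AS s ON s.StateId = c.StateId
OPTION (MERGE JOIN);

SELECT c.CustomerId, c.CustomerName, s.StateName
FROM dbo.Customer AS c
INNER JOIN dbo.[State] AS s ON s.StateCode = c.StateCode
OPTION (MERGE JOIN);

SET STATISTICS TIME OFF;
```

If the gap between the integer and string joins shows up only in the Hash join pair, that would point at the hashing step rather than at the key type in general.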