T-SQL: Writing high-performance, low-maintenance queries

There once was a time when I would have said that you can't write queries that are both high-performance and easily maintainable. I believed that temp tables were a crutch for the weak, and that a gigantic query with lots of nested subqueries would run more efficiently than the same query broken into multiple parts, using intermediate temp tables to pass information from one query to the next. And now I'm here to tell you that I was wrong.

Experience has taught me that big, monolithic queries are not only much more difficult to maintain, they run slower, too. You see, a given query can be executed in a multitude of different ways, and SQL Server analyzes each query by trying to guess which combination of ways will be the least expensive to execute. The problem with giving SQL Server too big of a query is that the possible combinations to analyze become too many, and the analyzer gets "confused", often choosing a sub-optimal execution plan. The solution is to decompose that big query into a series of smaller ones, which is the main topic of this article.

To illustrate the concept of query decomposition, let's start with a contrived customer-order database query:

select c.Name, o.Amount_sum
from Customers c
join (
  select CustomerID, sum( Amount ) as Amount_sum
  from Orders
  group by CustomerID
) o on o.CustomerID = c.ID

Ordinarily, a query this trivial wouldn't need to be decomposed because it's already pretty simple in spite of containing a derived table, so let's pretend that this is just part of a much larger query, and we're only looking at one tiny part, OK?

The first thing I always try when I need to boost the performance of large queries with lots of nested queries like this is perform what amounts to a refactoring that I'll name Extract Derived Table. The steps are:

  1. Create a temp table with a schema matching that of the derived table's columns
    create table #orders
    (
      CustomerID int,
      Amount_sum money,
      primary key ( CustomerID )
    )
    
    select c.Name, o.Amount_sum
    from Customers c
    join (
      select CustomerID, sum( Amount ) as Amount_sum
      from Orders
      group by CustomerID
    ) o on o.CustomerID = c.ID
  2. Copy the derived table statement and use it to insert into the temp table
    create table #orders
    (
      CustomerID int,
      Amount_sum money,
      primary key ( CustomerID )
    )
    
    insert #orders
    select CustomerID, sum( Amount ) as Amount_sum
    from Orders
    group by CustomerID
    
    select c.Name, o.Amount_sum
    from Customers c
    join (
      select CustomerID, sum( Amount ) as Amount_sum
      from Orders
      group by CustomerID
    ) o on o.CustomerID = c.ID
  3. Replace the derived table statement with the temp table in the outer select statement
    create table #orders
    (
      CustomerID int,
      Amount_sum money,
      primary key ( CustomerID )
    )
    
    insert #orders
    select CustomerID, sum( Amount ) as Amount_sum
    from Orders
    group by CustomerID
    
    select c.Name, o.Amount_sum
    from Customers c
    join #orders o on o.CustomerID = c.ID

One objection that an onlooker might raise at this point is that the result of the refactoring is longer than the original implementation. This is true, but in my experience, the tradeoff is worth it, both in performance as well as readability, as this technique eliminates levels of indentation, which become a significant obstacle to readability in queries with many nested subqueries.

Now let's throw ourselves a curveball: what if the final query has a WHERE clause that would have served to reduce the number of rows returned by the inner query as well. SQL Server is smart enough to figure out how to optimize the execution of the inner query when the outer query is restricted, but not after we've decomposed the queries like this. Let's start by just slapping a WHERE clause on the final query:

select c.Name, o.Amount_sum
from Customers c
join #orders o on o.CustomerID = c.ID
where c.Name like @Name

What's going to happen now is that the entire Orders table will be summarized into the #orders temp table, but then only a single row out of it will be used in the final output. Seems like a waste, doesn't it? We could fix it by putting something like this...

insert #orders
select o.CustomerID, sum( o.Amount ) as Amount_sum
from Orders o
join Customers c on c.ID = o.CustomerID
where c.Name like @Name
group by o.CustomerID

select c.Name, o.Amount_sum
from Customers c
join #orders o on o.CustomerID = c.ID
where c.Name like @Name

...but that would be wrong, because now we're duplicating logic, and that could be a source of bugs later. I believe that the most desirable solution in this case is to perform a second Extract Derived Table on the top-level query, and use the results of that query to restrict the other one:

create table #customers
(
  ID int,
  Name varchar(128),
  primary key ( ID )
)

create table #orders
(
  CustomerID int,
  Amount_sum money,
  primary key ( CustomerID )
)

insert #customers
select ID, Name
from Customers
where Name like @Name

insert #orders
select o.CustomerID, sum( o.Amount ) as Amount_sum
from #customers c
join Orders o on o.CustomerID = c.ID
group by o.CustomerID

select c.Name, o.Amount_sum
from #customers c
join #orders o on o.CustomerID = c.ID

The #customers table is populated before the #orders table, and then the contents of the #customers table is used to restrict the contents of the #orders table by an inner join. In this way, no logic is duplicated, and the restriction criteria belonging to each table (or group of tables) is kept together.

In closing, I'd like to emphasize that for the example query we started with, decomposing it like this would be overkill. However, for larger, more complex queries, I have applied this technique with great success, and I recommend applying it any time you need to speed up a query that is already property supported with good indexes, but for some reason just isn't performing as expected or desired.

comments powered by Disqus