There once was a time when I would have said that you can't write queries that are both high-performance and easily maintainable. I believed that temp tables were a crutch for the weak, and that a gigantic query with lots of nested subqueries would run more efficiently than the same query broken into multiple parts, using intermediate temp tables to pass information from one query to the next. And now I'm here to tell you that I was wrong.
Experience has taught me that big, monolithic queries are not only much more difficult to maintain, they run slower, too. You see, a given query can be executed in a multitude of different ways, and SQL Server analyzes each query by trying to guess which combination of ways will be the least expensive to execute. The problem with giving SQL Server too big of a query is that the possible combinations to analyze become too many, and the analyzer gets "confused", often choosing a sub-optimal execution plan. The solution is to decompose that big query into a series of smaller ones, which is the main topic of this article.
To illustrate the concept of query decomposition, let's start with a contrived customer-order database query:
select c.Name, o.Amount_sum from Customers c join ( select CustomerID, sum( Amount ) as Amount_sum from Orders group by CustomerID ) o on o.CustomerID = c.ID
Ordinarily, a query this trivial wouldn't need to be decomposed because it's already pretty simple in spite of containing a derived table, so let's pretend that this is just part of a much larger query, and we're only looking at one tiny part, OK?
The first thing I always try when I need to boost the performance of large queries with lots of nested queries like this is perform what amounts to a refactoring that I'll name Extract Derived Table. The steps are:
create table #orders ( CustomerID int, Amount_sum money, primary key ( CustomerID ) ) select c.Name, o.Amount_sum from Customers c join ( select CustomerID, sum( Amount ) as Amount_sum from Orders group by CustomerID ) o on o.CustomerID = c.ID
create table #orders ( CustomerID int, Amount_sum money, primary key ( CustomerID ) ) insert #orders select CustomerID, sum( Amount ) as Amount_sum from Orders group by CustomerID select c.Name, o.Amount_sum from Customers c join ( select CustomerID, sum( Amount ) as Amount_sum from Orders group by CustomerID ) o on o.CustomerID = c.ID
create table #orders ( CustomerID int, Amount_sum money, primary key ( CustomerID ) ) insert #orders select CustomerID, sum( Amount ) as Amount_sum from Orders group by CustomerID select c.Name, o.Amount_sum from Customers c join #orders o on o.CustomerID = c.ID
One objection that an onlooker might raise at this point is that the result of the refactoring is longer than the original implementation. This is true, but in my experience, the tradeoff is worth it, both in performance as well as readability, as this technique eliminates levels of indentation, which become a significant obstacle to readability in queries with many nested subqueries.
Now let's throw ourselves a curveball: what if the final query has a WHERE clause that would have served to reduce the number of rows returned by the inner query as well. SQL Server is smart enough to figure out how to optimize the execution of the inner query when the outer query is restricted, but not after we've decomposed the queries like this. Let's start by just slapping a WHERE clause on the final query:
select c.Name, o.Amount_sum from Customers c join #orders o on o.CustomerID = c.ID where c.Name like @Name
What's going to happen now is that the entire Orders table will be summarized into the #orders temp table, but then only a single row out of it will be used in the final output. Seems like a waste, doesn't it? We could fix it by putting something like this...
insert #orders select o.CustomerID, sum( o.Amount ) as Amount_sum from Orders o join Customers c on c.ID = o.CustomerID where c.Name like @Name group by o.CustomerID select c.Name, o.Amount_sum from Customers c join #orders o on o.CustomerID = c.ID where c.Name like @Name
...but that would be wrong, because now we're duplicating logic, and that could be a source of bugs later. I believe that the most desirable solution in this case is to perform a second Extract Derived Table on the top-level query, and use the results of that query to restrict the other one:
create table #customers ( ID int, Name varchar(128), primary key ( ID ) ) create table #orders ( CustomerID int, Amount_sum money, primary key ( CustomerID ) ) insert #customers select ID, Name from Customers where Name like @Name insert #orders select o.CustomerID, sum( o.Amount ) as Amount_sum from #customers c join Orders o on o.CustomerID = c.ID group by o.CustomerID select c.Name, o.Amount_sum from #customers c join #orders o on o.CustomerID = c.ID
The #customers table is populated before the #orders table, and then the contents of the #customers table is used to restrict the contents of the #orders table by an inner join. In this way, no logic is duplicated, and the restriction criteria belonging to each table (or group of tables) is kept together.
In closing, I'd like to emphasize that for the example query we started with, decomposing it like this would be overkill. However, for larger, more complex queries, I have applied this technique with great success, and I recommend applying it any time you need to speed up a query that is already property supported with good indexes, but for some reason just isn't performing as expected or desired.
Remember Me
a@href@title, i, strike, u
Copyright © 2003-2008 Falafel Software Inc.
Subscribe to Falafel Blogs
The opinions expressed herein are Falafel's employees own personal opinions and do not represent Falafel Software's view in any way in case they go bananas!