navigation
 Wednesday, April 30, 2008

When writing queries that group and aggregate, it is not uncommon for there to be many columns in the SELECT clause that are neither in the GROUP BY clause nor being aggregated. For example, in a customer orders database, suppose you want a query that returns customer names, telephone numbers, and total order amount. A correct, but somewhat naive solution would go like this:

select c.Name, c.Phone, sum( o.Amount ) as Amount_sum
from Customers c
join Orders o on o.CustomerID = c.CustomerID
group by c.Name, c.Phone

This will return the desired results, but in my experience, will not deliver optimal performance. I can think of two better-performing approaches. One seems more explicit about what the desired output is, but is more verbose:

select c.Name, c.Phone, o.Amount_sum
from Customers c
join (
  select CustomerID, sum( Amount ) as Amount_sum
  from Orders
  group by CustomerID
) o on o.CustomerID = c.CustomerID

This approach pushes all the aggregation into a subquery and then joins to the outer, non-aggregated tables. Despite appearances, this approach does not process the entire Orders table; SQL Server will only evaluate the inner query for values in the outer one, so if there was a WHERE clause restricting the customers appearing in the outer query, only those customers' order totals would be evaluated in the subquery.

The other approach is more of a quick and dirty solution. It executes just as quickly and is less verbose, but the meaning is slightly obscured:

select min( c.Name ) as Name, min( c.Phone ) as Phone, sum( o.Amount ) as Amount_sum
from Customers c
join Orders o on o.CustomerID = c.CustomerID
group by c.CustomerID

In this approach, grouping is performed on a unique index on one table, and any other column in the same table is guaranteed to contain the same value in that group, so an aggregate like MIN() or MAX() will return the only possible value in that row. My experience has been that this approach performs faster than the first approach of grouping by every column that in not aggregated. However, this technique does not communicate programmer intent as well as the second approach, and should either be documented with a comment or with good naming conventions. For example, when the resulting value really is an aggregate above, the column is named <base column>_<aggregate>, and when the resulting value is the only value in that grouping, the column is simply named after the base column.

In summary, both techniques have merit in different situations. As a general guideline, I would recommend using the more explicit technique of pushing aggregations into their own subqueries when writing a query from scratch or modifying a fairly simple query. The second, less verbose approach is a quick and easy way to add columns to an existing, complex grouped query with low risk of altering the query's pre-existing behavior, which is important in maintenance scenarios.