navigation
 Monday, April 07, 2008

I think everyone will agree that code reuse is a good thing. In T-SQL, one way to write reusable code is with user-defined functions (UDFs). UDFs come in three flavors:

  1. Scalar: Returns a single value
  2. Table: Returns a table
  3. Inline: Also returns a table, but may contain only a single statement, like a view, but with parameters.

All three of these have their uses, but today's topic is all about a common mistake people make when they write UDFs, specifically scalar UDFs. Let's look at a simple example. Consider that you have a typical customer orders database, and you've decided to encapsulate the logic of computing a the sum of a customer's orders. Of course, in real life, this is a trivial thing probably not worth abstracting to its own dedicated UDF, but let's pretend that it's something really clever and significant. A scalar UDF would look something like this:

CREATE FUNCTION dbo.Customer_Get_OrderTotal( @CustomerID int )
RETURNS money
AS
BEGIN
  declare @result money
  
  select @result = sum( Quantity * Price )
  from Orders
  where CustomerID = @CustomerID
  
  RETURN @result
END

And it would be used like this:

select Name, dbo.Customer_Get_OrderTotal( CustomerID ) as OrderTotal
from Customers

Can you see the performance problem here? The problem isn't just that a scalar function is being used; it's that the scalar function performs its own table access once for each row in the calling statement. If there are 100 customers in the database, that will be 100 select statements executed on the Orders table; if there are 100,000 customers, that's 100,000 select statements on Orders! This is a very inefficient way to access data, and it will show in the query's performance as the Customer table grows. As a general rule, scalar functions should not access other tables. Now, let's see how to rewrite the above function as a table-valued function that will access the Orders table only once, regardless of the number of rows in Customers.

CREATE FUNCTION dbo.Customer_Get_OrderTotals()
RETURNS @result
(
  CustomerID int,
  OrderTotal money,
  primary key ( CustomerID )
)
AS
BEGIN
    insert @result
    select CustomerID, sum( Quantity * Price )
    from Orders
    group by CustomerID
    
  RETURN
END

Here, OrderTotal is computed exactly the same way as before, but now instead of taking CustomerID as a parameter as before, the function computes the order total for every customer and returns the result as a table. This function would get used like this:

select c.Name, ot.OrderTotal
from Customer c
join cbo.Customer_Get_OrderTotals() ot on ot.CustomerID = c.CustomerID

In this scenario, there is only one table access done on the Orders table regardless of how many rows are in the Customer table. But wait, there's a performance problem with this solution, too. What if you only wanted to select a single row with this query, like this:

select c.Name, ot.OrderTotal
from Customer c
join cbo.Customer_Get_OrderTotals() ot on ot.CustomerID = c.CustomerID
where c.CustomerID = @CustomerID

Now we're in a pickle! The table-valued UDF is going to compute the order total for each customer, even though in the end only one of those totals will be used. Thankfully, there is a solution that will not involve trying to predict how many Customer rows will be needed and then deciding on the fly whether to use a scalar or table-valued function. Inline functions to the rescue!

Both table-valued functions and inline functions return tables as results. The difference is that table-valued functions are allowed to contain multiple statements, while inline functions are only allowed to contain a single statement returning a result table. In return, what you get is SQL Server treating the inline function as if it's part of the calling statement and only evaluating the rows needed by the calling statement. You can think of it as if SQL Server replaces the inline function call with the body of the function and evaluates the resulting statement; thus the name "inline" function. Due to the restrictions on inline functions, you won't always be able to rewrite table-valued functions as inline ones, and you'll have to resort to passing parameters that the table-valued function can use to restrict the rows returned. Fortunately for us, this function is a natural fit for inline functions. This is how it would be written:

CREATE FUNCTION dbo.Customer_Get_OrderTotals()
RETURNS TABLE
AS
RETURN
select CustomerID, sum( Quantity * Price )
from Orders
group by CustomerID

It would be called exactly the same way as before. The difference is that now, if only a single customer row is selected, the function will only evaluate for the since CustomerID, but if the entire customer table is selected, the function will evaluate all of the necessary CustomerIDs in a single table access.

In closing:

  • Use scalar functions for transforming the input parameters into output, but in general, do not perform data access from withing a scalar function
  • Use table-valued functions for encapsulating data access patterns. Prefer inline functions over table-valued functions when possible, and if using an inline function is impossible, make sure to pass parameters that the table-valued function can use to restrict the result set to avoid unnecessary processing.
Name
E-mail
Home page

Comment (Some html is allowed: a@href@title, i, strike, u) where the @ means "attribute." For example, you can use <a href="" title=""> or <blockquote cite="Scott">.  

Enter the code shown (prevents robots):

Live Comment Preview