I think everyone will agree that code reuse is a good thing. In T-SQL, one way to write reusable code is with user-defined functions (UDFs). UDFs come in three flavors:
- Scalar: Returns a single value
- Table: Returns a table
- Inline: Also returns a table, but may contain only a single statement, like a view, but with parameters.
All three of these have their uses, but today's topic is all about a common mistake people make when they write UDFs, specifically scalar UDFs. Let's look at a simple example. Consider that you have a typical customer orders database, and you've decided to encapsulate the logic of computing a the sum of a customer's orders. Of course, in real life, this is a trivial thing probably not worth abstracting to its own dedicated UDF, but let's pretend that it's something really clever and significant. A scalar UDF would look something like this:
CREATE FUNCTION dbo.Customer_Get_OrderTotal( @CustomerID int )
RETURNS money
AS
BEGIN
declare @result money
select @result = sum( Quantity * Price )
from Orders
where CustomerID = @CustomerID
RETURN @result
END
And it would be used like this:
select Name, dbo.Customer_Get_OrderTotal( CustomerID ) as OrderTotal
from Customers
Can you see the performance problem here? The problem isn't just that a scalar function is being used; it's that the scalar function performs its own table access once for each row in the calling statement. If there are 100 customers in the database, that will be 100 select statements executed on the Orders table; if there are 100,000 customers, that's 100,000 select statements on Orders! This is a very inefficient way to access data, and it will show in the query's performance as the Customer table grows. As a general rule, scalar functions should not access other tables. Now, let's see how to rewrite the above function as a table-valued function that will access the Orders table only once, regardless of the number of rows in Customers.
CREATE FUNCTION dbo.Customer_Get_OrderTotals()
RETURNS @result
(
CustomerID int,
OrderTotal money,
primary key ( CustomerID )
)
AS
BEGIN
insert @result
select CustomerID, sum( Quantity * Price )
from Orders
group by CustomerID
RETURN
END
Here, OrderTotal is computed exactly the same way as before, but now instead of taking CustomerID as a parameter as before, the function computes the order total for every customer and returns the result as a table. This function would get used like this:
select c.Name, ot.OrderTotal
from Customer c
join cbo.Customer_Get_OrderTotals() ot on ot.CustomerID = c.CustomerID
In this scenario, there is only one table access done on the Orders table regardless of how many rows are in the Customer table. But wait, there's a performance problem with this solution, too. What if you only wanted to select a single row with this query, like this:
select c.Name, ot.OrderTotal
from Customer c
join cbo.Customer_Get_OrderTotals() ot on ot.CustomerID = c.CustomerID
where c.CustomerID = @CustomerID
Now we're in a pickle! The table-valued UDF is going to compute the order total for each customer, even though in the end only one of those totals will be used. Thankfully, there is a solution that will not involve trying to predict how many Customer rows will be needed and then deciding on the fly whether to use a scalar or table-valued function. Inline functions to the rescue!
Both table-valued functions and inline functions return tables as results. The difference is that table-valued functions are allowed to contain multiple statements, while inline functions are only allowed to contain a single statement returning a result table. In return, what you get is SQL Server treating the inline function as if it's part of the calling statement and only evaluating the rows needed by the calling statement. You can think of it as if SQL Server replaces the inline function call with the body of the function and evaluates the resulting statement; thus the name "inline" function. Due to the restrictions on inline functions, you won't always be able to rewrite table-valued functions as inline ones, and you'll have to resort to passing parameters that the table-valued function can use to restrict the rows returned. Fortunately for us, this function is a natural fit for inline functions. This is how it would be written:
CREATE FUNCTION dbo.Customer_Get_OrderTotals()
RETURNS TABLE
AS
RETURN
select CustomerID, sum( Quantity * Price )
from Orders
group by CustomerID
It would be called exactly the same way as before. The difference is that now, if only a single customer row is selected, the function will only evaluate for the since CustomerID, but if the entire customer table is selected, the function will evaluate all of the necessary CustomerIDs in a single table access.
In closing:
- Use scalar functions for transforming the input parameters into output, but in general, do not perform data access from withing a scalar function
- Use table-valued functions for encapsulating data access patterns. Prefer inline functions over table-valued functions when possible, and if using an inline function is impossible, make sure to pass parameters that the table-valued function can use to restrict the result set to avoid unnecessary processing.