SQL Server 2005 Beta 2 Transact-SQL Enhancements

 

Itzik Ben-Gan
Solid Quality Learning

December, 2004

Applies to:
   Transact-SQL
   Microsoft SQL Server 2005 Beta 2

Summary: This white paper introduces several of the new enhancements to Transact-SQL in Microsoft SQL Server 2005 Beta 2. These new features can increase your expressive power, the performance of your queries, and your error management capabilities. This paper focuses mainly on relational enhancements that are conceptually new and demonstrates these through practical examples. This paper does not cover every new Transact-SQL feature. (91 printed pages)

Contents

Introduction and Scope
Increasing the Expressive Power of Queries and DRI Support
Partitioning
Single-Parent Environment: Employees Organizational Chart
Multiparent Environment: Bill of Materials
Table-Valued Functions in Correlated Subqueries
Enhancements for Performance and Error Handling
Other SQL Server 2005 Beta 2 Capabilities That Affect Transact-SQL
Conclusion

Introduction and Scope

This white paper introduces several of the new enhancements to Transact-SQL in Microsoft SQL Server 2005 Beta 2. These new features can increase your expressive power, the performance of your queries, and your error management capabilities. This paper focuses mainly on relational enhancements that are conceptually new and demonstrates these through practical examples. This paper does not cover every new Transact-SQL feature.

Assumed knowledge: The target audience is expected to be skilled at using Transact-SQL for ad hoc queries and as components of applications in Microsoft SQL Server 2000.

Increasing the Expressive Power of Queries and DRI Support

This section introduces the following new relational features and enhancements:

  • New ranking functions
  • New recursive queries based on common table expressions (CTE)
  • New PIVOT and APPLY relational operators
  • Declarative referential integrity (DRI) enhancements

Ranking Functions

SQL Server 2005 introduces four new ranking functions: ROW_NUMBER, RANK, DENSE_RANK and NTILE. The new functions allow you to efficiently analyze data and provide ranking values to result rows of a query. Typical scenarios where you may find the new functions helpful include: assigning sequential integers to result rows for presentation purposes, paging, scoring, and histograms.

Speaker Statistics Scenario

The following Speaker Statistics scenario will be used to discuss and demonstrate the different functions and their clauses. A large computing conference included three tracks: database, development, and system administration. Eleven speakers spoke at the conference and received scores in the range 1 through 9 for their sessions. The results were summarized and stored in the following SpeakerStats table:

USE tempdb -- or your own test database
CREATE TABLE SpeakerStats
(
  speaker        VARCHAR(10) NOT NULL PRIMARY KEY,
  track          VARCHAR(10) NOT NULL,
  score          INT         NOT NULL,
  pctfilledevals INT         NOT NULL,
  numsessions    INT         NOT NULL
)

SET NOCOUNT ON
INSERT INTO SpeakerStats VALUES('Dan',     'Sys', 3, 22, 4)
INSERT INTO SpeakerStats VALUES('Ron',     'Dev', 9, 30, 3)
INSERT INTO SpeakerStats VALUES('Kathy',   'Sys', 8, 27, 2)
INSERT INTO SpeakerStats VALUES('Suzanne', 'DB',  9, 30, 3)
INSERT INTO SpeakerStats VALUES('Joe',     'Dev', 6, 20, 2)
INSERT INTO SpeakerStats VALUES('Robert',  'Dev', 6, 28, 2)
INSERT INTO SpeakerStats VALUES('Mike',    'DB',  8, 20, 3)
INSERT INTO SpeakerStats VALUES('Michele', 'Sys', 8, 31, 4)
INSERT INTO SpeakerStats VALUES('Jessica', 'Dev', 9, 19, 1)
INSERT INTO SpeakerStats VALUES('Brian',   'Sys', 7, 22, 3)
INSERT INTO SpeakerStats VALUES('Kevin',   'DB',  7, 25, 4)

Each speaker has one row in the table with the speaker's name, track, average score, percentage of attendees who filled out evaluations relative to the number of attendees at the sessions, and the number of sessions delivered by the speaker. This section demonstrates how to analyze the speaker statistics data to generate useful information using the new ranking functions.

Semantics

All four ranking functions follow a similar syntax pattern:

Ranking Function

<function_name>() OVER(
  [PARTITION BY <partition_by_list>]
  ORDER BY <order_by_list>)

The function can be specified in only two clauses of a query: the SELECT clause or the ORDER BY clause.
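
For example, here is a minimal sketch (using the SpeakerStats table) of a ranking function used directly in the ORDER BY clause rather than in the SELECT list:

SELECT speaker, track, score
FROM SpeakerStats
-- Sort the result by the computed ranking value; the function
-- itself does not have to appear in the SELECT list
ORDER BY ROW_NUMBER() OVER(ORDER BY score DESC, speaker)

The following sections discuss the different functions in detail.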

ROW_NUMBER

The ROW_NUMBER function allows you to provide sequential integer values to result rows of a query. For example, suppose you wanted to return the speaker, track, and score of all speakers, assigning sequential values from 1 and on to the result rows according to descending score order. The following query generates the desired results by using the ROW_NUMBER function, specifying OVER (ORDER BY score DESC):

SELECT ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum, 
  speaker, track, score
FROM SpeakerStats
ORDER BY score DESC

Here is the result set:

rownum speaker    track      score
------ ---------- ---------- -----------
1      Jessica    Dev        9
2      Ron        Dev        9
3      Suzanne    DB         9
4      Kathy      Sys        8
5      Michele    Sys        8
6      Mike       DB         8
7      Kevin      DB         7
8      Brian      Sys        7
9      Joe        Dev        6
10     Robert     Dev        6
11     Dan        Sys        3

The speaker with the highest score got row number 1, and the speaker with the lowest score got row number 11. ROW_NUMBER always assigns distinct row numbers to different rows according to the requested sort. Note that if the ORDER BY list specified within the OVER() option is not unique, the result is nondeterministic. This means that there's more than one correct result for the query; in different invocations of the query you might get different results. For example, in our case three different speakers got the same highest score (9): Jessica, Ron, and Suzanne. Because SQL Server has to assign different row numbers to the different speakers, you should assume that the values 1, 2, and 3 assigned to Jessica, Ron, and Suzanne, respectively, were assigned in arbitrary order among those speakers. The result would have been just as correct if the values 1, 2, and 3 were assigned to Ron, Suzanne, and Jessica, respectively.

If you specify a unique ORDER BY list, the result is always deterministic. For example, suppose that in the case of a tie between speakers based on score you want to use the highest pctfilledevals value as the tiebreaker. If there's still a tie, use the highest numsessions value as the tiebreaker. Finally, if there's still a tie, use the lowest dictionary-order speaker name as a tiebreaker. Since the ORDER BY list—score, pctfilledevals, numsessions, and speaker—is unique, the result is deterministic:

SELECT ROW_NUMBER() OVER(ORDER BY score DESC, pctfilledevals DESC,
                           numsessions DESC, speaker) AS rownum, 
  speaker, track, score, pctfilledevals, numsessions
FROM SpeakerStats
ORDER BY score DESC, pctfilledevals DESC, numsessions DESC, speaker

Here is the result set:

rownum speaker    track      score       pctfilledevals numsessions
------ ---------- ---------- ----------- -------------- -----------
1      Ron        Dev        9           30             3
2      Suzanne    DB         9           30             3
3      Jessica    Dev        9           19             1
4      Michele    Sys        8           31             4
5      Kathy      Sys        8           27             2
6      Mike       DB         8           20             3
7      Kevin      DB         7           25             4
8      Brian      Sys        7           22             3
9      Robert     Dev        6           28             2
10     Joe        Dev        6           20             2
11     Dan        Sys        3           22             4

One of the important benefits of the new ranking functions is their efficiency. SQL Server's optimizer needs to scan the data only once in order to calculate the values. It does this either by using an ordered scan of an index placed on the sort columns or by scanning the data once and sorting it if an appropriate index was not created.
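
For example, the following index on the sort columns (the index name is illustrative) can allow the previous query's row numbers to be calculated with a single ordered index scan, avoiding an explicit sort step:

-- An index matching the OVER(ORDER BY ...) list can support
-- an ordered scan for the ranking calculation
CREATE INDEX idx_score_evals_sessions_speaker
  ON SpeakerStats(score DESC, pctfilledevals DESC, numsessions DESC, speaker)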

Another benefit is the simplicity of the syntax. To give you a sense of how difficult and inefficient it is to calculate ranking values by using the set-based approach used in earlier releases of SQL Server, consider the following SQL Server 2000 query, which returns the same results as the previous query:

SELECT
  (SELECT COUNT(*)
   FROM SpeakerStats AS S2
   WHERE S2.score > S1.score
     OR (S2.score = S1.score
         AND S2.pctfilledevals > S1.pctfilledevals)
     OR (S2.score = S1.score
         AND S2.pctfilledevals = S1.pctfilledevals
         AND S2.numsessions > S1.numsessions)
     OR (S2.score = S1.score
         AND S2.pctfilledevals = S1.pctfilledevals
         AND S2.numsessions = S1.numsessions
         AND S2.speaker < S1.speaker)) + 1 AS rownum,
  speaker, track, score, pctfilledevals, numsessions
FROM SpeakerStats AS S1
ORDER BY score DESC, pctfilledevals DESC, numsessions DESC, speaker

This query is obviously much more complex than the SQL Server 2005 query. Furthermore, for each base row in the SpeakerStats table, SQL Server has to scan all matching rows in another instance of the table. On average, at least about half of the table's rows need to be scanned for each row in the base table. The cost of the SQL Server 2005 query grows linearly with the table size, while the cost of the SQL Server 2000 query grows quadratically. Even in fairly small tables the performance difference is significant. For example, test the performance of the following queries, which query the SalesOrderHeader table in the AdventureWorks database to calculate row numbers for sales orders according to SalesOrderID order. The SalesOrderHeader table has 31,465 rows. The first query uses the SQL Server 2005 ROW_NUMBER function, while the second query uses the SQL Server 2000 subquery technique:

-- SQL Server 2005 query
SELECT SalesOrderID,
  ROW_NUMBER() OVER(ORDER BY SalesOrderID) AS rownum
FROM Sales.SalesOrderHeader
-- SQL Server 2000 query
SELECT SalesOrderID,
  (SELECT COUNT(*)
   FROM Sales.SalesOrderHeader AS S2
   WHERE S2.SalesOrderID <= S1.SalesOrderID) AS rownum
FROM Sales.SalesOrderHeader AS S1

I ran this test on my laptop (Compaq Presario X1020U, CPU: Centrino 1.4 GHz, RAM: 1 GB, local hard disk). The SQL Server 2005 query finished in only 1 second, while the SQL Server 2000 query finished in about 12 minutes.

A typical application for row numbers is to page through the results of a query. Given a page size in terms of number of rows, and a page number, you need to return the rows that belong to the given page. For example, suppose you want to return the second page of rows from the SpeakerStats table, assuming a page size of three rows, according to score DESC, speaker order. The following query first calculates row numbers according to the specified sort in the derived table D, then it filters only rows with the row numbers 4 through 6, which belong to the second page:

SELECT *
FROM (SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum, 
        speaker, track, score
      FROM SpeakerStats) AS D
WHERE rownum BETWEEN 4 AND 6
ORDER BY score DESC, speaker

Here is the result set:

rownum speaker    track      score
------ ---------- ---------- -----------
4      Kathy      Sys        8
5      Michele    Sys        8
6      Mike       DB         8

In more generic terms, given a page number in the @pagenum variable, and a page size in the @pagesize variable, the following query returns the rows that belong to the desired page:

DECLARE @pagenum AS INT, @pagesize AS INT
SET @pagenum = 2
SET @pagesize = 3
SELECT *
FROM (SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum, 
        speaker, track, score
      FROM SpeakerStats) AS D
WHERE rownum BETWEEN (@pagenum-1)*@pagesize+1 AND @pagenum*@pagesize
ORDER BY score DESC, speaker

The above approach is adequate for ad hoc requests when you're only interested in one specific page of the rows. However, this approach is not adequate when the user issues multiple requests because each invocation of the query costs you a complete scan of the table in order to calculate the row numbers. For more efficient paging when the user might repeatedly request different pages, first populate a temporary table with all of the base table rows including calculated row numbers and index the column containing the row numbers:

SELECT ROW_NUMBER() OVER(ORDER BY score DESC, speaker) AS rownum, *
INTO #SpeakerStatsRN
FROM SpeakerStats
CREATE UNIQUE CLUSTERED INDEX idx_uc_rownum ON #SpeakerStatsRN(rownum)

And then for each requested page issue the following query:

SELECT rownum, speaker, track, score
FROM #SpeakerStatsRN
WHERE rownum BETWEEN (@pagenum-1)*@pagesize+1 AND @pagenum*@pagesize
ORDER BY score DESC, speaker

Only the rows that belong to the desired page will be scanned.
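
For example, here is a usage sketch that fetches the third page from the indexed temporary table (the variable values are illustrative):

DECLARE @pagenum AS INT, @pagesize AS INT
SET @pagenum = 3
SET @pagesize = 3
SELECT rownum, speaker, track, score
FROM #SpeakerStatsRN
WHERE rownum BETWEEN (@pagenum-1)*@pagesize+1 AND @pagenum*@pagesize
-- rownum already reflects the score DESC, speaker order
ORDER BY rownum
-- When paging is done, you can drop the cached result:
-- DROP TABLE #SpeakerStatsRN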

Partitioning

Ranking values can be calculated within groups of rows independently as opposed to being calculated for all table rows as one group. To do this, use the PARTITION BY clause and specify a list of expressions that identify the groups of rows for which ranking values should be calculated independently. For example, the following query assigns row numbers within each track separately according to score DESC, speaker order:

SELECT track, 
  ROW_NUMBER() OVER(
    PARTITION BY track 
    ORDER BY score DESC, speaker) AS pos, 
  speaker, score
FROM SpeakerStats
ORDER BY track, score DESC, speaker

Here is the result set:

track      pos speaker    score
---------- --- ---------- -----------
DB         1   Suzanne    9
DB         2   Mike       8
DB         3   Kevin      7
Dev        1   Jessica    9
Dev        2   Ron        9
Dev        3   Joe        6
Dev        4   Robert     6
Sys        1   Kathy      8
Sys        2   Michele    8
Sys        3   Brian      7
Sys        4   Dan        3

Specifying the track column in the PARTITION BY clause causes row numbers to be calculated independently for each group of rows with the same track.

RANK, DENSE_RANK

The RANK and DENSE_RANK functions are very similar to the ROW_NUMBER function in the sense that they also provide ranking values according to a specified sort, optionally within groups (partitions) of rows. However, unlike ROW_NUMBER, RANK and DENSE_RANK assign the same ranks to rows with the same values in the sort columns. RANK and DENSE_RANK are useful when you don't want to assign different ranks for rows with the same values in the ORDER BY list when the ORDER BY list is not unique. The purpose of RANK and DENSE_RANK and the difference between the two is best explained with an example. The following query calculates row number, rank, and dense rank values to the different speakers according to score DESC order:

SELECT speaker, track, score,
  ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum,
  RANK() OVER(ORDER BY score DESC) AS rnk,
  DENSE_RANK() OVER(ORDER BY score DESC) AS drnk
FROM SpeakerStats
ORDER BY score DESC

Here is the result set:

speaker    track      score       rownum rnk drnk
---------- ---------- ----------- ------ --- ----
Jessica    Dev        9           1      1   1
Ron        Dev        9           2      1   1
Suzanne    DB         9           3      1   1
Kathy      Sys        8           4      4   2
Michele    Sys        8           5      4   2
Mike       DB         8           6      4   2
Kevin      DB         7           7      7   3
Brian      Sys        7           8      7   3
Joe        Dev        6           9      9   4
Robert     Dev        6           10     9   4
Dan        Sys        3           11     11  5

As discussed earlier, the score column is not unique, so different speakers might have the same score. Row numbers do represent descending score order, but speakers with the same score still get different row numbers. However, notice in the result that all speakers with the same score get the same rank and dense rank values. In other words, ROW_NUMBER is not deterministic when the ORDER BY list is not unique, while RANK and DENSE_RANK are always deterministic. The difference between the two is that a row's rank is the number of rows with a higher score plus one, while its dense rank is the number of distinct higher scores plus one. From what you've learned so far you can deduce that ROW_NUMBER, RANK, and DENSE_RANK produce exactly the same values when the ORDER BY list is unique.
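
You can verify this deduction with a quick sketch that orders by the speaker column, which is unique (it is the table's primary key); all three functions return identical values for every row:

SELECT speaker,
  ROW_NUMBER() OVER(ORDER BY speaker) AS rownum,
  RANK()       OVER(ORDER BY speaker) AS rnk,
  DENSE_RANK() OVER(ORDER BY speaker) AS drnk
FROM SpeakerStats
ORDER BY speaker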

NTILE

NTILE allows you to separate the result rows of a query into a specified number of groups (tiles) according to a specified order. Each group of rows gets a different number starting with 1 for the first group, 2 for the second, and so on. You specify the requested number of groups in the parentheses following the function's name, and the requested sort in the ORDER BY clause of the OVER option. The base number of rows in a group is calculated as total_num_rows / num_groups. If there's a remainder n, the first n groups get one additional row. So not all groups necessarily get an equal number of rows, but group sizes can differ by at most one row. For example, the following query assigns three group numbers to the different speaker rows according to descending score order:

SELECT speaker, track, score,
  ROW_NUMBER() OVER(ORDER BY score DESC) AS rownum,
  NTILE(3) OVER(ORDER BY score DESC) AS tile
FROM SpeakerStats
ORDER BY score DESC

Here is the result set:

speaker    track      score       rownum tile
---------- ---------- ----------- ------ ----
Jessica    Dev        9           1      1
Ron        Dev        9           2      1
Suzanne    DB         9           3      1
Kathy      Sys        8           4      1
Michele    Sys        8           5      2
Mike       DB         8           6      2
Kevin      DB         7           7      2
Brian      Sys        7           8      2
Joe        Dev        6           9      3
Robert     Dev        6           10     3
Dan        Sys        3           11     3

There are 11 speakers in the SpeakerStats table. Dividing 11 by 3 gives you a group size of 3 with a remainder of 2, meaning that the first two groups get an additional row (4 rows in each group) and the third group doesn't (3 rows in the group). Group number (tile number) 1 is assigned to rows 1 through 4, group number 2 is assigned to rows 5 through 8, and group number 3 is assigned to rows 9 through 11. This information allows you to generate a histogram with an even distribution of items in each step. In our case, the first step represents the speakers with the highest scores, the second represents the speakers with the medium scores, and the third represents the speakers with the lowest scores. You can use a CASE expression to provide descriptive, meaningful alternatives to the group numbers:

SELECT speaker, track, score,
  CASE NTILE(3) OVER(ORDER BY score DESC)
    WHEN 1 THEN 'High'
    WHEN 2 THEN 'Medium'
    WHEN 3 THEN 'Low'
  END AS scorecategory
FROM SpeakerStats
ORDER BY track, speaker

Here is the result set:

speaker    track      score       scorecategory
---------- ---------- ----------- -------------
Kevin      DB         7           Medium
Mike       DB         8           Medium
Suzanne    DB         9           High
Jessica    Dev        9           High
Joe        Dev        6           Low
Robert     Dev        6           Low
Ron        Dev        9           High
Brian      Sys        7           Medium
Dan        Sys        3           Low
Kathy      Sys        8           High
Michele    Sys        8           Medium

Recursive Queries and Common Table Expressions

This section explores the subtleties of recursive CTE expressions and applies them as solutions to common problems in a way that greatly simplifies traditional approaches.

Common Table Expressions

A common table expression (CTE) is a temporary named result set that you refer to within the statement that defines it. In their simple form, you can think of CTEs as an improved version of derived tables that more closely resembles a nonpersistent type of view. You refer to a CTE in the FROM clause of a query, similar to the way you refer to derived tables and views. You define the CTE only once and can refer to it several times in your query. In the definition of a CTE, you can refer to variables that are defined in the same batch. You can even use CTEs in INSERT, UPDATE, DELETE, and CREATE VIEW statements, similar to the way you use views. The real power of CTEs, however, is in their recursive capabilities, in which CTEs contain references to themselves. In this paper, CTEs are described first in their simple form and later in their recursive form. This paper covers SELECT queries with CTEs.

You use derived tables when you want to refer to a query result as if it were a table, but you do not want to create a persistent view in the database. Derived tables, however, have a limitation not found in CTEs: You cannot define a derived table once in your query and use it several times. Instead, you must define several derived tables with the same query. You can, however, define a CTE once and use it several times in a query without persisting it in the database.

Before providing a practical example of CTEs, the basic syntax of CTEs is compared to derived tables and views. The following is a general form of a query within a view, derived table, and CTE:

View

CREATE VIEW <view_name>(<column_aliases>)
AS
<view_query>
GO
SELECT *
FROM <view_name>

Derived Table

SELECT *
FROM (<derived_table_query>) AS <derived_table_alias>(<column_aliases>)

CTE

WITH <cte_alias>(<column_aliases>)
AS
(
  <cte_query>
)
SELECT *
FROM <cte_alias>

You provide the CTE with an alias and an optional list of aliases for its result columns following the keyword WITH; write the body of the CTE; and refer to it from the outer query.

Note that if the WITH clause for a CTE is not the first statement in the batch, you should delimit it from the preceding statement by placing a semicolon (;) in front of it. The semicolon is used to avoid ambiguity with other uses of the WITH clause (for example, for table hints). Although you may find that including a semicolon is not necessary in all cases, it is recommended that you use it consistently.
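
The following sketch (the employee ID is illustrative) shows both conventions: a semicolon delimiting the WITH clause from a preceding statement, and a variable referenced inside the CTE's body:

DECLARE @empid AS INT
SET @empid = 261; -- the semicolon delimits the preceding statement
WITH EmpOrders(PurchaseOrderID, OrderDate)
AS
(
  SELECT PurchaseOrderID, OrderDate
  FROM Purchasing.PurchaseOrderHeader
  WHERE EmployeeID = @empid
)
SELECT * FROM EmpOrders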

As a practical example, consider the HumanResources.Employee and Purchasing.PurchaseOrderHeader tables in the AdventureWorks database. Each employee reports to a manager specified in the ManagerID column. Each employee in the Employee table might have related orders in the PurchaseOrderHeader table. Suppose you want to return, for each employee, their count of orders and last order date and, in the same row, similar details for the manager. The following example shows how you can implement a solution using views, derived tables, and CTEs:

View

CREATE VIEW VEmpOrders(EmployeeID, NumOrders, MaxDate)
AS
SELECT EmployeeID, COUNT(*), MAX(OrderDate)
FROM Purchasing.PurchaseOrderHeader
GROUP BY EmployeeID
GO
SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
  E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E
  JOIN VEmpOrders AS OE
    ON E.EmployeeID = OE.EmployeeID
  LEFT OUTER JOIN VEmpOrders AS OM
    ON E.ManagerID = OM.EmployeeID

Derived Tables

SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
  E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E
  JOIN (SELECT EmployeeID, COUNT(*), MAX(OrderDate)
        FROM Purchasing.PurchaseOrderHeader
        GROUP BY EmployeeID) AS OE(EmployeeID, NumOrders, MaxDate)
    ON E.EmployeeID = OE.EmployeeID
  LEFT OUTER JOIN
       (SELECT EmployeeID, COUNT(*), MAX(OrderDate)
        FROM Purchasing.PurchaseOrderHeader
        GROUP BY EmployeeID) AS OM(EmployeeID, NumOrders, MaxDate)
    ON E.ManagerID = OM.EmployeeID

CTE

WITH EmpOrdersCTE(EmployeeID, NumOrders, MaxDate)
AS
(
  SELECT EmployeeID, COUNT(*), MAX(OrderDate)
  FROM Purchasing.PurchaseOrderHeader
  GROUP BY EmployeeID
)
SELECT E.EmployeeID, OE.NumOrders, OE.MaxDate,
  E.ManagerID, OM.NumOrders, OM.MaxDate
FROM HumanResources.Employee AS E
  JOIN EmpOrdersCTE AS OE
    ON E.EmployeeID = OE.EmployeeID
  LEFT OUTER JOIN EmpOrdersCTE AS OM
    ON E.ManagerID = OM.EmployeeID

The CTE's definition must be followed by an outer query, which may or may not refer to it. You cannot refer to the CTE later in the batch, after other intervening statements.

You can define several CTEs in the same WITH clause, each referring to previously defined CTEs. Commas are used to delimit the CTEs. For example, suppose you wanted to calculate the minimum, maximum, and difference of counts of employee orders:

WITH
EmpOrdersCTE(EmployeeID, Cnt)
AS
(
  SELECT EmployeeID, COUNT(*)
  FROM Purchasing.PurchaseOrderHeader
  GROUP BY EmployeeID
),
MinMaxCTE(MN, MX, Diff)
AS
(
  SELECT MIN(Cnt), MAX(Cnt), MAX(Cnt)-MIN(Cnt)
  FROM EmpOrdersCTE
)
SELECT * FROM MinMaxCTE

Here is the result set:

MN          MX          Diff       
----------- ----------- -----------
160         400         240        

In EmpOrdersCTE, you calculate the number of orders from each employee. In MinMaxCTE, you refer to EmpOrdersCTE to calculate the minimum, maximum and difference of counts.

Note Within a CTE, you are not limited to referring only to the CTE defined directly before it; rather, you can refer to all previously defined CTEs. Forward references are not allowed: a CTE can refer to CTEs defined before it and to itself (see Recursive Queries later in this paper), but not to CTEs defined after it. For example, if you define CTEs C1, C2, and C3 in the same WITH statement, C2 can refer to C1 and to itself, but not to C3.

In another example, the following code generates a histogram that calculates the number of employees that fall within each of four ranges of order counts between the minimum and maximum. If the calculations appear complicated to you, do not spend time trying to figure them out. The purpose of this example is to use a practical scenario to demonstrate the declaration of multiple CTEs in the same WITH statement, where each might refer to previous CTEs.

WITH
EmpOrdersCTE(EmployeeID, Cnt)
AS
(
  SELECT EmployeeID, COUNT(*)
  FROM Purchasing.PurchaseOrderHeader
  GROUP BY EmployeeID
),
MinMaxCTE(MN, MX, Diff)
AS
(
  SELECT MIN(Cnt), MAX(Cnt), MAX(Cnt)-MIN(Cnt)
  FROM EmpOrdersCTE
),
NumsCTE(Num)
AS
(
  SELECT 1 AS Num
  UNION ALL SELECT 2
  UNION ALL SELECT 3
  UNION ALL SELECT 4
),
StepsCTE(Step, Fromval, Toval)
AS
(
  SELECT
    Num,
    CAST(MN + ROUND((Num-1)*((Diff+1)/4.), 0) AS INT),
    CAST(MN + ROUND((Num)*((Diff+1)/4.), 0) - 1 AS INT)
  FROM MinMaxCTE CROSS JOIN NumsCTE
),
HistogramCTE(Step, Fromval, Toval, Samples)
AS
(
  SELECT S.Step, S.Fromval, S.Toval, COUNT(EmployeeID)
  FROM StepsCTE AS S
    LEFT OUTER JOIN EmpOrdersCTE AS OE
      ON OE.Cnt BETWEEN S.Fromval AND S.Toval
  GROUP BY S.Step, S.Fromval, S.Toval
)
SELECT * FROM HistogramCTE

Here is the result set:

Step        Fromval     Toval       Samples
----------- ----------- ----------- -----------
1           160         219         2
2           220         280         0
3           281         340         0
4           341         400         10

Notice that the second CTE (MinMaxCTE) refers to the first (EmpOrdersCTE); the third (NumsCTE) does not refer to any CTE. The fourth (StepsCTE) refers to the second and third CTEs, and the fifth (HistogramCTE) refers to the first and fourth CTEs.

Recursive Queries

Nonrecursive CTEs increase your expressive power. But for each piece of code that uses nonrecursive CTEs, you can usually write shorter code that achieves the same results by using other Transact-SQL constructs, such as derived tables. The case is different with recursive CTEs. This section describes the semantics of recursive queries and provides practical implementations for a hierarchy of employees in an organizational chart, and for a bill of materials (BOM) scenario.

Semantics

When a CTE refers to itself, it is considered to be recursive. Recursive CTEs are constructed from at least two queries (or members in recursive query parlance). One is a nonrecursive query, also referred to as the anchor member (AM). The other is the recursive query, also referred to as the recursive member (RM). The queries are separated by a UNION ALL operator. The following example shows a simplified generic form of a recursive CTE:

WITH RecursiveCTE(<column_list>)
AS
(
  -- Anchor Member:
  -- SELECT query that does not refer to RecursiveCTE
  SELECT ... 
  FROM <some_table(s)>
  ...
  UNION ALL
  -- Recursive Member
  -- SELECT query that refers to RecursiveCTE
  SELECT ...
  FROM <some_table(s)>
    JOIN RecursiveCTE
  ...
)
-- Outer Query
SELECT ...
FROM RecursiveCTE
...

Logically you can think of the algorithm implementing the recursive CTE as:

  1. The anchor member is activated. Set R0 (R for Results) is generated.
  2. The recursive member is activated, getting set Ri (i = step number) as input when referring to RecursiveCTE. Set R(i+1) is generated.
  3. The logic of Step 2 is run repeatedly (incrementing the step number in each iteration) until an empty set is returned.
  4. The outer query is executed, getting the cumulative (UNION ALL) result of all of the previous steps when referring to RecursiveCTE.

You can have more than two members in the CTE, but only a UNION ALL operator is allowed between a recursive member and another member (recursive or nonrecursive). Other operators, such as UNION, are allowed only between nonrecursive members. Unlike regular UNION and UNION ALL operators that support implicit conversion, recursive CTEs require an exact match of the columns in all members, including the same data type, length, and precision.
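
A contrived minimal sketch of this requirement: both members cast the string column to the same varchar length, because otherwise the anchor's one-character literal and the recursive member's concatenation result would have different lengths:

WITH StringsCTE(str)
AS
(
  -- Anchor member: cast the literal to the agreed-upon type
  SELECT CAST('x' AS VARCHAR(5))
  UNION ALL
  -- Recursive member: the concatenation is cast to the same type
  SELECT CAST(str + 'x' AS VARCHAR(5))
  FROM StringsCTE
  WHERE LEN(str) < 5
)
SELECT str FROM StringsCTE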

There are similarities between recursive CTEs and classic recursive routines (not necessarily specific to SQL Server). Recursive routines usually consist of three important elements—the first invocation of the routine, a recursive termination check, and a recursive call to the same routine. The anchor member in a recursive CTE corresponds to the first invocation of the routine in a classic recursive routine. The recursive member corresponds to the recursive invocation of the routine. The termination check, which is usually explicit in recursive routines (for example, by means of an IF statement), is implicit in a recursive CTE—recursion stops when no rows are returned from the previous invocation.

The following sections present practical examples and uses of recursive CTEs in single-parent and multiparent environments.

Single-Parent Environment: Employees Organizational Chart

For a single-parent hierarchy scenario, an employees organizational chart is used.

Note The examples in this section use a table called Employees that has a structure that is different from the HumanResources.Employee table in AdventureWorks. You should run the code in your own test database or in tempdb, not in AdventureWorks.

The following code generates the Employees table and populates it with sample data:

USE tempdb -- or your own test database
CREATE TABLE Employees
(
  empid   int         NOT NULL,
  mgrid   int         NULL,
  empname varchar(25) NOT NULL,
  salary  money       NOT NULL,
  CONSTRAINT PK_Employees PRIMARY KEY(empid),
  CONSTRAINT FK_Employees_mgrid_empid
    FOREIGN KEY(mgrid)
    REFERENCES Employees(empid)
)
CREATE INDEX idx_nci_mgrid ON Employees(mgrid)
SET NOCOUNT ON
INSERT INTO Employees VALUES(1 , NULL, 'Nancy'   , $10000.00)
INSERT INTO Employees VALUES(2 , 1   , 'Andrew'  , $5000.00)
INSERT INTO Employees VALUES(3 , 1   , 'Janet'   , $5000.00)
INSERT INTO Employees VALUES(4 , 1   , 'Margaret', $5000.00) 
INSERT INTO Employees VALUES(5 , 2   , 'Steven'  , $2500.00)
INSERT INTO Employees VALUES(6 , 2   , 'Michael' , $2500.00)
INSERT INTO Employees VALUES(7 , 3   , 'Robert'  , $2500.00)
INSERT INTO Employees VALUES(8 , 3   , 'Laura'   , $2500.00)
INSERT INTO Employees VALUES(9 , 3   , 'Ann'     , $2500.00)
INSERT INTO Employees VALUES(10, 4   , 'Ina'     , $2500.00)
INSERT INTO Employees VALUES(11, 7   , 'David'   , $2000.00)
INSERT INTO Employees VALUES(12, 7   , 'Ron'     , $2000.00)
INSERT INTO Employees VALUES(13, 7   , 'Dan'     , $2000.00)
INSERT INTO Employees VALUES(14, 11  , 'James'   , $1500.00)

Each employee reports to a manager whose ID is stored in the mgrid column. A foreign key is defined on the mgrid column referencing the empid column, meaning that a manager ID must either correspond to a valid employee ID within the table or be NULL. Nancy, the boss, has NULL in the mgrid column. Manager-employee relationships are shown in Figure 1.

Figure 1. Employees organizational chart

The following are some common requests that might be run on the Employees table:

  • Show me details about Robert (empid=7) and all of his subordinates in all levels.
  • Show me details about all employees that are two levels under Janet (empid=3).
  • Show me the chain of management leading to James (empid=14).
  • Show me how many employees report to each manager directly or indirectly.
  • Show me all of the employees in such a way that it will be easy to see their hierarchical dependencies.

Recursive CTEs provide the means to deal with these requests, which are recursive in nature, without the need to maintain additional information in the database about the hierarchy.

The first request is probably the most common one: returning an employee (for example, Robert whose empid=7) and his/her subordinates in all levels. The following CTE provides a solution to this request:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
  -- Anchor Member (AM)
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 7
  UNION ALL
  
  -- Recursive Member (RM)
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE

Here is the result set:

empid       empname                   mgrid       lvl        
----------- ------------------------- ----------- -----------
7           Robert                    3           0          
11          David                     7           1          
12          Ron                       7           1          
13          Dan                       7           1          
14          James                     11          2          

Following the recursive CTE logic described previously, this CTE is processed as follows:

  1. The anchor member is activated, returning Robert's row from the Employees table. Notice the constant 0 that is returned in the lvl result column.
  2. The recursive member is activated repeatedly, returning direct subordinates of the previous result by means of a join operation between Employees and EmpCTE. Employees represents the subordinates, and EmpCTE (which contains the result from the previous invocation) represents managers:
    • First, Robert's direct subordinates are returned: David, Ron, and Dan.
    • Then David's, Ron's, and Dan's subordinates are returned: only James.
    • Finally, James' subordinates are returned: none, in which case recursion terminates.
  3. The outer query returns all rows from EmpCTE.

Notice that the lvl value is repeatedly incremented with each recursive invocation.

Using this level counter you can limit the number of iterations in the recursion. For example, the following CTE is used to return all employees who are two levels below Janet:

WITH EmpCTEJanet(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 3
  UNION ALL
  
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees as E
    JOIN EmpCTEJanet as M
      ON E.mgrid = M.empid
  WHERE lvl < 2
)
SELECT empid, empname
FROM EmpCTEJanet
WHERE lvl = 2

Here is the result set:

empid       empname                  
----------- -------------------------
11          David                    
12          Ron                      
13          Dan                      

This code example differs from the previous one in the two lvl filters. The filter WHERE lvl < 2 in the recursive member serves as a recursion termination check: no rows are returned once lvl reaches 2, so recursion stops. The filter WHERE lvl = 2 in the outer query removes all levels below level 2. Note that logically the filter in the outer query (lvl = 2) is sufficient by itself to return only the desired rows. The filter in the recursive member (lvl < 2) is added for performance reasons: to stop the recursion early, as soon as two levels below Janet have been returned.

As mentioned earlier, CTEs can refer to local variables that are defined within the same batch. For example, to make the query more generic, you can use variables instead of constants for employee ID and level:

DECLARE @empid AS INT, @lvl AS INT
SET @empid = 3 -- Janet
SET @lvl   = 2 -- two levels
WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = @empid
  UNION ALL
  
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees as E
    JOIN EmpCTE as M
      ON E.mgrid = M.empid
  WHERE lvl < @lvl
)
SELECT empid, empname
FROM EmpCTE
WHERE lvl = @lvl

You can use a hint to force termination of the query after a certain number of recursive iterations have been invoked. You do that by adding OPTION(MAXRECURSION value) at the end of the outer query, as shown in the following example:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 1
  UNION ALL

  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees as E
    JOIN EmpCTE as M
      ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
OPTION (MAXRECURSION 2)

Here is the result set:

empid       empname                   mgrid       lvl        
----------- ------------------------- ----------- -----------
1           Nancy                     NULL        0          
2           Andrew                    1           1          
3           Janet                     1           1          
4           Margaret                  1           1          
10          Ina                       4           2          
7           Robert                    3           2          
8           Laura                     3           2          
9           Ann                       3           2          
.Net SqlClient Data Provider: Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 2 has been exhausted before statement completion

Results generated thus far might be returned (but are not guaranteed to be), and error 530 is generated. You might think of using the MAXRECURSION hint, instead of the filter in the recursive member, to implement the request that returns employees who are two levels below Janet:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 3
  UNION ALL
  
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees as E
    JOIN EmpCTE as M
      ON E.mgrid = M.empid
)
SELECT empid, empname
FROM EmpCTE
WHERE lvl = 2
OPTION (MAXRECURSION 2)

Here is the result set:

empid       empname                  
----------- -------------------------
11          David                    
12          Ron                      
13          Dan                      
.Net SqlClient Data Provider: Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 2 has been exhausted before statement completion

But keep in mind that, besides the fact that there is no guarantee that results will be returned, your client will get an error. It is not good programming practice to use code that returns errors in valid situations. It is recommended that you use the filter presented earlier and, if you want, the MAXRECURSION hint as a safeguard against infinite loops.

When this hint is not specified, SQL Server defaults to a value of 100. This value can be used as a safeguard when you suspect cyclic recursive calls. If you do not want to limit the number of recursive calls, set MAXRECURSION to 0 in the hint.
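
For example, the following sketch removes the recursion limit entirely; do this only when you are confident that the data cannot contain cycles:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
(
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 1
  UNION ALL
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
OPTION (MAXRECURSION 0) -- no limit on recursive iterations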

As an example of a cyclic relationship, suppose you had a bug in your data and Nancy's manager was accidentally changed to James (instead of no manager):

UPDATE Employees SET mgrid = 14 WHERE empid = 1

The following cycle is introduced: 1->3->7->11->14->1. If you try running code that returns Nancy and her direct and indirect subordinates at all levels, you get an error indicating that the default maximum recursion of 100 was exhausted before the statement completed:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 1
  UNION ALL
  
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1
  FROM Employees AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT * FROM EmpCTE
Msg 530, Level 16, State 1, Line 1
Statement terminated. Maximum recursion 100 has been exhausted before statement completion

Of course it is good to have a safety measure that prevents infinite recursive calls, but MAXRECURSION doesn't help you much in isolating the cycle and resolving the bug in the data. In order to isolate the cycle, you can use a CTE that will construct, for each employee, an enumerated path of all of the employee IDs leading to the employee. Call this result column path. In the recursive member, use a CASE expression to check whether the current employee ID already appears in the manager's path using a LIKE predicate. If it does, this means you found a cycle. If a cycle is found, return 1 in a result column called cycle, otherwise return 0. Also, add a filter to the recursive member that ensures that only subordinates of managers for which a cycle was not detected will be returned. Finally, add a filter to the outer query that returns only employees for which a cycle was found (cycle = 1):

WITH EmpCTE(empid, path, cycle)
AS
( 
  SELECT empid,
    CAST('.' + CAST(empid AS VARCHAR(10)) + '.' AS VARCHAR(900)),
    0
  FROM Employees
  WHERE empid = 1
  UNION ALL
  
  SELECT E.empid,
    CAST(M.path + CAST(E.empid AS VARCHAR(10)) + '.' AS VARCHAR(900)),
    CASE
      WHEN M.path LIKE '%.' + CAST(E.empid AS VARCHAR(10)) + '.%' THEN 1
      ELSE 0
    END
  FROM Employees AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
  WHERE M.cycle = 0
)
SELECT path FROM EmpCTE
WHERE cycle = 1

Here is the result set:

path
---------------
.1.3.7.11.14.1.

Note that corresponding columns in both the anchor member and the recursive member must have the same data type, length, and precision. That's why the expression generating the path value is converted to varchar(900) in both members. Once the cycle is detected, you can fix the bug in your data by changing Nancy's manager back to no manager:

UPDATE Employees SET mgrid = NULL WHERE empid = 1

The recursive examples provided up to this point have an anchor member that is a manager and a recursive member that retrieves subordinates. Some requests require the opposite; for example, a request to return James' management path (James and all of his managers at all levels). The following code provides an answer for this request:

WITH EmpCTE(empid, empname, mgrid, lvl)
AS
( 
  SELECT empid, empname, mgrid, 0
  FROM Employees
  WHERE empid = 14
  UNION ALL
  
  SELECT M.empid, M.empname, M.mgrid, E.lvl+1
  FROM Employees as M
    JOIN EmpCTE as E
      ON M.empid = E.mgrid
)
SELECT * FROM EmpCTE

Here is the result set:

empid       empname                   mgrid       lvl        
----------- ------------------------- ----------- -----------
14          James                     11          0          
11          David                     7           1          
7           Robert                    3           2          
3           Janet                     1           3          
1           Nancy                     NULL        4          

The anchor member returns James' row. The recursive member returns the manager of the employee returned in the previous step (a single manager per step, because this is a single-parent hierarchy and the request starts with a single employee).

You can also use recursive queries to calculate aggregations, such as the number of subordinates that report to each manager directly or indirectly:

WITH MgrCTE(mgrid, lvl)
AS
(
  SELECT mgrid, 0
  FROM Employees
  WHERE mgrid IS NOT NULL
  UNION ALL
  SELECT M.mgrid, lvl + 1
  FROM Employees AS M
    JOIN MgrCTE AS E
      ON E.mgrid = M.empid
  WHERE M.mgrid IS NOT NULL
)
SELECT mgrid, COUNT(*) AS cnt
FROM MgrCTE
GROUP BY mgrid

Here is the result set:

mgrid       cnt        
----------- -----------
1           13         
2           2          
3           7          
4           1          
7           4          
11          1          

The anchor member returns a row with the manager ID for each employee. NULL in the manager ID column is excluded because it represents no specific manager. The recursive member returns the manager IDs of the managers of the previously returned managers; again, NULLs are excluded. Eventually, the CTE contains, for each manager, as many occurrences as that manager's number of direct or indirect subordinates. The outer query is left with the tasks of grouping the result by manager ID and returning the count of occurrences.
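
If you also want each manager's name next to the count, here is a sketch that joins the aggregated result back to the Employees table:

WITH MgrCTE(mgrid, lvl)
AS
(
  SELECT mgrid, 0
  FROM Employees
  WHERE mgrid IS NOT NULL
  UNION ALL
  SELECT M.mgrid, lvl + 1
  FROM Employees AS M
    JOIN MgrCTE AS E
      ON E.mgrid = M.empid
  WHERE M.mgrid IS NOT NULL
)
-- Join back to Employees to retrieve each manager's name
SELECT C.mgrid, E.empname, COUNT(*) AS cnt
FROM MgrCTE AS C
  JOIN Employees AS E
    ON C.mgrid = E.empid
GROUP BY C.mgrid, E.empname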

As another example of a request against a single-parent hierarchy, suppose you want to return Nancy's subordinates sorted and indented according to hierarchical dependencies. The following code does just that, sorting siblings according to their employee IDs:

WITH EmpCTE(empid, empname, mgrid, lvl, sortcol)
AS
( 
  SELECT empid, empname, mgrid, 0,
    CAST(empid AS VARBINARY(900))
  FROM Employees
  WHERE empid = 1
  UNION ALL
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1,
    CAST(sortcol + CAST(E.empid AS BINARY(4)) AS VARBINARY(900))
  FROM Employees AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT
  REPLICATE(' | ', lvl)
    + '(' + (CAST(empid AS VARCHAR(10))) + ') '
    + empname AS empname
FROM EmpCTE
ORDER BY sortcol

Here is the result set:
(1) Nancy
 | (2) Andrew
 |  | (5) Steven
 |  | (6) Michael
 | (3) Janet
 |  | (7) Robert
 |  |  | (11) David
 |  |  |  | (14) James
 |  |  | (12) Ron
 |  |  | (13) Dan
 |  | (8) Laura
 |  | (9) Ann
 | (4) Margaret
 |  | (10) Ina

To sort siblings according to the empid value, form a binary string called sortcol for each employee. The string is made of concatenated employee IDs in the chain of management leading to each employee, converted to binary values. The anchor member is the starting point. It generates a binary value with the empid of the root employee. In each iteration, the recursive member appends the current employee ID converted to a binary value to the manager's sortcol. The outer query then sorts the result by sortcol. Remember that corresponding columns in both the anchor member and the recursive member must have the same data type, length, and precision. That's why the expression generating the sortcol value is converted to varbinary(900), even though an integer requires only 4 bytes in its binary representation: 900 bytes cover 225 levels, which seems like a more than reasonable limit. If you want support for more levels, you can increase this length, but make sure you do so in both members; otherwise, you will get an error.

Hierarchical indentation is achieved by replicating a character string (' | ' in this case) as many times as the employee's level number. To that, the employee ID itself is appended within parentheses, and finally the employee name is appended.

A similar technique can be used to sort siblings by other attributes that can be converted to small fixed-length binary values; for example, the employee's hire date stored in a smalldatetime column. If you want to sort siblings by attributes that are not convertible to small fixed-sized binary values, such as by employee name, you can first produce integer row numbers (for details on row numbers, see the section "Ranking Functions" earlier in this paper) partitioned by manager ID representing the desired sort like so:

SELECT empid, empname, mgrid,
  ROW_NUMBER() OVER(PARTITION BY mgrid ORDER BY empname) AS pos
FROM Employees

And instead of concatenating employee ID's converted to binary values, concatenate the employee positions converted to binary values:

WITH EmpPos(empid, empname, mgrid, pos)
AS
(
  SELECT empid, empname, mgrid,
    ROW_NUMBER() OVER(PARTITION BY mgrid ORDER BY empname) AS pos
  FROM Employees
),
EmpCTE(empid, empname, mgrid, lvl, sortcol)
AS
( 
  SELECT empid, empname, mgrid, 0,
    CAST(pos AS VARBINARY(900))
  FROM EmpPos
  WHERE empid = 1
  UNION ALL
  SELECT E.empid, E.empname, E.mgrid, M.lvl+1,
    CAST(sortcol + CAST(E.pos AS BINARY(4)) AS VARBINARY(900))
  FROM EmpPos AS E
    JOIN EmpCTE AS M
      ON E.mgrid = M.empid
)
SELECT
  REPLICATE(' | ', lvl)
    + '(' + (CAST(empid AS VARCHAR(10))) + ') '
    + empname AS empname
FROM EmpCTE
ORDER BY sortcol

Here is the result set:
(1) Nancy
 | (2) Andrew
 |  | (6) Michael
 |  | (5) Steven
 | (3) Janet
 |  | (9) Ann
 |  | (8) Laura
 |  | (7) Robert
 |  |  | (13) Dan
 |  |  | (11) David
 |  |  |  | (14) James
 |  |  | (12) Ron
 | (4) Margaret
 |  | (10) Ina

To sort siblings by any other attribute or combination of attributes, simply specify the desired attributes in the ORDER BY list of the ROW_NUMBER function's OVER option instead of empname.
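
For example, here is a sketch of the position calculation for sorting siblings by salary (highest first), with the employee name as a tiebreaker:

SELECT empid, empname, mgrid,
  ROW_NUMBER() OVER(PARTITION BY mgrid
                    ORDER BY salary DESC, empname) AS pos
FROM Employees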

Multiparent Environment: Bill of Materials

In the previous section, CTEs are used to handle hierarchies in which each node in the tree has only a single parent. A more complex scenario of relationships is a graph in which each node might have more than one parent. This section describes the use of CTEs in a bill of materials (BOM) scenario. The BOM is a directed acyclic graph, meaning that each node can have more than one parent; a node cannot be a parent of itself, directly or indirectly; and the relationship between two nodes is not mutual (for example, A contains C, but C does not contain A). Figure 2 shows the relationships between items in a BOM scenario.

Figure 2. Multiparent environment

Item A contains items D, B, and C; item C contains items B and E; item B is contained in items A and C; and so on. The following code creates the Items and BOM tables and populates them with sample data:

CREATE TABLE Items
(
  itemid   VARCHAR(5)  NOT NULL PRIMARY KEY,
  itemname VARCHAR(25) NOT NULL
  /* other columns, e.g., unit_price, measurement_unit */
)
CREATE TABLE BOM
(
  itemid     VARCHAR(5) NOT NULL REFERENCES Items,
  containsid VARCHAR(5) NOT NULL REFERENCES Items,
  qty        INT        NOT NULL,
  /* other columns */
  PRIMARY KEY(itemid, containsid),
  CHECK (itemid <> containsid)
)
SET NOCOUNT ON
INSERT INTO Items(itemid, itemname) VALUES('A', 'Item A')
INSERT INTO Items(itemid, itemname) VALUES('B', 'Item B')
INSERT INTO Items(itemid, itemname) VALUES('C', 'Item C')
INSERT INTO Items(itemid, itemname) VALUES('D', 'Item D')
INSERT INTO Items(itemid, itemname) VALUES('E', 'Item E')
INSERT INTO Items(itemid, itemname) VALUES('F', 'Item F')
INSERT INTO Items(itemid, itemname) VALUES('G', 'Item G')
INSERT INTO Items(itemid, itemname) VALUES('H', 'Item H')
INSERT INTO Items(itemid, itemname) VALUES('I', 'Item I')
INSERT INTO Items(itemid, itemname) VALUES('J', 'Item J')
INSERT INTO Items(itemid, itemname) VALUES('K', 'Item K')
INSERT INTO BOM(itemid, containsid, qty) VALUES('E', 'J', 1)
INSERT INTO BOM(itemid, containsid, qty) VALUES('C', 'E', 3)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'C', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('H', 'C', 4)
INSERT INTO BOM(itemid, containsid, qty) VALUES('C', 'B', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('B', 'F', 1)
INSERT INTO BOM(itemid, containsid, qty) VALUES('B', 'G', 3)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'B', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('A', 'D', 2)
INSERT INTO BOM(itemid, containsid, qty) VALUES('H', 'I', 1)

The Items table contains a row for each item. The BOM table contains the relationships between the nodes in the graph. Each relationship is made up of the parent item ID (itemid), the child item ID (containsid), and the quantity of containsid within itemid (qty).

A common request in a BOM scenario is to "explode" an item: that is, to traverse the graph, starting with the given item and return all the items that it contains, directly or indirectly. This might sound familiar because it's similar to returning a subtree out of a tree as in the Employees Organizational Chart. In a directed graph, however, the request is a bit more complex conceptually, because one contained item can be reached from several different containing items through different paths. For example, suppose you want to explode item A. Notice that two different paths lead from it to item B: A->B and A->C->B. This means that item B would be reached twice, which means that all of the items that B contains (F and G) would be reached twice. Fortunately, with CTEs, the implementation of such a request is as simple as implementing a request to get a subtree out of a tree:

WITH BOMCTE
AS
(
  SELECT *
  FROM BOM
  WHERE itemid = 'A'
  UNION ALL
  SELECT BOM.*
  FROM BOM
    JOIN BOMCTE
      ON BOM.itemid = BOMCTE.containsid
)
SELECT * FROM BOMCTE

Here is the result set:

itemid containsid qty        
------ ---------- -----------
A      B          2          
A      C          2          
A      D          2          
C      B          2          
C      E          3          
E      J          1          
B      F          1          
B      G          3          
B      F          1          
B      G          3          

The anchor member returns from BOM all the items that A directly contains. For each contained item returned by the previous iteration of the CTE, the recursive member returns the items that it contains by joining BOM to BOMCTE. Logically (though not necessarily in the order shown in the output), (A, B), (A, C), (A, D) are returned first; then (B, F), (B, G), (C, B), (C, E); and lastly (B, F), (B, G), (E, J). Note that most requests from a BOM do not require you to show an item more than once in the final results. You can use the DISTINCT clause to eliminate duplicates if you want to show only which items were involved in the explosion:

WITH BOMCTE
AS
(
  SELECT *
  FROM BOM
  WHERE itemid = 'A'
  UNION ALL
  SELECT BOM.*
  FROM BOM
    JOIN BOMCTE
      ON BOM.itemid = BOMCTE.containsid
)
SELECT DISTINCT containsid FROM BOMCTE

Here is the result set:

containsid
----------
B         
C         
D         
E         
F         
G         
J         

To help you understand the process of parts explosion, visualize its intermediate result as a tree in which all items are expanded to their contained items. Figure 3 shows the trees formed by exploding parts A and H along with the item quantities.

Figure 3. Parts explosion

Taking the original request a step further, you might be more interested in getting the cumulative quantities of each item rather than getting the items themselves. For example, A contains 2 units of C. C contains 3 units of E. E contains one unit of J. The total number of units of J required for A is a product of the quantities along the path leading from A to J: 2*3*1 = 6. Figure 4 shows the cumulative quantities of each item that makes A before aggregating the items.

Figure 4. Parts explosion: calculated quantities

The following CTE calculates the cumulative product of quantities:

WITH BOMCTE(itemid, containsid, qty, cumulativeqty)
AS
(
  SELECT *, qty
  FROM BOM
  WHERE itemid = 'A'
  UNION ALL
  SELECT BOM.*, BOM.qty * BOMCTE.cumulativeqty
  FROM BOM
    JOIN BOMCTE
      ON BOM.itemid = BOMCTE.containsid
)
SELECT * FROM BOMCTE

Here is the result set:

itemid containsid qty         cumulativeqty
------ ---------- ----------- -------------
A      B          2           2            
A      C          2           2            
A      D          2           2            
C      B          2           4            
C      E          3           6            
E      J          1           6            
B      F          1           4            
B      G          3           12           
B      F          1           2            
B      G          3           6            

This CTE adds the cumulativeqty column to the previous CTE. The anchor member returns the quantities of the contained items as cumulativeqty. For each of the next level's contained items, the recursive member multiplies the item's quantity by its containing item's cumulative quantity. Note that items that were reached through multiple paths appear multiple times in the results, each time with the cumulative quantity for that path. Such an output is not very meaningful by itself, but it is a helpful intermediate step toward the final results, in which each item appears only once. To get the total quantities of each item in A, have the outer query group the result by containsid:

WITH BOMCTE(itemid, containsid, qty, cumulativeqty)
AS
(
  SELECT *, qty
  FROM BOM
  WHERE itemid = 'A'
  UNION ALL
  SELECT BOM.*, BOM.qty * BOMCTE.cumulativeqty
  FROM BOM
    JOIN BOMCTE
      ON BOM.itemid = BOMCTE.containsid
)
SELECT containsid AS itemid, SUM(cumulativeqty) AS totalqty
FROM BOMCTE
GROUP BY containsid

Here is the result set:

itemid totalqty   
------ -----------
B      6          
C      2          
D      2          
E      6          
F      6          
G      18         
J      6          

PIVOT and UNPIVOT

PIVOT and UNPIVOT are new relational operators that you specify in the FROM clause of a query. They perform some manipulation on an input table-valued expression and produce an output table as a result. The PIVOT operator rotates rows into columns, possibly performing aggregations along the way. It widens the input table expression based on a given pivot column, generating an output table with a column for each unique value in the pivot column. The UNPIVOT operator performs the opposite operation of that performed by the PIVOT operator; it rotates columns into rows. It narrows the input table expression based on a pivot column.

PIVOT

The PIVOT operator is useful for handling open-schema scenarios and for generating crosstab reports.

In an open-schema scenario, you maintain entities with sets of attributes that are either not known ahead or different for each entity type. The users of your application define the attributes dynamically. Instead of predefining many columns and storing many null values in your tables, you split the attributes into different rows and store only the relevant attributes for each entity instance.

PIVOT allows you to generate crosstab reports for open-schema and other scenarios in which you rotate rows into columns, possibly calculating aggregations along the way and presenting the data in a useful form.

An example of an open-schema scenario is a database that keeps track of items put up for auction. Some attributes are relevant for all auction items, such as the item type, when it was made, and its initial price. Only the attributes that are relevant for all items are stored in the AuctionItems table:

CREATE TABLE AuctionItems
(
  itemid       INT          NOT NULL PRIMARY KEY NONCLUSTERED,
  itemtype     NVARCHAR(30) NOT NULL,
  whenmade     INT          NOT NULL,
  initialprice MONEY        NOT NULL,
  /* other columns */
)
CREATE UNIQUE CLUSTERED INDEX idx_uc_itemtype_itemid
  ON AuctionItems(itemtype, itemid)
INSERT INTO AuctionItems VALUES(1, N'Wine',     1822,      3000)
INSERT INTO AuctionItems VALUES(2, N'Wine',     1807,       500)
INSERT INTO AuctionItems VALUES(3, N'Chair',    1753,    800000)
INSERT INTO AuctionItems VALUES(4, N'Ring',     -501,   1000000)
INSERT INTO AuctionItems VALUES(5, N'Painting', 1873,   8000000)
INSERT INTO AuctionItems VALUES(6, N'Painting', 1889,   8000000)

Other attributes are specific to the item type, and new items of different types are continuously being added. Such attributes can be stored in a different ItemAttributes table in which each item attribute is stored in a different row. Each row contains the item ID, attribute name, and attribute value:

CREATE TABLE ItemAttributes
(
  itemid    INT          NOT NULL REFERENCES AuctionItems,
  attribute NVARCHAR(30) NOT NULL,
  value     SQL_VARIANT  NOT NULL, 
  PRIMARY KEY (itemid, attribute)
)
INSERT INTO ItemAttributes
  VALUES(1, N'manufacturer', CAST(N'ABC'              AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(1, N'type',         CAST(N'Pinot Noir'       AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(1, N'color',        CAST(N'Red'              AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(2, N'manufacturer', CAST(N'XYZ'              AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(2, N'type',         CAST(N'Porto'            AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(2, N'color',        CAST(N'Red'              AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(3, N'material',     CAST(N'Wood'             AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(3, N'padding',      CAST(N'Silk'             AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(4, N'material',     CAST(N'Gold'             AS NVARCHAR(15)))
INSERT INTO ItemAttributes
  VALUES(4, N'inscription',  CAST(N'One ring ...'     AS NVARCHAR(50)))
INSERT INTO ItemAttributes
  VALUES(4, N'size',         CAST(10                  AS INT))
INSERT INTO ItemAttributes
  VALUES(5, N'artist',       CAST(N'Claude Monet'     AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(5, N'name',         CAST(N'Field of Poppies' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(5, N'type',         CAST(N'Oil'              AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(5, N'height',       CAST(19.625              AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
  VALUES(5, N'width',        CAST(25.625              AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
  VALUES(6, N'artist',       CAST(N'Vincent Van Gogh' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(6, N'name',         CAST(N'The Starry Night' AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(6, N'type',         CAST(N'Oil'              AS NVARCHAR(30)))
INSERT INTO ItemAttributes
  VALUES(6, N'height',       CAST(28.75               AS NUMERIC(9,3)))
INSERT INTO ItemAttributes
  VALUES(6, N'width',        CAST(36.25               AS NUMERIC(9,3)))

Note that the sql_variant data type is used for the value column because different attribute values may be of different data types. For example, the size attribute stores an integer value, and the name attribute stores a character string.
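
Because sql_variant preserves the underlying type of each value, you can inspect it with the SQL_VARIANT_PROPERTY function. The following query is a small sketch showing the base type that each stored value retained:

SELECT itemid, attribute,
  CAST(SQL_VARIANT_PROPERTY(value, 'BaseType') AS SYSNAME) AS basetype,
  value
FROM ItemAttributes
ORDER BY itemid, attribute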

Suppose you want to present the data from the ItemAttributes table with a row for each item that is a painting (items 5,6) and a column for each attribute. Without the PIVOT operator, you must write a query such as:

SELECT
  itemid,
  MAX(CASE WHEN attribute = 'artist'  THEN value END) AS [artist],
  MAX(CASE WHEN attribute = 'name'    THEN value END) AS [name],
  MAX(CASE WHEN attribute = 'type'    THEN value END) AS [type],
  MAX(CASE WHEN attribute = 'height'  THEN value END) AS [height],
  MAX(CASE WHEN attribute = 'width'   THEN value END) AS [width]
FROM ItemAttributes AS ATR
WHERE itemid IN(5,6)
GROUP BY itemid

Here is the result set:

itemid artist           name             type       height width
------ ---------------- ---------------- ---------- ------ ------
5      Claude Monet     Field of Poppies Oil        19.625 25.625
6      Vincent Van Gogh The Starry Night Oil        28.750 36.250

The PIVOT operator allows you to maintain shorter and more readable code to achieve the same results:

SELECT *
FROM ItemAttributes AS ATR
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT
WHERE itemid IN(5,6)

As with most new features, understanding the PIVOT operator comes with experimentation and use. Some of the elements in the PIVOT syntax are apparent and only require that you figure out the relationship of these elements to the query that does not use the new operator. Others are hidden.

You may find the following terms helpful in understanding the semantics of the PIVOT operator:

table_expression

The virtual table on which the PIVOT operator works (the part in the query between the FROM clause and the PIVOT operator): ItemAttributes AS ATR in this case.

pivot_column

The column from table_expression whose values you want to rotate into result columns: attribute in this case.

column_list

The list of values from pivot_column that you want to present as result columns (in the parentheses that follow the IN keyword). These must be expressed as legal identifiers: [artist], [name], [type], [height], [width] in this case.

aggregate_function

The aggregate function that you use to generate the data or column values in the result: MAX() in this case.

value_column

The column from table_expression that you use as the argument for aggregate_function: value in this case.

group_by_list

The hidden part. All columns from table_expression, excluding pivot_column and value_column, are used to group the result: itemid in this case.

select_list

The list of columns following the SELECT clause that might include any column(s) from group_by_list and column_list. Aliases can be used to change the name of the result columns: * in this case returns all columns from group_by_list and column_list.

The PIVOT operator returns one row for each unique value in group_by_list as if you had a query with a GROUP BY clause and specified those columns. Notice that group_by_list is implied; it is not specified explicitly anywhere in the query. It contains all columns from table_expression excluding pivot_column and value_column. Understanding this is probably the key to understanding why queries you write with the PIVOT operator work as they do, and why you might get errors in some cases.

Possible result columns include values from group_by_list and <column_list>. If you specify an asterisk (*), the query returns both lists. The data part of the result columns or the result column values are calculated by aggregate_function with value_column as the argument.

The following code illustrates the different elements in the query that uses the PIVOT operator:

SELECT * -- itemid, [artist], [name], [type], [height], [width]
FROM ItemAttributes AS ATR
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT
WHERE itemid IN(5,6)

And the following relates the different elements to the query that does not use the PIVOT operator:

SELECT
  itemid,
  MAX(CASE WHEN attribute = 'artist'  THEN value END) AS [artist],
  MAX(CASE WHEN attribute = 'name'    THEN value END) AS [name],
  MAX(CASE WHEN attribute = 'type'    THEN value END) AS [type],
  MAX(CASE WHEN attribute = 'height'  THEN value END) AS [height],
  MAX(CASE WHEN attribute = 'width'   THEN value END) AS [width]
FROM ItemAttributes AS ATR
WHERE itemid IN(5,6)
GROUP BY itemid

Note that you must explicitly specify the values in <column_list>. The PIVOT operator does not provide an option to derive those dynamically from pivot_column in a static query. You can use dynamic SQL to construct the query string yourself to achieve this.
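
For example, the following sketch builds the column list from the distinct values in the attribute column and executes the constructed query with sp_executesql. The @cols and @sql variable names are illustrative only, and the resulting column order is not guaranteed:

DECLARE @cols AS NVARCHAR(4000), @sql AS NVARCHAR(4000)

-- Concatenate a list such as [artist],[color],[height],...
SELECT @cols = ISNULL(@cols + N',', N'') + QUOTENAME(attribute)
FROM (SELECT DISTINCT attribute FROM ItemAttributes) AS A

SET @sql = N'SELECT *
FROM ItemAttributes
  PIVOT
  (
    MAX(value)
    FOR attribute IN(' + @cols + N')
  ) AS PVT'

EXEC sp_executesql @sql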

Taking the previous PIVOT query a step further, suppose you want to return, for each auction item, all attributes relevant to paintings. You want to include those attributes that appear in AuctionItems and those that appear in ItemAttributes. You might try the following query, which returns an error:

SELECT *
FROM AuctionItems AS ITM
  JOIN ItemAttributes AS ATR
    ON ITM.itemid = ATR.itemid
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT
WHERE itemtype = 'Painting'

Here is the error message:

.Net SqlClient Data Provider: Msg 8156, Level 16, State 1, Line 1
The column 'itemid' was specified multiple times for 'PVT'.

Remember that PIVOT works on table_expression, which is the virtual table returned by the section in the query between the FROM clause and the PIVOT clause. In this query, the virtual table contains two instances of the itemid column—one originating from AuctionItems and the other from ItemAttributes. You might be tempted to revise the query as follows, but you will also get an error:

SELECT ITM.itemid, itemtype, whenmade, initialprice, 
  [artist], [name], [type], [height], [width]
FROM AuctionItems AS ITM
  JOIN ItemAttributes AS ATR
    ON ITM.itemid = ATR.itemid
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT
WHERE itemtype = 'Painting'

Here is the error message:

.Net SqlClient Data Provider: Msg 8156, Level 16, State 1, Line 1
The column 'itemid' was specified multiple times for 'PVT'.
.Net SqlClient Data Provider: Msg 107, Level 15, State 1, Line 1
The column prefix 'ITM' does not match with a table name or alias name used in the query.

As mentioned earlier, the PIVOT operator works on the virtual table returned by table_expression and not on the columns in the select_list. The select_list is evaluated after the PIVOT operator performs its manipulations and can refer only to group_by_list and column_list. That is why the ITM alias is no longer recognized in the select_list. If you understand this, you realize that you should provide PIVOT with a table_expression that contains only the columns you want to work on. This includes the grouping columns (only one occurrence of itemid plus itemtype, whenmade and initialprice), the pivot column (attribute), and the value column (value). You can achieve this by using CTEs or derived tables. Here is an example using a CTE:

WITH PNT
AS
(
  SELECT ITM.*, ATR.attribute, ATR.value
  FROM AuctionItems AS ITM
    JOIN ItemAttributes AS ATR
      ON ITM.itemid = ATR.itemid
  WHERE ITM.itemtype = 'Painting'
)
SELECT * FROM PNT
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT

Here is the result set:

itemid itemtype whenmade initialprice artist           name             type height width
------ -------- -------- ------------ ---------------- ---------------- ---- ------ -----
5      Painting 1873     8000000.0000 Claude Monet     Field of Poppies Oil  19.62  25.62
6      Painting 1889     8000000.0000 Vincent Van Gogh The Starry Night Oil  28.75  36.25

Here is an example using a derived table:

SELECT *
FROM (SELECT ITM.*, ATR.attribute, ATR.value
      FROM AuctionItems AS ITM
         JOIN ItemAttributes AS ATR
         ON ITM.itemid = ATR.itemid
      WHERE ITM.itemtype = 'Painting') AS PNT
  PIVOT
  (
    MAX(value)
    FOR attribute IN([artist], [name], [type], [height], [width])
  ) AS PVT

You can also use PIVOT when you want to generate a crosstab report to summarize data. For example, using the Purchasing.PurchaseOrderHeader table in the AdventureWorks database, suppose you want to return the number of orders each employee made using each ship method, pivoting the ship method IDs to columns. Keeping in mind that you should provide the PIVOT operator with only the relevant data, you use a derived table and write the following query:

SELECT EmployeeID, [1] AS SM1, [2] AS SM2,
  [3] AS SM3, [4] AS SM4, [5] AS SM5
FROM (SELECT PurchaseOrderID, EmployeeID, ShipMethodID
      FROM Purchasing.PurchaseOrderHeader) ORD
  PIVOT
  (
   COUNT(PurchaseOrderID)
   FOR ShipMethodID IN([1], [2], [3], [4], [5])
  ) AS PVT

Here is the result set:

EmployeeID  SM1         SM2         SM3         SM4         SM5
----------- ----------- ----------- ----------- ----------- -----------
164         56          62          12          89          141
198         24          27          6           45          58
223         56          67          17          98          162
231         50          67          12          81          150
233         55          62          12          106         125
238         53          58          13          102         134
241         50          59          13          108         130
244         55          47          17          93          148
261         58          54          11          120         117
264         50          58          15          86          151
266         58          68          14          116         144
274         24          26          6           41          63

The COUNT(PurchaseOrderID) function counts the number of rows for each ship method in the list. Note that PIVOT disallows the use of COUNT(*). Column aliases are used to provide more descriptive names to the result columns. Using PIVOT to show an order count for each ship method in a different column is reasonable when you have a small number of ship methods whose IDs are known ahead of time.

You can also pivot values that are derived from expressions. For example, suppose you want to return the total freight value for each employee in each order year, pivoting the years to columns. The order year is derived from the OrderDate column:

SELECT EmployeeID, [2001] AS Y2001, [2002] AS Y2002,
  [2003] AS Y2003, [2004] AS Y2004
FROM (SELECT EmployeeID, YEAR(OrderDate) AS OrderYear, Freight
      FROM Purchasing.PurchaseOrderHeader) AS ORD
  PIVOT
  (
   SUM(Freight)
   FOR OrderYear IN([2001], [2002], [2003], [2004])
  ) AS PVT

Here is the result set:

EmployeeID  Y2001       Y2002       Y2003       Y2004
----------- ----------- ----------- ----------- ------------
164         509.9325    14032.0215  34605.3459  105087.7428
198         NULL        5344.4771   14963.0595  45020.9178
223         365.7019    12496.0776  37489.2896  117599.4156
231         6.8025      9603.0502   37604.3258  75435.8619
233         1467.1388   9590.7355   32988.0643  98603.745
238         17.3345     9745.1001   37836.583   100106.3678
241         221.1825    6865.7299   35559.3883  114430.983
244         5.026       5689.4571   35449.316   74690.3755
261         NULL        10483.27    32854.9343  73992.8431
264         NULL        10337.3207  37170.1957  82406.4474
266         4.2769      9588.8228   38533.9582  115291.2472
274         NULL        1877.2665   13708.9336  41011.3821

Crosstab reports are common in data warehouse scenarios. Consider the following OrdersFact table, which you populate with sales orders and sales order details data from AdventureWorks:

CREATE TABLE OrdersFact
(
  OrderID    INT      NOT NULL,
  ProductID  INT      NOT NULL,
  CustomerID NCHAR(5) NOT NULL,
  OrderYear  INT      NOT NULL,
  OrderMonth INT      NOT NULL,
  OrderDay   INT      NOT NULL,
  Quantity   INT      NOT NULL,
  PRIMARY KEY(OrderID, ProductID)
)
INSERT INTO OrdersFact
  SELECT O.SalesOrderID, OD.ProductID, O.CustomerID,
    YEAR(O.OrderDate) AS OrderYear, MONTH(O.OrderDate) AS OrderMonth,
    DAY(O.OrderDate) AS OrderDay, OD.OrderQty
  FROM Sales.SalesOrderHeader AS O
    JOIN Sales.SalesOrderDetail AS OD
      ON O.SalesOrderID = OD.SalesOrderID

To get the total quantities for each year and month, returning years in rows and months in columns, you use the following query:

SELECT *
FROM (SELECT OrderYear, OrderMonth, Quantity
      FROM OrdersFact) AS ORD
 PIVOT
 (
  SUM(Quantity)
  FOR OrderMonth IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])
 ) AS PVT

Here is the result set:

OrderYear  1     2     3     4     5     6     7     8     9     10    11    12
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
2001       NULL  NULL  NULL  NULL  NULL  NULL  966   2209  1658  1403  3132  2480
2002       1040  2303  1841  1467  3179  2418  7755  11325 9066  5584  8268  6672
2003       3532  5431  4132  5694  8278  6444  11288 18986 18681 11607 14771 15855
2004       9227  10999 11314 12239 15656 15805 2209  NULL  NULL  NULL  NULL  NULL

PIVOT returns null values for nonexistent intersections between year and month. A year appears in the result if it appears in the input table expression (the derived table ORD), regardless of whether or not it has an intersection with any of the specified months. This means that you might get a row with NULL in all columns if you do not specify all existing months. However, null values in the result do not necessarily represent nonexistent intersections. They might result from base null values in the quantity column, unless the column disallows null values. If you want to override NULL and see another value instead, for example 0, you can do so by using the ISNULL() function in the select list:

SELECT OrderYear,
  ISNULL([1],  0) AS M01,
  ISNULL([2],  0) AS M02,
  ISNULL([3],  0) AS M03,
  ISNULL([4],  0) AS M04,
  ISNULL([5],  0) AS M05,
  ISNULL([6],  0) AS M06,
  ISNULL([7],  0) AS M07,
  ISNULL([8],  0) AS M08,
  ISNULL([9],  0) AS M09,
  ISNULL([10], 0) AS M10,
  ISNULL([11], 0) AS M11,
  ISNULL([12], 0) AS M12
FROM (SELECT OrderYear, OrderMonth, Quantity
      FROM OrdersFact) AS ORD
 PIVOT
 (
  SUM(Quantity)
  FOR OrderMonth IN([1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12])
 ) AS PVT

Here is the result set:

OrderYear  1     2     3     4     5     6     7     8     9     10    11    12
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
2001       0     0     0     0     0     0     966   2209  1658  1403  3132  2480
2002       1040  2303  1841  1467  3179  2418  7755  11325 9066  5584  8268  6672
2003       3532  5431  4132  5694  8278  6444  11288 18986 18681 11607 14771 15855
2004       9227  10999 11314 12239 15656 15805 2209  0     0     0     0     0

Using ISNULL(Quantity, 0) within the derived table would only take care of base null values in the Quantity column (if such existed) not of null values that PIVOT generated for nonexistent intersections.

Suppose you want to return the total quantities for each customer ID in the range 1–9 against a combination of year and month values in the first quarter of each of the years 2003 and 2004. To get the year and month values in rows and customer IDs in columns, use the following query:

SELECT *
FROM (SELECT CustomerID, OrderYear, OrderMonth, Quantity
      FROM OrdersFact
      WHERE CustomerID BETWEEN 1 AND 9
        AND OrderYear IN(2003, 2004)
        AND OrderMonth IN(1, 2, 3)) AS ORD
  PIVOT
  (
    SUM(Quantity)
    FOR CustomerID IN([1],[2],[3],[4],[5],[6],[7],[8],[9])
  ) AS PVT

Here is the result set:

OrderYear   OrderMonth  1    2    3    4    5    6    7    8    9
----------- ----------- ---- ---- ---- ---- ---- ---- ---- ---- ----
2003        1           NULL NULL NULL 105  NULL NULL 8    NULL NULL
2004        1           NULL NULL NULL 80   NULL NULL NULL NULL NULL
2003        2           NULL 5    NULL NULL NULL NULL NULL NULL 15
2004        2           NULL 10   NULL NULL NULL NULL NULL 6    3
2003        3           NULL NULL 105  NULL 15   NULL NULL NULL NULL
2004        3           NULL NULL 103  NULL 25   4    NULL NULL NULL 

The implied group-by list in this case is OrderYear and OrderMonth, because CustomerID and Quantity are used as the pivot and value columns, respectively.

However, if you want the combination of year and month values to appear as columns, you must concatenate them yourself before passing them to the PIVOT operator, because there can be only one pivot column:

SELECT *
FROM (SELECT CustomerID, OrderYear*100+OrderMonth AS YM, Quantity
      FROM OrdersFact
      WHERE CustomerID BETWEEN 1 AND 9
        AND OrderYear IN(2003, 2004)
        AND OrderMonth IN(1, 2, 3)) AS ORD
  PIVOT
  (
    SUM(Quantity)
    FOR YM IN([200301],[200302],[200303],[200401],[200402],[200403])
  ) AS PVT

Here is the result set:

CustomerID 200301 200302 200303 200401 200402 200403
---------- ------ ------ ------ ------ ------ ------
2          NULL   5      NULL   NULL   10     NULL
3          NULL   NULL   105    NULL   NULL   103
6          NULL   NULL   NULL   NULL   NULL   4
4          105    NULL   NULL   80     NULL   NULL
8          NULL   NULL   NULL   NULL   6      NULL
5          NULL   NULL   15     NULL   NULL   25
7          8      NULL   NULL   NULL   NULL   NULL
9          NULL   15     NULL   NULL   3      NULL

UNPIVOT

The UNPIVOT operator allows you to normalize prepivoted data. The syntax and elements of the UNPIVOT operator are similar to those of the PIVOT operator.

For example, consider the AuctionItems table from the previous section:

itemid      itemtype                 whenmade    initialprice
----------- ------------------------ ----------- --------------
1           Wine                     1822        3000.0000
2           Wine                     1807        500.0000
3           Chair                    1753        800000.0000
4           Ring                     -501        1000000.0000
5           Painting                 1873        8000000.0000
6           Painting                 1889        8000000.0000

Suppose you want each attribute to appear in a different row similar to the way attributes are kept in the ItemAttributes table:

itemid      attribute       value
----------- --------------- -------
1           itemtype        Wine
1           whenmade        1822
1           initialprice    3000.00
2           itemtype        Wine
2           whenmade        1807
2           initialprice    500.00
3           itemtype        Chair
3           whenmade        1753
3           initialprice    800000.00
4           itemtype        Ring
4           whenmade        -501
4           initialprice    1000000.00
5           itemtype        Painting
5           whenmade        1873
5           initialprice    8000000.00
6           itemtype        Painting
6           whenmade        1889
6           initialprice    8000000.00

In the UNPIVOT query, you want to rotate the columns itemtype, whenmade, and initialprice to rows. Each row should have the item ID, attribute, and value. The new column names that you must provide are attribute and value. They correspond to the pivot_column and value_column in the PIVOT operator. The attribute column should get the actual column names that you want to rotate (itemtype, whenmade, and initialprice) as values. The value column should get the values from the three different source columns into one destination column. To help clarify, an invalid version of the UNPIVOT query is presented first, followed by a valid one that applies the required restrictions:

SELECT itemid, attribute, value
FROM AuctionItems
  UNPIVOT
  (
    value FOR attribute IN([itemtype], [whenmade], [initialprice])
  ) AS UPV

As arguments to the UNPIVOT operator, you provide a name for value_column (value in this case) followed by the FOR clause. Following the FOR clause, provide a name for pivot_column (attribute in this case) and then an IN clause with the list of source column names you want to get as values in pivot_column. This list of columns corresponds to <column_list> in the PIVOT operator. This query generates the following error:

.Net SqlClient Data Provider: Msg 8167, Level 16, State 1, Line 1
Type of column 'whenmade' conflicts with the type of other columns specified in the UNPIVOT list.

The destination value column contains values originating from several different source columns (those that appear in <column_list>). Because the target of all column values is a single column, UNPIVOT requires that all columns in <column_list> have the same data type, length, and precision. To meet this restriction, you can provide the UNPIVOT operator with a table expression that converts the three columns to the same data type. The sql_variant data type is a good candidate because you can convert the different source columns to the same data type and still retain their original data types. Applying this restriction, you revise the previous query as follows and get the result you want:

SELECT itemid, attribute, value
FROM (SELECT itemid,
        CAST(itemtype     AS SQL_VARIANT) AS itemtype,
        CAST(whenmade     AS SQL_VARIANT) AS whenmade,
        CAST(initialprice AS SQL_VARIANT) AS initialprice
      FROM AuctionItems) AS ITM
  UNPIVOT
  (
    value FOR attribute IN([itemtype], [whenmade], [initialprice])
  ) AS UPV

The data type of the result attribute column is sysname. This is the data type SQL Server uses for storing object names.

Note that the UNPIVOT operator eliminates null values in the value column from the result; therefore, it cannot be considered the exact reverse of the PIVOT operator.
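
The following sketch demonstrates this behavior with a small hypothetical table named T2:

CREATE TABLE T2
(
  keycol INT NOT NULL PRIMARY KEY,
  a      INT NULL,
  b      INT NULL
)
INSERT INTO T2 VALUES(1, 10, NULL)

-- Returns only the row (1, a, 10); no row is produced for
-- column b because its value is NULL.
SELECT keycol, col, val
FROM T2
  UNPIVOT
  (
    val FOR col IN([a], [b])
  ) AS UPV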

Having rotated the columns in AuctionItems into rows, you can now union the result of the UNPIVOT operation with the rows from ItemAttributes to provide a unified result:

SELECT itemid, attribute, value
FROM (SELECT itemid,
        CAST(itemtype     AS SQL_VARIANT) AS itemtype,
        CAST(whenmade     AS SQL_VARIANT) AS whenmade,
        CAST(initialprice AS SQL_VARIANT) AS initialprice
      FROM AuctionItems) AS ITM
  UNPIVOT
  (
    value FOR attribute IN([itemtype], [whenmade], [initialprice])
  ) AS UPV
UNION ALL
SELECT *
FROM ItemAttributes
ORDER BY itemid, attribute

Here is the result set:

itemid      attribute       value
----------- --------------- -------------
1           color           Red
1           initialprice    3000.00
1           itemtype        Wine
1           manufacturer    ABC
1           type            Pinot Noir
1           whenmade        1822
2           color           Red
2           initialprice    500.00
2           itemtype        Wine
2           manufacturer    XYZ
2           type            Porto
2           whenmade        1807
3           initialprice    800000.00
3           itemtype        Chair
3           material        Wood
3           padding         Silk
3           whenmade        1753
4           initialprice    1000000.00
4           inscription     One ring ...
4           itemtype        Ring
4           material        Gold
4           size            10
4           whenmade        -501
5           artist          Claude Monet
5           height          19.625
5           initialprice    8000000.00
5           itemtype        Painting
5           name            Field of Poppies
5           type            Oil
5           whenmade        1873
5           width           25.625
6           artist          Vincent Van Gogh
6           height          28.750
6           initialprice    8000000.00
6           itemtype        Painting
6           name            The Starry Night
6           type            Oil
6           whenmade        1889
6           width           36.250

APPLY

The APPLY relational operator allows you to invoke a specified table-valued function once for each row of an outer table expression. You specify APPLY in the FROM clause of a query, similar to the way you use the JOIN relational operator. APPLY comes in two forms: CROSS APPLY and OUTER APPLY. With the APPLY operator, SQL Server 2005 Beta 2 allows you to refer to a table-valued function in a correlated subquery.

CROSS APPLY

CROSS APPLY invokes a table-valued function for each row in an outer table expression. You can refer to columns in the outer table as arguments to the table-valued function. CROSS APPLY returns a unified result set out of all of the results returned by the individual invocations of the table-valued function. If the table-valued function returns an empty set for a given outer row, that outer row is not returned in the result. For example, the following table-valued function accepts two integers as arguments and returns a table with one row, with the minimum and maximum values as columns:

CREATE FUNCTION dbo.fn_scalar_min_max(@p1 AS INT, @p2 AS INT) RETURNS TABLE
AS
RETURN
  SELECT
    CASE
      WHEN @p1 < @p2 THEN @p1
      WHEN @p2 < @p1 THEN @p2
      ELSE COALESCE(@p1, @p2)
    END AS mn,
    CASE
      WHEN @p1 > @p2 THEN @p1
      WHEN @p2 > @p1 THEN @p2
      ELSE COALESCE(@p1, @p2)
    END AS mx
GO
SELECT * FROM fn_scalar_min_max(10, 20)

Here is the result set:

mn          mx         
----------- -----------
10          20         

Given the following T1 table:

CREATE TABLE T1
(
  col1 INT NULL,
  col2 INT NULL
)

INSERT INTO T1 VALUES(10, 20)
INSERT INTO T1 VALUES(20, 10)
INSERT INTO T1 VALUES(NULL, 30)
INSERT INTO T1 VALUES(40, NULL)
INSERT INTO T1 VALUES(50, 50)

You want to invoke fn_scalar_min_max for each row in T1. You write a CROSS APPLY query as follows:

SELECT *
FROM T1 CROSS APPLY fn_scalar_min_max(col1, col2) AS M

Here is the result set:

col1        col2        mn          mx         
----------- ----------- ----------- -----------
10          20          10          20         
20          10          10          20         
NULL        30          30          30         
40          NULL        40          40         
50          50          50          50         

If the table-valued function returns multiple rows for a certain outer row, the outer row is returned multiple times. Consider the Employees table used earlier in this paper in the Recursive Queries and Common Table Expressions section (Employees Organizational Chart scenario). In the same database, you also create the following Departments table:

CREATE TABLE Departments
(
  deptid    INT NOT NULL PRIMARY KEY,
  deptname  VARCHAR(25) NOT NULL,
  deptmgrid INT NULL REFERENCES Employees
)
SET NOCOUNT ON
INSERT INTO Departments VALUES(1, 'HR',           2)
INSERT INTO Departments VALUES(2, 'Marketing',    7)
INSERT INTO Departments VALUES(3, 'Finance',      8)
INSERT INTO Departments VALUES(4, 'R&D',          9)
INSERT INTO Departments VALUES(5, 'Training',     4)
INSERT INTO Departments VALUES(6, 'Gardening', NULL)

Most departments have a manager ID that corresponds to an employee in the Employees table, but as in the case of the Gardening department, it is possible for a department to have no manager. Note that a manager in the Employees table does not necessarily also manage a department. The following table-valued function accepts an employee ID as an argument and returns that employee and all of his or her subordinates at all levels:

CREATE FUNCTION dbo.fn_getsubtree(@empid AS INT) RETURNS @TREE TABLE
(
  empid   INT NOT NULL,
  empname VARCHAR(25) NOT NULL,
  mgrid   INT NULL,
  lvl     INT NOT NULL
)
AS
BEGIN
  WITH Employees_Subtree(empid, empname, mgrid, lvl)
  AS
  ( 
    -- Anchor Member (AM)
    SELECT empid, empname, mgrid, 0
    FROM employees
    WHERE empid = @empid
    UNION all
    -- Recursive Member (RM)
    SELECT e.empid, e.empname, e.mgrid, es.lvl+1
    FROM employees AS e
      JOIN employees_subtree AS es
        ON e.mgrid = es.empid
  )
  INSERT INTO @TREE
    SELECT * FROM Employees_Subtree
  RETURN
END
GO

To return all of the subordinates in all levels for the manager of each department, use the following query:

SELECT *
FROM Departments AS D
  CROSS APPLY fn_getsubtree(D.deptmgrid) AS ST

Here is the result set:

deptid      deptname   deptmgrid   empid       empname    mgrid       lvl
----------- ---------- ----------- ----------- ---------- ----------- ---
1           HR         2           2           Andrew     1           0
1           HR         2           5           Steven     2           1
1           HR         2           6           Michael    2           1
2           Marketing  7           7           Robert     3           0
2           Marketing  7           11          David      7           1
2           Marketing  7           12          Ron        7           1
2           Marketing  7           13          Dan        7           1
2           Marketing  7           14          James      11          2
3           Finance    8           8           Laura      3           0
4           R&D        9           9           Ann        3           0
5           Training   4           4           Margaret   1           0
5           Training   4           10          Ina        4           1

There are two things to notice here. First, each row from Departments is duplicated as many times as there are rows returned from fn_getsubtree for the department's manager. Second, the Gardening department does not appear in the result because fn_getsubtree returned an empty set for it.

Another practical use of the CROSS APPLY operator answers a common request: returning n rows for each group. For example, the following function returns the requested number of most recent orders for a given customer:

USE AdventureWorks
GO
CREATE FUNCTION fn_getnorders(@custid AS INT, @n AS INT)
  RETURNS TABLE
AS
RETURN
  SELECT TOP(@n) *
  FROM Sales.SalesOrderHeader
  WHERE CustomerID = @custid
  ORDER BY OrderDate DESC
GO

Using the CROSS APPLY operator, you can get the two most recent orders for each customer by using the following simple query:

SELECT O.*
FROM Sales.Customer AS C
  CROSS APPLY fn_getnorders(C.CustomerID, 2) AS O

For more information about the enhancement to TOP, see "TOP Enhancements" later in this paper.

OUTER APPLY

OUTER APPLY is very similar to CROSS APPLY, but it also returns rows from the outer table for which the table-valued function returned an empty set. Null values are returned in the columns that come from the table-valued function. For example, revise the query against the Departments table from the previous section to use OUTER APPLY instead of CROSS APPLY and notice the last row in the output:

SELECT *
FROM Departments AS D
  OUTER APPLY fn_getsubtree(D.deptmgrid) AS ST

Here is the result set:

deptid      deptname   deptmgrid   empid       empname    mgrid       lvl
----------- ---------- ----------- ----------- ---------- ----------- ---
1           HR         2           2           Andrew     1           0
1           HR         2           5           Steven     2           1
1           HR         2           6           Michael    2           1
2           Marketing  7           7           Robert     3           0
2           Marketing  7           11          David      7           1
2           Marketing  7           12          Ron        7           1
2           Marketing  7           13          Dan        7           1
2           Marketing  7           14          James      11          2
3           Finance    8           8           Laura      3           0
4           R&D        9           9           Ann        3           0
5           Training   4           4           Margaret   1           0
5           Training   4           10          Ina        4           1
6           Gardening  NULL        NULL        NULL       NULL        NULL

Table-Valued Functions in Correlated Subqueries

In SQL Server 2000, you cannot refer to table-valued functions within a correlated subquery. Along with the introduction of the APPLY relational operator, SQL Server 2005 Beta 2 removes this restriction. Now, within a subquery, you can provide a table-valued function with columns from the outer query as arguments. For example, if you want to return only those departments whose manager has at least three employees, you can write the following query:

SELECT *
FROM Departments AS D
WHERE (SELECT COUNT(*)
       FROM fn_getsubtree(D.deptmgrid)) >= 3

Here is the result set:

deptid      deptname                  deptmgrid  
----------- ------------------------- -----------
1           HR                        2          
2           Marketing                 7          

Support for New DRI Actions:
SET DEFAULT and SET NULL

ANSI SQL defines four possible referential actions in support of a FOREIGN KEY constraint. You specify these actions to indicate how you want your system to react to a DELETE or UPDATE operation against a table that is referenced by a foreign key. SQL Server 2000 supports two of those actions: NO ACTION and CASCADE. SQL Server 2005 Beta 2 adds support for the SET DEFAULT and SET NULL referential actions.

The SET DEFAULT and SET NULL referential actions extend declarative referential integrity (DRI) capabilities. You use these options in conjunction with the ON UPDATE and ON DELETE clauses in a foreign key declaration. SET DEFAULT means that when, in a referenced table, you delete rows (ON DELETE) or update the referenced key (ON UPDATE), SQL Server sets the referencing column values of the related rows in the referencing table to the default value of the column. Similarly, SQL Server can react by setting the values to NULL if you use the SET NULL option, provided that the referencing column allows null values.

For example, the following Customers table has three real customers and a dummy customer:

CREATE TABLE Customers
(
  customerid CHAR(5) NOT NULL,
  /* other columns */
  CONSTRAINT PK_Customers PRIMARY KEY(customerid)
)

INSERT INTO Customers VALUES('DUMMY')
INSERT INTO Customers VALUES('FRIDA')
INSERT INTO Customers VALUES('GNDLF')
INSERT INTO Customers VALUES('BILLY')

The Orders table keeps track of orders. An order does not necessarily have to be assigned to a real customer. If you enter an order and do not specify a customer ID, the DUMMY customer ID is assigned to the order by default. Upon a delete from the Customers table, you want SQL Server to set NULL in the customerid column of the related rows in Orders. Orders with NULL in the customerid column become "orphans," that is, they belong to no customer. Suppose that you also want to allow updates to the customerid column in Customers. You might want to cascade the update to the related rows in Orders, but suppose that a business rule in your company dictates otherwise: orders belonging to a customer whose ID was changed should be related to the default customer (DUMMY). Upon an update to the customerid column in Customers, you want SQL Server to set the related customerid values in Orders to the column's default value, 'DUMMY'. You create the Orders table with a foreign key as follows and populate it with some orders:

CREATE TABLE Orders
(
  orderid    INT      NOT NULL,
  customerid CHAR(5)  NULL DEFAULT('DUMMY'),
  orderdate  DATETIME NOT NULL,
  CONSTRAINT PK_Orders PRIMARY KEY(orderid),
  CONSTRAINT FK_Orders_Customers
    FOREIGN KEY(customerid)
    REFERENCES Customers(customerid)
      ON DELETE SET NULL
      ON UPDATE SET DEFAULT
)
INSERT INTO Orders VALUES(10001, 'FRIDA', '20040101')
INSERT INTO Orders VALUES(10002, 'FRIDA', '20040102')
INSERT INTO Orders VALUES(10003, 'BILLY', '20040101')
INSERT INTO Orders VALUES(10004, 'BILLY', '20040103')
INSERT INTO Orders VALUES(10005, 'GNDLF', '20040104')
INSERT INTO Orders VALUES(10006, 'GNDLF', '20040105')

To test the SET NULL and SET DEFAULT options, issue the following DELETE and UPDATE statements:

DELETE FROM Customers
WHERE customerid = 'FRIDA'
UPDATE Customers
  SET customerid = 'DOLLY'
WHERE customerid = 'BILLY'

As a result, FRIDA's orders are assigned null values in the customerid column, and BILLY's orders are assigned DUMMY:

orderid     customerid orderdate             
----------- ---------- ----------------------
10001       NULL       1/1/2004 12:00:00 AM  
10002       NULL       1/2/2004 12:00:00 AM  
10003       DUMMY      1/1/2004 12:00:00 AM  
10004       DUMMY      1/3/2004 12:00:00 AM  
10005       GNDLF      1/4/2004 12:00:00 AM  
10006       GNDLF      1/5/2004 12:00:00 AM  

Note that if you use the SET DEFAULT option and the referencing column has a non-null default value that does not have a corresponding value in the referenced table, you will get an error when issuing the triggering action. For example, if you delete the DUMMY customer from Customers and then update GNDLF's customerid to GLDRL, you will get an error. The UPDATE triggers a SET DEFAULT action that attempts to assign GNDLF's original orders with the DUMMY customer ID that does not have a corresponding row in Customers:

DELETE FROM Customers
WHERE customerid = 'DUMMY'
UPDATE Customers
  SET customerid = 'GLDRL'
WHERE customerid = 'GNDLF'

Here is the error message:

.Net SqlClient Data Provider: Msg 547, Level 16, State 0, Line 1
UPDATE statement conflicted with COLUMN FOREIGN KEY constraint 'FK_Orders_Customers'. The conflict occurred in database 'tempdb', table 'Customers', column 'customerid'.
The statement has been terminated.

You can find more information about foreign keys, including their defined referential actions, by querying the sys.foreign_keys catalog view.
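
For example, the following query (a quick sketch) returns the actions defined for the foreign key created earlier:

SELECT name,
  delete_referential_action_desc AS on_delete,
  update_referential_action_desc AS on_update
FROM sys.foreign_keys
WHERE name = 'FK_Orders_Customers'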

Enhancements for Performance and Error Handling

This section covers enhancements that address performance issues in previous versions of SQL Server, increase your data loading capabilities, and dramatically improve your error management capabilities. These enhancements include the BULK rowset provider and the TRY...CATCH error-handling construct.

BULK Rowset Provider

BULK is a new rowset provider, specified in the OPENROWSET function, that allows you to access file data in a relational format. To retrieve data from a file, you specify the BULK option, the name of the file, and a format file created either with bcp.exe or manually. You can specify names for the result columns in the parentheses following the alias of the table returned from OPENROWSET.

Here is the new syntax of all the options that you can specify with OPENROWSET:

OPENROWSET
( { 'provider_name'
    , { 'datasource' ; 'user_id' ; 'password' | 'provider_string' }
    , { [ catalog. ] [ schema. ] object | 'query' }
  | BULK 'data_filename',
    { FORMATFILE = 'format_file_path' [ , <bulk_options> ]
    | SINGLE_BLOB | SINGLE_CLOB | SINGLE_NCLOB }
  }
)

<bulk_options> ::=
  [ , CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' } ]
  [ , FIRSTROW = first_row ]
  [ , LASTROW = last_row ]
  [ , ROWS_PER_BATCH = rows_per_batch ]
  [ , MAXERRORS = max_errors ]
  [ , ERRORFILE = 'file_name' ]

For example, the following query returns three columns from the text file 'c:\temp\textfile1.txt' and provides the column aliases col1, col2, and col3 to the result columns:

SELECT col1, col2, col3
  FROM OPENROWSET(BULK 'c:\temp\textfile1.txt',
         FORMATFILE = 'c:\temp\textfile1.fmt') AS C(col1, col2, col3)

Note that when you use the BULK option, you must also specify a format file unless you use the SINGLE_BLOB, SINGLE_CLOB, or SINGLE_NCLOB options, which are described later in this paper. Therefore, there is no need to specify a data file type, field terminator, or row terminator. Other options that you can optionally specify along with FORMATFILE include: CODEPAGE, FIRSTROW, LASTROW, ROWS_PER_BATCH, MAXERRORS, and ERRORFILE. Most of these options were available with the BULK INSERT command in SQL Server 2000. The ERRORFILE option is conceptually new. This file contains zero or more rows that have formatting errors (that is, rows that cannot be converted to an OLEDB rowset) from an input data file. These rows are copied from the data file "as is" into the error file. Once the errors are fixed, the data is already in the desired format and can easily be reloaded using the same commands. The error file is created at the beginning of the command execution, and an error is raised if the file already exists. By looking at the rows in this file, it is easy to identify the rows that failed, but there is no way to know the cause of the failure. To address this, a control file with the extension .ERROR.txt is created automatically. This file references each row in ERRORFILE and provides error diagnostics.
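
For example, the following sketch extends the earlier query (reusing the same hypothetical file names, plus an assumed error file name) to tolerate up to 10 badly formatted rows and divert them to an error file:

SELECT col1, col2, col3
FROM OPENROWSET(BULK 'c:\temp\textfile1.txt',
       FORMATFILE = 'c:\temp\textfile1.fmt',
       MAXERRORS  = 10,
       ERRORFILE  = 'c:\temp\textfile1_badrows.txt') AS C(col1, col2, col3)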

You can populate a table with the results returned from OPENROWSET using the BULK rowset provider, and specify table options for the bulk load operation. For example, the following code loads the results of the previous query to the table MyTable, requesting to disable constraint checking in the target table:

INSERT INTO MyTable WITH (IGNORE_CONSTRAINTS)
  SELECT col1, col2, col3
  FROM OPENROWSET(BULK 'c:\temp\textfile1.txt',
         FORMATFILE = 'c:\temp\textfile1.fmt') AS C(col1, col2, col3)

Besides the IGNORE_CONSTRAINTS option, other table hints that you can specify in the load operation include: BULK_KEEPIDENTITY, BULK_KEEPNULLS, and IGNORE_TRIGGERS.

You can also use the BULK provider to return file data as a single column value of a large object type by specifying one of the options: SINGLE_CLOB for character data, SINGLE_NCLOB for Unicode data, and SINGLE_BLOB for binary data. When you use one of these options, you don't specify a format file. You can load a file (using an INSERT or an UPDATE statement) into a large object column of one of the following data types: VARCHAR(MAX), NVARCHAR(MAX), VARBINARY(MAX), or XML. You can find details on the MAX specifier for variable-length columns and on the XML data type later in this paper.

As an example of loading a file into a large column, the following UPDATE statement loads the text file 'c:\temp\textfile101.txt' into the column txt_data in the table CustomerData for customer 101:

UPDATE CustomerData
  SET txt_data  = (SELECT txt_data FROM OPENROWSET(
    BULK 'c:\temp\textfile101.txt', SINGLE_CLOB) AS F(txt_data))
WHERE custid = 101

Note that only one large column can be updated at a time.

The following example shows how to load a binary file for customer 102 into a large column using an INSERT statement:

INSERT INTO CustomerData(custid, binary_data)
  SELECT 102 AS custid, binary_data
  FROM OPENROWSET(
    BULK 'c:\temp\binfile102.dat', SINGLE_BLOB) AS F(binary_data)

Exception Handling

SQL Server 2005 Beta 2 introduces a simple but very powerful exception-handling mechanism in the form of a TRY...CATCH Transact-SQL construct.

Previous versions of SQL Server required you to include error-handling code after every statement suspected of causing an error. To centralize error-checking code, you had to use labels and GOTO statements. Furthermore, errors such as data type conversion errors caused a batch to terminate, so you could not trap them with Transact-SQL. SQL Server 2005 Beta 2 addresses many of these issues.

Errors that used to cause a batch to terminate can now be caught and handled, provided that those errors do not cause severance of the connection (typically errors with severity 21 and up, such as table or database integrity suspect, hardware errors, and so on).
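
For example, the following minimal sketch traps a data type conversion error that, in earlier releases, would have terminated the batch:

BEGIN TRY
  DECLARE @i AS INT
  SET @i = 'a'  -- fails: cannot convert 'a' to INT
  PRINT 'Not reached.'
END TRY
BEGIN CATCH
  PRINT 'Conversion error trapped: '
    + CAST(ERROR_NUMBER() AS VARCHAR(10))
END CATCH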

Write the code you want to execute within a BEGIN TRY/END TRY block and follow with the error-handling code in a BEGIN CATCH/END CATCH block. Note that a TRY block must have a corresponding CATCH block; otherwise, you will get a syntax error. As a simple example, consider the following Employees table:

CREATE TABLE Employees
(
  empid   INT         NOT NULL,
  empname VARCHAR(25) NOT NULL,
  mgrid   INT         NULL,
  /* other columns */
  CONSTRAINT PK_Employees PRIMARY KEY(empid),
  CONSTRAINT CHK_Employees_empid CHECK(empid > 0),
  CONSTRAINT FK_Employees_Employees
    FOREIGN KEY(mgrid) REFERENCES Employees(empid)
)

You want to write code that inserts a new employee row into the table. You also want to respond to a failure situation with some corrective activity. Use the new TRY...CATCH construct as follows:

BEGIN TRY
  INSERT INTO Employees(empid, empname, mgrid)
     VALUES(1, 'Emp1', NULL)

  PRINT 'After INSERT.'
END TRY
BEGIN CATCH
  PRINT 'INSERT failed.'
  /* perform corrective activity */
END CATCH

When you run this code for the first time, you should get the output 'After INSERT.' When you run it a second time, you should get the output 'INSERT failed.'

If the code within the TRY block completes with no errors, control is passed to the first statement following the corresponding CATCH block. When a statement within the TRY block fails, control is passed to the first statement within the corresponding CATCH block. Note that if an error is trapped by a CATCH block, it is not returned to the calling application. If you also want the application to get the error information, you must provide the information to the application yourself (for example, using RAISERROR or as a result set of a query). All of the error information is available to you in the CATCH block by means of four new functions: ERROR_NUMBER(), ERROR_MESSAGE(), ERROR_SEVERITY(), and ERROR_STATE(). These functions can be queried multiple times, anywhere you like in the CATCH block, and their values remain the same. This is in contrast to the @@error function, which is affected by any statement other than DECLARE, so it must be queried in the first statement of the CATCH block. ERROR_NUMBER() can be used as an alternative to @@error, while the other three functions give you the rest of the information exactly as it was generated by the error. Such information was not available in SQL Server releases earlier than SQL Server 2005.
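
For example, here is a sketch of a CATCH block that passes the error back to the calling application by re-raising it (RAISERROR requires the message text in a variable):

BEGIN TRY
  INSERT INTO Employees(empid, empname, mgrid)
    VALUES(1, 'Emp1', NULL)
END TRY
BEGIN CATCH
  DECLARE @errmsg   AS NVARCHAR(2047),
          @severity AS INT,
          @state    AS INT
  SELECT @errmsg   = ERROR_MESSAGE(),
         @severity = ERROR_SEVERITY(),
         @state    = ERROR_STATE()
  -- Re-raise the original error so the application receives it.
  RAISERROR(@errmsg, @severity, @state)
END CATCH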

If an unhandled error is generated in a batch or routine (stored procedure, trigger, user defined function, dynamic code), and an upper level of code invoked that batch or routine within a TRY block, control is passed to the upper level's corresponding CATCH block. If the upper level did not invoke the inner level within a TRY block, SQL Server will keep looking for a TRY block in upper levels in the call stack, and will pass control to the CATCH block of the first TRY...CATCH construct found. If none is found, the error is returned to the calling application.
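
The following sketch illustrates this propagation, using a hypothetical procedure usp_inner that contains no error handling of its own:

CREATE PROC dbo.usp_inner
AS
-- No TRY...CATCH here; any error propagates to the caller.
INSERT INTO Employees(empid, empname, mgrid) VALUES(1, 'Emp1', NULL)
GO
BEGIN TRY
  EXEC dbo.usp_inner  -- fails if employee 1 already exists
END TRY
BEGIN CATCH
  PRINT 'Error ' + CAST(ERROR_NUMBER() AS VARCHAR(10))
    + ' trapped at the outer level.'
END CATCH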

As a more detailed example, the following code reacts differently depending on the type of error that caused the failure, and also prints messages indicating which parts of the code have been activated:

PRINT 'Before TRY...CATCH block.'
BEGIN TRY
  PRINT '  Entering TRY block.'
  INSERT INTO Employees(empid, empname, mgrid) VALUES(2, 'Emp2', 1)
  PRINT '    After INSERT.'
  PRINT '  Exiting TRY block.'
END TRY

BEGIN CATCH
  PRINT '  Entering CATCH block.'
  IF ERROR_NUMBER() = 2627
  BEGIN
    PRINT '    Handling PK violation...'
  END
  ELSE IF ERROR_NUMBER() = 547
  BEGIN
    PRINT '    Handling CHECK/FK constraint violation...'
  END
  ELSE IF ERROR_NUMBER() = 515
  BEGIN
    PRINT '    Handling NULL violation...'
  END
  ELSE IF ERROR_NUMBER() = 245
  BEGIN
    PRINT '    Handling conversion error...'
  END
  ELSE
  BEGIN
    PRINT '    Handling unknown error...'
  END
  PRINT '    Error Number:   ' + CAST(ERROR_NUMBER() AS VARCHAR(10))
  PRINT '    Error Message:  ' + ERROR_MESSAGE()
  PRINT '    Error Severity: ' + CAST(ERROR_SEVERITY() AS VARCHAR(10))
  PRINT '    Error State   : ' + CAST(ERROR_STATE() AS VARCHAR(10))
  PRINT '  Exiting CATCH block.'
END CATCH
PRINT 'After TRY...CATCH block.'

Notice that the ERROR_NUMBER() function is invoked multiple times in the CATCH block, and it always returns the number of the error that caused control to pass to the CATCH block. This code inserts employee 2 as a subordinate to the previously inserted employee 1 and should complete with no errors when run for the first time, generating the following output:

Before TRY...CATCH block.
  Entering TRY block.

    After INSERT.
  Exiting TRY block.
After TRY...CATCH block.

Notice that the CATCH block was skipped. Running this code for the second time should generate the following output:

Before TRY...CATCH block.
  Entering TRY block.
  Entering CATCH block.
    Handling PK violation...
    Error Number:   2627
    Error Message:  Violation of PRIMARY KEY constraint 'PK_Employees'. Cannot insert duplicate key in object 'Employees'.
    Error Severity: 14
    Error State   : 1
  Exiting CATCH block.
After TRY...CATCH block.

Notice that the TRY block was entered but not completed. As a result of the primary key violation error, control was passed to the CATCH block, which identified the error and handled it. Similarly, if you assign a value that is not valid employee ID data, such as 0 (which violates the CHECK constraint), NULL (which is not allowed in empid), or 'a' (which cannot be converted to INT), you get the appropriate error, and the appropriate handling code is activated.

If you are using explicit transactions in the TRY block, you might want to investigate the transaction state in your error-handling code in the CATCH block to determine a course of action. SQL Server 2005 provides the new function XACT_STATE(), which returns the transaction state. Possible return values from the function are 0, 1, and -1. A return value of 0 means that no transaction is open; an attempt to commit or roll back the transaction would generate an error. A return value of 1 means that a transaction is open and can be either committed or rolled back; you determine which, depending on your needs and your error-handling logic. A return value of -1 means that a transaction is open but is in an uncommittable state, which is a new transaction state introduced in SQL Server 2005. A transaction within a TRY block enters an uncommittable state when an error is generated that would otherwise cause the transaction to be aborted (typically, severity 17 or higher). An uncommittable transaction keeps all open locks and allows you only to read data. You cannot submit any activity that requires writes to the transaction log, meaning you cannot change data while the transaction is in an uncommittable state. To terminate the transaction, you must issue a rollback; you cannot commit it. The following example demonstrates how to use the XACT_STATE() function:

BEGIN TRY
  BEGIN TRAN
    INSERT INTO Employees(empid, empname, mgrid) VALUES(3, 'Emp3', 1)

    /* other activity */
  COMMIT TRAN
  PRINT 'Code completed successfully.'
END TRY
BEGIN CATCH
  PRINT 'Error: ' + CAST(ERROR_NUMBER() AS VARCHAR(10)) + ' found.'
  IF (XACT_STATE()) = -1
  BEGIN
     PRINT 'Transaction is open but uncommittable.'
     /* ...investigate data... */
     ROLLBACK TRANSACTION -- can only ROLLBACK
     /* ...handle the error... */
  END
  ELSE IF (XACT_STATE()) = 1
  BEGIN
     PRINT 'Transaction is open and committable.'
     /* ...handle error... */
     COMMIT TRANSACTION -- or ROLLBACK
  END
  ELSE
  BEGIN
     PRINT 'No open transaction.'
     /* ...handle error... */
  END
END CATCH

The TRY block submits code within an explicit transaction. It inserts a new employee row and performs some other activity within the same transaction. The CATCH block prints the error number and investigates the transaction state to determine a course of action. If the transaction is open and uncommittable, the CATCH block investigates the data, rolls back the transaction, and then takes any corrective measures that require data modifications. If the transaction is open and committable, the CATCH block handles the error and commits (or could possibly roll back). If no transaction is open, the error is handled. No commit or rollback is issued. If you run this code for the first time, a new employee row for employee 3 is inserted and the code completes successfully with the following output:

Code completed successfully.

If you run this code for the second time, a primary key violation error is generated, and you get the following output:

Error: 2627 found.
Transaction is open and committable.

Other SQL Server 2005 Beta 2 Capabilities That Affect Transact-SQL

This section briefly describes other enhancements in SQL Server 2005 Beta 2 that affect Transact-SQL. These include enhancements to TOP, data manipulation language (DML) with results, the MAX specifier for dynamic columns, XML and XQuery, data definition language (DDL) triggers, and DDL and system event notifications.

TOP Enhancements

In SQL Server version 7.0 and SQL Server 2000, the TOP option allows you to limit the number or percentage of rows returned by a SELECT query; however, you must provide a constant as an argument. In SQL Server 2005 Beta 2, TOP is enhanced in the following major ways:

  • You can now specify a numeric expression returning the number or percentage of rows to be affected by your query, optionally using variables and subqueries.
  • You can now use the TOP option in DELETE, UPDATE, and INSERT queries.

The new syntax for queries using the TOP option is:

SELECT [TOP (<expression>) [PERCENT] [WITH TIES]]
FROM <table_name>...[ORDER BY...]
DELETE [TOP (<expression>) [PERCENT]] FROM <table_name> ...
UPDATE [TOP (<expression>) [PERCENT]] <table_name> SET ...
INSERT [TOP (<expression>) [PERCENT]] INTO <table_name> ...

The numeric expression must be specified in parentheses. Specifying constants without parentheses is supported in SELECT queries only for backward compatibility. The expression must be self-contained—if you use a subquery, it cannot refer to columns of the table in the outer query. If you do not specify the PERCENT option, the expression must be implicitly convertible to the bigint data type. If you specify the PERCENT option, the expression must be implicitly convertible to float and fall within the range of 0 to 100. The WITH TIES option and the ORDER BY clause are supported only with SELECT queries.
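
Because neither PERCENT nor WITH TIES appears in the examples that follow, here is a minimal sketch of both options, using the same AdventureWorks table as the later examples; it returns the most recent ten percent of purchase orders, including any orders tied with the last qualifying order date:

USE AdventureWorks
SELECT TOP(10) PERCENT WITH TIES *
FROM Purchasing.PurchaseOrderHeader
ORDER BY OrderDate DESC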

The following code, for example, uses a variable as an argument to the TOP option, returning the specified number of most recent purchase orders:

USE AdventureWorks
DECLARE @n AS BIGINT
SET @n = 2
SELECT TOP(@n) *
FROM Purchasing.PurchaseOrderHeader
ORDER BY OrderDate DESC

This enhancement is especially useful when you get the number of requested rows as an argument to a stored procedure or a user-defined function. By using a self-contained subquery, you can answer dynamic requests such as "calculate the average number of monthly orders and return that many of the most recent orders:"

USE AdventureWorks
SELECT TOP(SELECT
             COUNT(*)/DATEDIFF(month, MIN(OrderDate), MAX(OrderDate))
           FROM Purchasing.PurchaseOrderHeader) *
FROM Purchasing.PurchaseOrderHeader
ORDER BY OrderDate DESC

The SET ROWCOUNT option in earlier versions of SQL Server allows you to limit the number of rows affected by a query. For example, SET ROWCOUNT is commonly used to periodically purge large amounts of data in several small transactions instead of a single large transaction:

SET ROWCOUNT 1000
DELETE FROM BigTable WHERE datetimecol < '20000101'
WHILE @@rowcount > 0
  DELETE FROM BigTable WHERE datetimecol < '20000101'
SET ROWCOUNT 0

Using SET ROWCOUNT in this way allows the transaction log to be backed up and recycled during the purging process, and can also prevent lock escalation. Instead of using SET ROWCOUNT, you can now use TOP this way:

DELETE TOP(1000) FROM BigTable WHERE datetimecol < '20000101'
WHILE @@rowcount > 0
  DELETE TOP(1000) FROM BigTable WHERE datetimecol < '20000101'

Unlike SET ROWCOUNT, which is external to the query, the TOP option lets the optimizer see whether TOP is used at all and what the "row goal" is, allowing it to produce more efficient plans.

Although you might think that using TOP in INSERT statements is not required because you can always specify it in the SELECT query, you might find it useful when inserting the result of an EXEC command or the result of a UNION operation, for example:

INSERT TOP ... INTO ...
  EXEC ...
INSERT TOP ... INTO ...
  SELECT ... FROM T1
  UNION ALL
  SELECT ... FROM T2
  ORDER BY ...
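
As a concrete sketch of the UNION case, the following code limits the number of rows inserted from the unified result set; the OrdersEU, OrdersUS, and RecentOrders tables are hypothetical:

USE tempdb
CREATE TABLE OrdersEU(orderid INT, orderdate DATETIME)
CREATE TABLE OrdersUS(orderid INT, orderdate DATETIME)
CREATE TABLE RecentOrders(orderid INT, orderdate DATETIME)

-- Insert at most 10 rows from the unified result set.
INSERT TOP(10) INTO RecentOrders
  SELECT orderid, orderdate FROM OrdersEU
  UNION ALL
  SELECT orderid, orderdate FROM OrdersUS
  ORDER BY orderdate DESC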

DML with Results

SQL Server 2005 introduces a new OUTPUT clause that allows you to return data from a modification statement (INSERT, UPDATE, DELETE) into a table variable. Useful scenarios for DML with results include purging and archiving, message-processing applications, and others. The syntax of the new OUTPUT clause is:

OUTPUT <dml_select_list> INTO @table_variable

You access the old/new image of the modified rows by referring to inserted and deleted tables similar to the way you do with triggers. In an INSERT statement, you are only allowed to access the inserted table. In a DELETE statement, you are only allowed to access the deleted table. In an UPDATE statement, you are allowed to access both the inserted and the deleted tables.
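
For example, the following minimal sketch (assuming a hypothetical Products table with productid and price columns) captures both the old and the new price of every updated row in a table variable:

DECLARE @PriceChanges TABLE(productid INT, oldprice MONEY, newprice MONEY)

UPDATE Products
  SET price = price * 1.10
  OUTPUT inserted.productid, deleted.price, inserted.price
  INTO @PriceChanges

SELECT * FROM @PriceChanges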

As an example of a purging and archiving scenario where DML with results can be useful, suppose you have a large Orders table, and that you periodically want to purge historic data. You also want to copy the purged data into an archive table called OrdersArchive. You declare a table variable called @DeletedOrders and enter a loop in which you delete historic data (say orders earlier than 2003) in chunks using the purging method described in the "TOP Enhancements" section earlier in this paper. The addition here is the OUTPUT clause which copies all attributes of all deleted rows into the @DeletedOrders table variable, and then, using an INSERT INTO statement, copies all rows from the table variable into the OrdersArchive table:

DECLARE @DeletedOrders TABLE
(
  orderid   INT,
  orderdate DATETIME,
  empid     INT,
  custid    VARCHAR(5),
  qty       INT
)
WHILE 1=1
BEGIN
  BEGIN TRAN
    DELETE TOP(5000) FROM Orders
      OUTPUT deleted.* INTO @DeletedOrders
    WHERE orderdate < '20030101'
    INSERT INTO OrdersArchive
      SELECT * FROM @DeletedOrders
  COMMIT TRAN
  DELETE FROM @DeletedOrders
  IF @@rowcount < 5000
    BREAK
END

As an example of a message-processing scenario, consider the following Messages table:

USE tempdb
CREATE TABLE Messages
(
  msgid   INT          NOT NULL IDENTITY,
  msgdate DATETIME     NOT NULL DEFAULT(GETDATE()),
  msg     VARCHAR(MAX) NOT NULL,
  status  VARCHAR(20)  NOT NULL DEFAULT('new'),
  CONSTRAINT PK_Messages 
    PRIMARY KEY NONCLUSTERED(msgid),
  CONSTRAINT UNQ_Messages_status_msgid 
    UNIQUE CLUSTERED(status, msgid),
  CONSTRAINT CHK_Messages_status
    CHECK (status IN('new', 'open', 'done'))
)

For each message, you store a message ID, entry date, message text, and a status indicating whether the message wasn't processed yet ("new"), is being processed ("open"), or is already processed ("done").

The following code simulates a session that generates messages by using a loop that inserts a message with random text every second. The status of newly inserted messages is "new," since the status column is assigned the default value 'new'. Run this code from multiple sessions at the same time:

USE tempdb
SET NOCOUNT ON
DECLARE @msg AS VARCHAR(MAX)
WHILE 1=1
BEGIN
  SET @msg = 'msg' + RIGHT('000000000'
    + CAST(CAST(RAND()*2000000000 AS INT)+1 AS VARCHAR(10)), 10)
  INSERT INTO dbo.Messages(msg) VALUES(@msg)
  WAITFOR DELAY '00:00:01';
END

The following code simulates a session that processes messages using the following steps:

  1. Form an endless loop that constantly processes messages.
  2. Lock one available new message using an UPDATE TOP(1) statement with the READPAST hint to skip locked rows, and change its status to "open."
  3. Store the message attributes in the @Msgs table variable using the OUTPUT clause.
  4. Process the message.
  5. Set the message status to "done" by joining the Messages table and the @Msgs table variable.
  6. If no new message was found in the Messages table, wait for one second.

Run this code from multiple sessions:

USE tempdb
SET NOCOUNT ON
DECLARE @Msgs TABLE(msgid INT, msgdate DATETIME, msg VARCHAR(MAX))
WHILE 1 = 1
BEGIN
  UPDATE TOP(1) Messages WITH(READPAST) SET status = 'open'
    OUTPUT inserted.msgid, inserted.msgdate, inserted.msg
    INTO @Msgs
  WHERE status = 'new'
  IF @@rowcount > 0
  BEGIN
    PRINT 'Processing message...'
    -- process message here
    SELECT * FROM @Msgs
    UPDATE M
      SET status = 'done'
    FROM Messages AS M
      JOIN @Msgs AS N
        ON M.msgid = N.msgid;
    DELETE FROM @Msgs
  END
  ELSE
  BEGIN
    PRINT 'No messages to process.'
    WAITFOR DELAY '00:00:01'
  END
END

Once you are done running the simulation, stop all sessions that insert and process messages and drop the Messages table:

USE tempdb
DROP TABLE Messages

MAX Specifier for Dynamic Columns

SQL Server 2005 enhances the capabilities of the variable-length data types VARCHAR, NVARCHAR, and VARBINARY by introducing the MAX specifier, using the syntax <datatype>(MAX). A variable-length data type with the MAX specifier replaces the large object data types TEXT, NTEXT, and IMAGE and provides enhanced functionality. There are several advantages to this substitution. There is no need for explicit pointer manipulation, because SQL Server internally determines when to store values inline and when to use a pointer. You can also use a unified programming model for small and large data: variable-length data types with the MAX specifier are supported in columns, variables, parameters, comparisons, triggers, all string functions, and so on.
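
As a quick sketch of the unified programming model, the following code manipulates a VARCHAR(MAX) variable with regular string functions; the initial CAST to VARCHAR(MAX) ensures that REPLICATE can return more than 8,000 characters:

DECLARE @big AS VARCHAR(MAX)
SET @big = REPLICATE(CAST('abcd' AS VARCHAR(MAX)), 100000)
SELECT LEN(@big) AS length -- returns 400000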

As an example of using the MAX specifier, the following code creates a table called CustomerData:

CREATE TABLE CustomerData
(
  custid      INT            NOT NULL PRIMARY KEY,
  txt_data    VARCHAR(MAX)   NULL,
  ntxt_data   NVARCHAR(MAX)  NULL,
  binary_data VARBINARY(MAX) NULL
)

The table contains the column custid, which is used as the primary key, and the nullable columns txt_data, ntxt_data, and binary_data defined with the data types VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX) respectively where you can store large data.

In order to read a chunk from a dynamic column with the MAX specifier, you use the SUBSTRING function the same way you do with regular dynamic columns. In order to update a chunk, you use an enhanced syntax of the UPDATE statement, which now provides you with a WRITE method. The syntax of the enhanced UPDATE statement is:

UPDATE table_name
SET column_name.WRITE(@chunk, @offset, @len)
WHERE ...

The WRITE method removes @len characters starting at the @offset position and inserts @chunk at that position. Note that @offset is zero-based, meaning that offset 0 represents the position of the first character in the existing column value. To demonstrate the use of the WRITE method, first insert a row into the CustomerData table with customer ID 102 and the value 'Customer 102 text data' in the txt_data column:

INSERT INTO CustomerData(custid,txt_data)
  VALUES(102, 'Customer 102 text data')

The following UPDATE statement replaces '102' with 'one hundred and two':

UPDATE CustomerData
  SET txt_data.WRITE('one hundred and two', 9, 3)
WHERE custid = 102

When @chunk is NULL, @len is ignored, and the value is truncated at the @offset position. The following statement removes all data from offset 28 to the end:

UPDATE CustomerData
  SET txt_data.WRITE(NULL, 28, 0)
WHERE custid = 102

When @len is NULL, all characters from @offset to the end are removed, and @chunk is appended. The following statement removes all data from offset 9 to the end and appends '102':

UPDATE CustomerData
  SET txt_data.WRITE('102', 9, NULL)
WHERE custid = 102

When @offset is NULL, @len is ignored, and @chunk is appended at the end. The following statement appends the string ' is discontinued' at the end:

UPDATE CustomerData
  SET txt_data.WRITE(' is discontinued', NULL, 0)
WHERE custid = 102
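
To verify the result, you can read back a chunk with SUBSTRING, as mentioned earlier; the following query returns the first eight characters of the updated value ('Customer'):

SELECT SUBSTRING(txt_data, 1, 8) AS chunk
FROM CustomerData
WHERE custid = 102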

XML and XQuery

SQL Server 2005 Beta 2 introduces several XML-related enhancements that allow you to store, query, and update XML-structured data natively. You can store both XML and relational data in the same database, leveraging the existing database engine for storage and query processing.

A new xml data type is introduced. The xml data type can be used for table columns and can even be indexed. The xml data type can also be used in variables, views, functions, and stored procedures. The xml data type can be generated by relational FOR XML queries or accessed as a relational rowset using OPENXML. You can import schemas into your database or export schemas out of it. You can use schemas to validate and constrain your XML data. You can query and modify XML-typed data by using XQuery. The xml data type is supported in triggers, replication, bulk copy, DBCC, and full-text search. However, xml is not comparable, meaning that you cannot define a PRIMARY KEY, UNIQUE, or FOREIGN KEY constraint on an xml column.

The following examples use the xml data type. The following code defines an XML variable called @x and loads customer order data into it:

USE AdventureWorks
DECLARE @x AS XML
SET @x = (SELECT C.CustomerID, O.SalesOrderID 
          FROM Sales.Customer C 
            JOIN Sales.SalesOrderHeader O 
              ON C.CustomerID=O.CustomerID 
          ORDER BY C.CustomerID
          FOR XML AUTO, TYPE)
SELECT @x

The following code creates a table with an xml column and bulk loads an XML file into it by using the OPENROWSET function:

CREATE TABLE T1
(
  keycol INT NOT NULL PRIMARY KEY,
  xmldoc XML NULL
)

INSERT INTO T1(keycol, xmldoc)
  SELECT 1 AS keycol, xmldoc
  FROM OPENROWSET(BULK 'C:\documents\mydoc.xml', SINGLE_NCLOB)
       AS X(xmldoc)

SQL Server 2005 Beta 2 also introduces support for XQuery, a W3C standard XML query language. Microsoft provides extensions to the standard in SQL Server that allow using XQuery for inserts, updates, and deletes. XQuery is embedded in Transact-SQL by means of user-defined type (UDT) style methods.

XQuery provides the following querying methods:

  • Manipulating XML data: @x.query (xquery string) returning XML
  • Checking existence: @x.exist (xquery string) returning bit
  • Returning a scalar value: @x.value (xquery string, sql_type string) returning sql_type

XQuery provides the following modification method: @x.modify (xDML string).

As an example, a table called Jobs contains XML-formatted job information in a column called jobinfo. The following query returns the ID and XML data after some manipulation for every row that meets some criteria. The method jobinfo.exist() is invoked in the WHERE clause to filter the rows you want. It returns 1 only for rows whose jobinfo column contains the edited element with a date attribute that is greater than the Transact-SQL variable @date. For each returned row, an XML result is generated by invoking the jobinfo.query() method. For every job element found in jobinfo, the query() method generates a jobschedule element, with an id attribute based on the id attribute of the job, and begin and end subelements with data based on the start and end attributes from jobinfo:

SELECT id, jobinfo.query(
  'for $j in //job 
    return
      <jobschedule id="{$j/@id}">      
        <begin>{data($j/@start)}</begin>
        <end>{data($j/@end)}</end>
      </jobschedule>')
FROM Jobs
WHERE 1 = jobinfo.exist(
  '//edited[@date > sql:variable("@date")]')

The following invocation of the value() method returns the start attribute of the first job in jobinfo in a Transact-SQL datetime format:

SELECT id, jobinfo.value('(//job)[1]/@start', 'DATETIME') AS startdt
FROM Jobs
WHERE id = 1

XQuery can also be used to modify data. For example, you use the following code to update the empinfo XML column in the Employees table for employee 1. You update the date attribute of the edited subelement of the resume element to a new value:

UPDATE Employees SET empinfo.modify(
   'update /resume/edited/@date 
    to xs:date("2000-6-20")')
WHERE empid = 1

DDL Triggers

In earlier versions of SQL Server, you can define AFTER triggers only for DML statements (INSERT, UPDATE, and DELETE) issued against a table. SQL Server 2005 Beta 2 allows you to define triggers for DDL events with a scope of the entire server or database. You can define a DDL trigger for an individual DDL statement such as CREATE_TABLE, or for a group of statements such as DDL_DATABASE_LEVEL_EVENTS. Within the trigger, you can get data regarding the event that fired it by accessing the eventdata() function. This function returns XML data about the event. The schema for each event inherits the Server Events base schema.

Event information includes:

  • When the event took place.
  • The SPID from which the event was issued.
  • The type of event.
  • The affected object.
  • SET options.
  • The Transact-SQL statement that fired it.

Similar to triggers in earlier versions of SQL Server, DDL triggers run in the context of the transaction that fired them. If you decide to undo the event that fired the trigger, you can issue a ROLLBACK statement. For example, the following trigger prevents new tables from being created in the current database:

CREATE TRIGGER trg_capture_create_table ON DATABASE FOR CREATE_TABLE
AS
-- Print event information for debugging.
PRINT 'CREATE TABLE Issued'
PRINT CAST(EventData() AS NVARCHAR(MAX))
-- Can investigate data returned by EventData() and react accordingly.
RAISERROR('New tables cannot be created in this database.', 16, 1)
ROLLBACK
GO

If you issue a CREATE TABLE statement within the database in which the trigger was created, you should get the following output:

CREATE TABLE T1(col1 INT)
CREATE TABLE Issued
<EVENT_INSTANCE>
  <PostTime>2003-04-17T13:55:47.093</PostTime>
  <SPID>53</SPID>
  <EventType>CREATE_TABLE</EventType>
  <Database>testdb</Database>
  <Schema>dbo</Schema>
  <Object>T1</Object>
  <ObjectType>TABLE</ObjectType>
  <TSQLCommand>
    <SetOptions ANSI_NULLS="ON" ANSI_NULL_DEFAULT="ON" ANSI_PADDING="ON"
                QUOTED_IDENTIFIER="ON" ENCRYPTED="FALSE" />
    <CommandText>CREATE TABLE T1(col1 INT)</CommandText>
  </TSQLCommand>
</EVENT_INSTANCE>

.Net SqlClient Data Provider: Msg 50000, Level 16, State 1, Procedure trg_capture_create_table, Line 10
New tables cannot be created in this database.
.Net SqlClient Data Provider: Msg 3609, Level 16, State 1, Line 1
Transaction ended in trigger. Batch has been aborted.

Note that the XML output was manually formatted in this paper for readability purposes. You would get unformatted XML output when you run this code.

To drop the trigger, issue the following statement:

DROP TRIGGER trg_capture_create_table ON DATABASE

Scenarios where DDL triggers are especially useful include integrity checks for DDL changes and auditing scenarios, among others. As an example of DDL integrity enforcement, the following database-level trigger rejects attempts to create a table without a primary key:

CREATE TRIGGER trg_create_table_with_pk ON DATABASE FOR CREATE_TABLE
AS
DECLARE @eventdata AS XML, @objectname AS NVARCHAR(257),
  @msg AS NVARCHAR(500)

SET @eventdata = eventdata()
SET @objectname = 
  N'[' + CAST(@eventdata.query('data(//SchemaName)') AS SYSNAME)
  + N'].[' + 
  CAST(@eventdata.query('data(//ObjectName)') AS SYSNAME) + N']'
IF OBJECTPROPERTY(OBJECT_ID(@objectname), 'TableHasPrimaryKey') = 0
BEGIN
  SET @msg = N'Table ' + @objectname + ' does not contain a primary key.'
    + CHAR(10) + N'Table creation rolled back.'
  RAISERROR(@msg, 16, 1)
  ROLLBACK
  RETURN
END

The trigger is fired when a CREATE TABLE statement is issued. Using XQuery, the trigger extracts the schema and object names and, using the OBJECTPROPERTY function, checks whether the table contains a primary key. If it doesn't, the trigger generates an error and rolls back the transaction. After creating the trigger, the following attempt to create a table without a primary key fails:

CREATE TABLE T1(col1 INT NOT NULL)
Msg 50000, Level 16, State 1, Procedure trg_create_table_with_pk, Line 19
Table [dbo].[T1] does not contain a primary key.
Table creation rolled back.
Msg 3609, Level 16, State 2, Line 1
Transaction ended in trigger. Batch has been aborted.

And the following succeeds:

CREATE TABLE T1(col1 INT NOT NULL PRIMARY KEY)

To drop the trigger and the table T1, run the following code:

DROP TRIGGER trg_create_table_with_pk ON DATABASE
DROP TABLE T1

As an example of an auditing trigger, the following database-level trigger audits all DDL statements to the AuditDDLEvents table:

CREATE TABLE AuditDDLEvents
(
  LSN              INT      NOT NULL IDENTITY,
  posttime         DATETIME NOT NULL,
  eventtype        SYSNAME  NOT NULL,
  loginname        SYSNAME  NOT NULL,
  schemaname       SYSNAME  NOT NULL,
  objectname       SYSNAME  NOT NULL,
  targetobjectname SYSNAME  NOT NULL,
  eventdata        XML      NOT NULL,
  CONSTRAINT PK_AuditDDLEvents PRIMARY KEY(LSN)
)
GO
CREATE TRIGGER trg_audit_ddl_events ON DATABASE FOR DDL_DATABASE_LEVEL_EVENTS
AS
DECLARE @eventdata AS XML
SET @eventdata = eventdata()
INSERT INTO dbo.AuditDDLEvents(
  posttime, eventtype, loginname, schemaname, 
  objectname, targetobjectname, eventdata)
  VALUES(
    CAST(@eventdata.query('data(//PostTime)')         AS VARCHAR(23)),
    CAST(@eventdata.query('data(//EventType)')        AS SYSNAME),
    CAST(@eventdata.query('data(//LoginName)')        AS SYSNAME),
    CAST(@eventdata.query('data(//SchemaName)')       AS SYSNAME),
    CAST(@eventdata.query('data(//ObjectName)')       AS SYSNAME),
    CAST(@eventdata.query('data(//TargetObjectName)') AS SYSNAME),
    @eventdata)
GO

The trigger simply extracts all event attributes of interest from the eventdata() function using XQuery, and inserts those into the AuditDDLEvents table. To test the trigger, submit a few DDL statements and query the audit table:

CREATE TABLE T1(col1 INT NOT NULL PRIMARY KEY)
ALTER TABLE T1 ADD col2 INT NULL
ALTER TABLE T1 ALTER COLUMN col2 INT NOT NULL
CREATE NONCLUSTERED INDEX idx1 ON T1(col2)
SELECT * FROM AuditDDLEvents

To check who changed the schema of table T1 in the last 24 hours and how they changed it, run the following query:

SELECT posttime, eventtype, loginname,
  CAST(eventdata.query('data(//TSQLCommand)') AS NVARCHAR(2000))
  AS tsqlcommand
FROM dbo.AuditDDLEvents
WHERE schemaname = N'dbo' AND N'T1' IN(objectname, targetobjectname)
ORDER BY posttime

To drop the trigger and the tables you just created, run the following code:

DROP TRIGGER trg_audit_ddl_events ON DATABASE
DROP TABLE dbo.T1
DROP TABLE dbo.AuditDDLEvents

As an example of a server-level audit trigger, the following trigger audits all DDL login-related events to an audit table called AuditDDLLogins:

USE master
CREATE TABLE dbo.AuditDDLLogins
(
  LSN              INT      NOT NULL IDENTITY,
  posttime         DATETIME NOT NULL,
  eventtype        SYSNAME  NOT NULL,
  loginname        SYSNAME  NOT NULL,
  objectname       SYSNAME  NOT NULL,
  logintype        SYSNAME  NOT NULL,
  eventdata        XML      NOT NULL,
  CONSTRAINT PK_AuditDDLLogins PRIMARY KEY(LSN)
)
CREATE TRIGGER audit_ddl_logins ON ALL SERVER
  FOR DDL_LOGIN_EVENTS
AS
DECLARE @eventdata AS XML
SET @eventdata = eventdata()
INSERT INTO master.dbo.AuditDDLLogins(
  posttime, eventtype, loginname,
  objectname, logintype, eventdata)
  VALUES(
    CAST(@eventdata.query('data(//PostTime)')         AS VARCHAR(23)),
    CAST(@eventdata.query('data(//EventType)')        AS SYSNAME),
    CAST(@eventdata.query('data(//LoginName)')        AS SYSNAME),
    CAST(@eventdata.query('data(//ObjectName)')       AS SYSNAME),
    CAST(@eventdata.query('data(//LoginType)')        AS SYSNAME),
    @eventdata)
GO

To test the trigger, issue the following DDL login statements which create, alter, and drop a login, then query the audit table:

CREATE LOGIN login1 WITH PASSWORD = '123'
ALTER LOGIN login1 WITH PASSWORD = 'xyz'
DROP LOGIN login1
SELECT * FROM AuditDDLLogins

To drop the trigger, the audit table, and the testdb database used earlier, run the following code:

DROP TRIGGER audit_ddl_logins ON ALL SERVER
DROP TABLE dbo.AuditDDLLogins
DROP DATABASE testdb

DDL and System Event Notifications

SQL Server 2005 Beta 2 allows you to capture DDL and system events and send an event notification to a Service Broker service. Whereas triggers are processed synchronously, event notifications are an event delivery mechanism that allows asynchronous consumption. An event notification sends XML data to a specified Service Broker service, and event consumers consume it asynchronously. An event consumer can wait for new data to arrive by using extensions to the WAITFOR clause.

An event notification is defined by:

  • Scope (SERVER, DATABASE, ASSEMBLY, individual object)
  • List of events or event groups (for example, CREATE_TABLE, DDL_EVENTS, and so on)
  • Service name that implements the SQL Server Events message type and contract

Event data is sent in XML format using the SQL Server Events schema. The general syntax for creating an event notification is:

CREATE EVENT NOTIFICATION <name>
  ON <scope>
  FOR <list_of_event_or_event_groups>
  TO SERVICE <service_name>

When an event notification is created, a Service Broker conversation is established between a system service and the service specified by the user. The <service_name> specifies the Service Broker service with which SQL Server opens a conversation to deliver data about an event. The service specified must implement the SQL Server Events message type and contract. When an event occurs for which an event notification exists, an XML message is constructed from the pertinent event data and sent through the event notification's conversation to the specified service.

For example, the following code creates a table called T1 and defines an event notification that sends a notice to a certain service every time the T1 table's schema is altered:

CREATE TABLE dbo.T1(col1 INT);
GO
-- Create a queue.
CREATE QUEUE SchemaChangeQueue;
GO
--Create a service on the queue that references
--the event notifications contract.
CREATE SERVICE SchemaChangeService
  ON QUEUE SchemaChangeQueue
(
[//s.ms.net/SQL/Notifications/PostEventNotification/v1.0]
);
GO
--Create a route on the service to define the address 
--to which Service Broker sends messages for the service.
CREATE ROUTE SchemaChangeRoute
  WITH SERVICE_NAME = 'SchemaChangeService',
  ADDRESS = 'LOCAL';
GO
--Create the event notification.
CREATE EVENT NOTIFICATION NotifySchemaChangeT1
  ON TABLE dbo.T1
  FOR ALTER_TABLE TO SERVICE [SchemaChangeService];

The following ALTER would then cause an XML message to be sent to SchemaChangeService, which is built on the SchemaChangeQueue:

ALTER TABLE dbo.T1 ADD col2 INT;

The XML message could then be retrieved from the queue with the following statement:

RECEIVE TOP (1) CAST(message_body AS nvarchar(MAX)) 
FROM SchemaChangeQueue

The resulting output would be as follows (minus the formatting):

<EVENT_INSTANCE>
<PostTime>2004-06-15T11:16:32.963</PostTime>
<SPID>55</SPID>
<EventType>ALTER_TABLE</EventType>
<ServerName>MATRIX\S1</ServerName>
<LoginName>MATRIX\Gandalf</LoginName>
<UserName>MATRIX\Gandalf</UserName>
<DatabaseName>testdb</DatabaseName>
<SchemaName>dbo</SchemaName>
<ObjectName>T1</ObjectName>
<ObjectType>TABLE</ObjectType>
<TSQLCommand>
<SetOptions ANSI_NULLS="ON" ANSI_NULL_DEFAULT="ON" ANSI_PADDING="ON" QUOTED_IDENTIFIER="ON" ENCRYPTED="FALSE" />
<CommandText>ALTER TABLE dbo.T1 ADD col2 INT;</CommandText>
</TSQLCommand>
</EVENT_INSTANCE>

The WAITFOR statement can be used to receive notifications in blocking mode as follows:

WAITFOR (RECEIVE * FROM SchemaChangeQueue)
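
A common variation (a sketch, using the SchemaChangeQueue created earlier) combines RECEIVE with a timeout so that the waiting session wakes up periodically; TIMEOUT is specified in milliseconds:

WAITFOR (
  RECEIVE TOP(1) CAST(message_body AS NVARCHAR(MAX))
  FROM SchemaChangeQueue
), TIMEOUT 5000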

Conclusion

The Transact-SQL enhancements in SQL Server 2005 Beta 2 increase your expressive power in query writing, allow you to improve the performance of your code, and extend your error management capabilities. The continuous effort being put into enhancing Transact-SQL reflects a firm belief in its significant role in SQL Server, in its power, and in its future.