This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

Trees in SQL Server

R.L. Parker

Many applications—both commercial and homegrown—have to deal with hierarchical, tree-structured data. However, because of their recursive nature, trees can be intimidating to model, program, and display. While SQL Server doesn't provide native support for hierarchical data (like Oracle does via its START WITH…CONNECT BY clause, for example), it's not that hard to add. R.L. Parker explains.

If you're not exactly sure what I mean by hierarchical, tree-structured data, consider the Northwind database's Employees table. In it, a Northwind employee's boss is identified by a value in the ReportsTo column. That makes it easy to find out who reports to a given manager. If Mary Manager's EmployeeID is "1," then her direct reports are identified by:

  SELECT FirstName, LastName, EmployeeID 
FROM Employee, 
WHERE ReportsTo =  1

But Mary might be a mid-level manager. Suppose you also want to know who reports to the people who report to her? And the people who report to them, and so on, recursively, until you get to the bottom of the tree where Dilbert lives? Hold that thought—we'll come back to it.

Figure 1 shows part of the logical model for the Northwind database. The Employees table has a recursive relationship to itself because the ReportsTo column has a foreign-key relationship to the EmployeeID column. Some people call this kind of relationship in a model a "fishhook."

Another way of describing this situation is that each row in Employees might have many related child rows (the row's direct reports) in the Employees table. And, of course, one of the child rows might have, in turn, many related children of its own. This kind of recursive, hierarchical data is often depicted as a tree structure, as shown in the org chart in Figure 2.

In this article, we're going to develop a simple application that shows how to model, program, and display this kind of data. But first, here's a quick review of the terminology and some rules about data trees:

  1. An element of data in the tree is called a node.
  2. A node has either one parent or none.
  3. If a node doesn't have a parent, it's called a root node.
  4. If a node has a parent, the node is referred to as the child of the parent node.
  5. A node can have an arbitrary number of children.
  6. A node can't be the parent of itself.

One of the things we'll need to do in our database design is make sure that these rules are observed.

The application

The application we're going to implement is designed to track and display a list of tasks. Each task has a name, a description, a type, and a status. A given task can have zero or more subtasks.

The display requirement is that, for a given task, the application must be able to show the task and its related task tree. Figure 3 shows a possible implementation of the solution, implemented as an ASP.NET application, and Figure 4 shows a logical model for a table that will support our Task application. It should look familiar, as it follows the same pattern that the Employees table does.

Listing 1 includes the script for the Task table, including its constraints.

Listing 1. The Task table.

  CREATE TABLE [dbo].[Task] (
  [ID] [int] IDENTITY (1000, 1) NOT NULL, 
  [Name] [varchar] (16) 
  COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
  [Description] [varchar] (128) 
  COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
  [Type] [tinyint] NOT NULL,
  [Status] [tinyint] NOT NULL,
  [ParentID] [int] NULL --line 9
 ) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Task] WITH NOCHECK ADD 
CONSTRAINT [DF_Task_Description] DEFAULT ('None') 
FOR [Description],
CONSTRAINT [DF_Task_Type] DEFAULT (0) FOR [Type],
CONSTRAINT [DF_Task_Status] DEFAULT (0) FOR [Status],
CONSTRAINT [PK_Task] PRIMARY KEY  CLUSTERED 
([ID]) ON [PRIMARY] ,
CONSTRAINT [CK_Task] CHECK ([ID] <> [ParentID]), 
CONSTRAINT [CK_Task_1] CHECK ([Type] >= 0 
and [Type] <= 5),
CONSTRAINT [CK_Task_2] CHECK ([Status] >= 0 
and [Status] <= 2)
GO
ALTER TABLE [dbo].[Task] ADD 
CONSTRAINT [FK_Task_Task] FOREIGN KEY 
 ([ParentID]) REFERENCES [dbo].[Task] ([ID])
GO

Remember the rules about trees? Rule #2 is enforced by the combination of line 9, which allows a node's ParentID to be NULL, and the ALTER TABLE statement at the end that defines a foreign key on the ParentID column. If there's a value in the ParentID column, it must be the primary key of a row in the table. But not just any row—rule #6 says that a node can't be a parent of itself. If the data violated rule #6, it would lead to infinite recursion! The ParentID constraint defined ensures that this doesn't happen.

An iterative solution

Figure 5 shows the raw data for our Task application. All the information that an application needs is there—but it's not in a very useful format!

The main problem with the raw data is that there's no guarantee about the ordering of the nodes. The requirement that the application imposes on the data is simple—the application must see each parent node before seeing any of its children. And, although there's no way to construct a SELECT statement's ORDER BY clause that will guarantee that, we can write a stored procedure and a helper function that will do the job. Listing 2 and Listing 3 define the FindRoot function and BuildTree stored proc, respectively.

Listing 2. The FindRoot function.

  CREATE FUNCTION dbo.FindRoot(@id int)
RETURNS int
AS  
BEGIN 
  DECLARE @parentID int
  SELECT @parentID = parentID
  FROM Task
  WHERE id = @id
  WHILE @parentID <> NULL
    BEGIN
      SELECT @id = @parentID
      SELECT @parentID = parentID
      FROM Task
      WHERE id = @id
    END
  RETURN @id
END

Listing 3. The BuildTree stored procedure.

  CREATE PROCEDURE BuildTree(@id int)
AS 
SET NOCOUNT ON
CREATE TABLE #results(level tinyint, id int, 
name varchar(16), description varchar(128), 
type tinyint, status tinyint, parentID int)
DECLARE @name varchar(16)
DECLARE @description varchar(128)
DECLARE @type tinyint
DECLARE @status tinyint
DECLARE @parentID int
DECLARE @level tinyint
SELECT @level = 1
DECLARE @root int
SELECT @root = dbo.FindRoot(@id)
CREATE TABLE #stack (id int, level smallint)
INSERT INTO #stack VALUES (@root, @level)
WHILE @level > 0
BEGIN
IF EXISTS (SELECT * FROM #stack WHERE level = @level)
  BEGIN
    SELECT @id = s.id, @name = t.name, @description = 
    t.description, @type = t.type, @status = t.status,
    @parentID = IsNull(t.parentID, 0)
    FROM #stack s INNER JOIN Task t
    ON p.id = s.id
    WHERE level = @level
    INSERT INTO #results VALUES (@level, @id, @name, 
    @description, @type, @status, @parentID)
    DELETE FROM #stack
    WHERE level = @level
    AND id = @id
    INSERT #stack
    SELECT id, @level + 1
    FROM Task
    WHERE parentID = @id
    IF @@ROWCOUNT > 0 
      BEGIN
        SELECT @level = @level + 1
    END
 END--IF EXISTS
 ELSE
 BEGIN
   SELECT @level = @level - 1
 END
END -- WHILE
SELECT * FROM #results
ORDER BY level, name 

FindRoot

FindRoot is pretty straightforward—given a primary key for row x in the Task table, it will return the primary key for the root node of the tree that x is in. If the node's ParentID is NULL, then the node is the root, so FindRoot returns x's ID. Otherwise, FindRoot iteratively travels up x's ParentID links until it gets to the root of the tree.

BuildTree

Given the primary key for a row in the Task table, BuildTree will return a resultset that lists the node's entire task family, beginning with the root node of the tree. Each parent node is listed in the resultset before any of its child nodes. Just what the application needs!

BuildTree uses two temporary tables and the FindRoot function. Temp table #stack is used as a LIFO stack that holds the intermediate results as each node is processed. I start by calling the FindRoot helper function and push the root node onto the stack. Then, in the WHILE loop, I pick a row from #stack. After adding the selected row to the resultset, I delete it from the stack, and then push all of its children onto the stack. When the WHILE loop terminates, I've processed all the parent nodes. I return the results to the client by selecting all the rows in #results. I reuse level in the ORDER BY clause to ensure that all parent rows are returned before any of their children.

A set-oriented solution

BuildTree is based on code that you'll find if you type "hierarchical information" in the index of SQL Server's Books Online, but it isn't very efficient. It processes one row in the #stack table for each trip through the WHILE loop.

I wrote another stored procedure called BuildTreeNew that takes a more set-oriented approach to solving the problem. It goes through the WHILE loop once for each level in the tree, processing all nodes at that level simultaneously. It's simpler than BuildTree too, disposing of the #stack table altogether. Instead, it reuses the #results table for intermediate results as it walks down the tree. You'll find it in this month's Download file.

An XML-based solution?

What about XML? XML documents can represent hierarchical data by nesting the elements in the document. Since SQL Server 2000 provides the ability to return data as an XML document instead of a recordset by using the FOR XML clause on the SELECT statement, maybe XML is the way to go?

Using the FOR XML AUTO mode, SQL Server will generate a hierarchical tree. Sounds promising! But, unfortunately for the Task application, the structure of the hierarchy is based on the table names in the SELECT statement. To show you what I mean, I'm going to create a couple of tables that have a simple parent-child relationship and load the tables with a few rows of data (see Listing 4).

Listing 4. A simple parent-child relationship.

  CREATE TABLE [dbo].[Parent] ([ID] [int] 
IDENTITY (1, 1) NOT NULL, [Name] [varchar] (8) COLLATE
SQL_Latin1_General_CP1_CI_AS NOT NULL) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Child] ([ID] [int] 
IDENTITY (1, 1) NOT NULL, {Name] [varchar] (8) COLLATE
SQL_Latin1_General_CP1_CI_AS NOT NULL, 
[ParentID] [int] NULL) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Child] WITH NOCHECK ADD 
  CONSTRAINT [PK_Child] PRIMARY KEY  CLUSTERED 
  ([ID])  ON [PRIMARY] 
GO
ALTER TABLE [dbo].[Parent] WITH NOCHECK ADD 
  CONSTRAINT [PK_Parent] PRIMARY KEY  CLUSTERED 
  ([ID])  ON [PRIMARY] 
GO
insert into Parent values ("P1")
insert into Parent values ("P2")
insert into Parent values ("P3")
insert into Child values ("P1.1",1)
insert into Child values ("P2.1",2)
insert into Child values ("P1.2",1)
insert into Child values ("P3.1",3)
insert into Child values ("P3.2",3)
insert into Child values ("P3.3",3)

Given such a parent-child relationship, here's the SQL that will produce an XML document instead of a resultset.

  SELECT p.name, c.name
FROM Parent p INNER JOIN Child c
ON p.id = c.parentID
ORDER BY p.id
FOR XML AUTO

Here's the resulting XML. I formatted it for the sake of easier readability.

  <p name="P1">
  <c name="P1.1"/>
  <c name="P1.2"/>
</p>
<p name="P2">
  <c name="P2.1"/>
</p>
<p name="P3">
  <c name="P3.1"/>
  <c name="P3.2"/>
  <c name="P3.3"/>
</p>

Pretty cool, but it doesn't really solve my problem for the Task application because I only have one table to select from!

There's another FOR XML alternative: FOR XML EXPLICIT. To use XML EXPLICIT, two special columns must be included in the resultset, and the data columns must be aliased in a special way so that FOR XML EXPLICIT will know how to structure the output. A full explanation of FOR XML EXPLICIT is beyond the scope of this article. But, like FOR XML AUTO, FOR XML EXPLICIT is intended for conventional parent-child relationships rather than for arbitrarily deep tree structures, so it probably isn't the solution I need for the Task application.

Conclusion

Data is sometimes hierarchical in nature. It pays to be able to model, program, and display hierarchical data structures. Although SQL Server 2000 doesn't provide native support for hierarchical data, it's relatively easy to support it with the help of a user-defined function and a set-oriented stored procedure.

Download TREESSQL.TXT

SQL Server Professional

subscribers may also want to (re)read these articles:

  • Mark Scanlon's August 2000 article, "Hierarchical Data and Scope Checking."
  • Nolan Zak's September 2001 piece, "Hierarchical Data and Scope Checking in Detail."
  • Tom Moreau's March 2001 column, "Feeding XML to a Stored Procedure."
  • Anton Britten's February 2002 article, "Getting the XML Back In."
  • Alexander Yankelevich's November 1999 article, "Climbing the Tree."

I'd also encourage Hardcore Visual Basic subscribers to (re)read Rod Stephens' classic columns:

  • "Tree Trimming Revisited" (June 1998).
  • "The Tree Control," Parts 1 and 2 (December 2000 and January 2001).
  • "XmlTree" (February 2002).

—kw

To find out more about SQL Server Professional and Pinnacle Publishing, visit their website at http://www.pinpub.com/html/main.isx?sub=57

Note: This is not a Microsoft Corporation website. Microsoft is not responsible for its content.

This article is reproduced from the February 2003 issue of SQL Server Professional. Copyright 2003, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. SQL Server Professional is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call 1-800-493-4867 x4209.