This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

Dr. Tom's Workshop: Multiple-Child Aggregation

Tom Moreau

Aggregation, which is commonly used to deliver reports to users, often also delivers poor performance. To help guarantee that you're delivering the best performance possible, you should explore different versions of the query and pick the best one. This month, look over Tom Moreau's shoulder as he shows you how to work with multiple-child aggregations.

This month's column was inspired by a discussion in one of the public newsgroups that got me to pondering, once again, the fact that there are often many ways to solve a given problem. From the business standpoint, you need to deliver the rows and columns as required in the spec. If you're facing a deadline, chances are you'll crank code, often resorting to a coding method that's worked in the past–most of the time, anyway. Hey, if it executes, that's good. If it gives users the information they want, that's a bonus. Just deliver it.

	I've been in the IT business for a long time, and I've witnessed the "Just deliver it" mentality all too often. I suspect that its proponents don't understand that users are more likely to remember poor quality and/or poor performance long after they've forgotten a missed deadline. Sure, deadlines are important, but reasonable people will usually accommodate a tardy delivery as long as you can demonstrate the benefits of doing so.

Parent-child/child scenario

Let's tackle a common scenario where you have a parent-child relationship with two child tables–Child1 and Child2. You want a report of all of the primary keys from the Parent table, along with counts of the corresponding rows–if any–from Child1 and Child2. You need to accommodate the fact that there may be no matching rows in either of the two child tables, and you also want to make sure you don't double-count. Listing 1 provides the build script for this month's sample. I've included some padding columns so we can obtain reasonably realistic test results.

Listing 1. Build script for Parent, Child1, and Child2.

create table dbo.Parent
(
  ParentID int primary key
, Padding char(500) not null
)
go
create table dbo.Child1
(
  ChildID int primary key nonclustered identity
, ParentID int not null references dbo.Parent
, Padding char(500) not null
)
go
create clustered index C_Child1 on dbo.Child1 (ParentID)
go
create table dbo.Child2
(
  ChildID int primary key nonclustered identity
, ParentID int not null references dbo.Parent
, Padding char(500) not null
)
go
create clustered index C_Child2 on dbo.Child2 (ParentID)

	On to the data! I pulled out the code for dbo.GenRows(), a table-valued user-defined function from my November 2003 column [included in this month's download–Ed.] and used dbo.GenRows to populate Parent with 1,000 rows. Child1 and Child2 are next on the agenda, and, again, we want something realistic. It's highly unlikely, for example, that we'd have the exact same number of rows in each of the child tables for the same ParentID, and it would also be cool to have a random distribution of rows per ParentID in each table.

	Since we need to preserve Referential Integrity, we have to include the Parent table in our population script for Child1 and Child2. If a CROSS JOIN comes to mind, congratulations! Take a peek at Listing 2, and then we'll walk through the code.

Listing 2. Population script for the Parent, Child1, and Child2 tables.

insert dbo.Parent
select RowNum, replicate ('P', 500)
from dbo.GenRows (1000)
go
insert dbo.Child1 (ParentID, Padding)
select p.ParentID, replicate ('C1', 250)
from dbo.Parent p
CROSS JOIN dbo.GenRows (100) g
WHERE cast (rand (cast (NEWID () as binary(16))) * 100 as int)
    + p.ParentID < g.RowNum + p.ParentID
go
insert dbo.Child2 (ParentID, Padding)
select p.ParentID, replicate ('C2', 250)
from dbo.Parent p
CROSS JOIN dbo.GenRows (100) g
WHERE cast (rand (cast (NEWID () as binary(16))) * 100 as int)
    + p.ParentID < g.RowNum + p.ParentID
  and p.ParentID % 33 <> 0
go
dbcc dbreindex ('dbo.Child1')
dbcc dbreindex ('dbo.Child2')

	I used a CROSS JOIN between the Parent table and dbo.GenRows(), feeding it the maximum number of rows we want per ParentID. And, to avoid obtaining the same number of rows for every ParentID, I used filtering.

	Notice the WHERE clauses. NEWID() generates a 16-byte uniqueidentifier for each row, which is then converted to binary(16) to serve as a seed value for the RAND() function. (If you fail to provide a fresh seed value for every row, you'll simply get the same value for every row.) We're still not out of the woods, though. RAND() produces a random number between 0 and 1. By multiplying it by the maximum number of rows per ParentID, the same number that was fed to dbo.GenRows(), you get a number you can compare with the value of RowNum. If the random number is smaller, you insert the row; if it's not, you don't. This gives you a random distribution of rows per ParentID. Curious why p.ParentID appears on both sides of the < sign? Without it, RAND() would have been evaluated only once per row in dbo.GenRows(), yielding the same number of rows for every ParentID.
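To see why this filter yields a random spread, here's a minimal Python simulation of the WHERE clause. It's a sketch, not SQL Server: Python's random() stands in for the per-row seeded RAND(), and the inner loop stands in for the rows produced by the CROSS JOIN with dbo.GenRows().

```python
import random

MAX_ROWS = 100  # the same value fed to dbo.GenRows()

def rows_for_parent() -> int:
    """Count how many candidate rows 1..MAX_ROWS survive the filter
    for one parent.

    For each candidate row number, draw a fresh random number in
    [0, MAX_ROWS) and keep the row when the draw is smaller --
    mirroring cast(rand(...) * 100 as int) < g.RowNum.
    """
    kept = 0
    for row_num in range(1, MAX_ROWS + 1):
        if int(random.random() * MAX_ROWS) < row_num:
            kept += 1
    return kept

# Simulate 1,000 parents, as in the population script.
counts = [rows_for_parent() for _ in range(1000)]
print(min(counts), max(counts), sum(counts) / len(counts))
```

Each parent ends up with its own random number of child rows, somewhere between 0 and MAX_ROWS, averaging around half the maximum.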

	The two child tables are also populated slightly differently. In the case of Child2, I added an extra predicate to the WHERE clause, filtering out about 3 percent of the rows. This ensured that not every ParentID in Parent was represented in the Child2 table.

	Since both child tables were clustered on ParentID, there was definitely going to be some fragmentation after the tables were populated. That's why I ran DBCC DBREINDEX on each of them once they were populated. In a real-world scenario where you have very large tables, it would behoove you to populate them before adding PRIMARY KEY constraints and indexes.

	We're finally good to go. Let's start with a correlated subquery approach. The correlation will be in the SELECT list, and we'll have one for each of the two child tables, as shown in Listing 3.

Listing 3. Rows per ParentID for each child table–correlated subquery.

select p.ParentID
, (select count (*) from dbo.Child1 c1
   where c1.ParentID = p.ParentID) as ChildCounts1
, (select count (*) from dbo.Child2 c2
   where c2.ParentID = p.ParentID) as ChildCounts2
from dbo.Parent p
order by p.ParentID

	This scans through Parent and, for every ParentID, runs a correlated subquery against Child1 and Child2 (independently of each other), counting up the corresponding hits from the child tables. In the event that there are no corresponding rows–as seen in Child2 for values of ParentID = 33, 66, 99, and so on–the aggregates return 0.
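If you want to convince yourself that the aggregate comes back as 0 rather than NULL, the pattern is easy to try outside SQL Server. This sketch uses SQLite via Python's sqlite3 module (the tables mirror ours, minus the dbo schema and the Padding columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table Parent (ParentID int primary key);
    create table Child2 (ChildID integer primary key, ParentID int);
    insert into Parent values (33);
    -- deliberately no Child2 rows for ParentID 33
""")

# The scalar subquery aggregates over zero rows, and COUNT(*)
# over zero rows is 0, not NULL.
row = con.execute("""
    select p.ParentID,
           (select count(*) from Child2 c2
            where c2.ParentID = p.ParentID) as ChildCounts2
    from Parent p
""").fetchone()
print(row)  # (33, 0)
```

That's why Listing 3 needs no ISNULL() wrapper around the subqueries.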

	Another approach involves LEFT JOINs on each of Child1 and Child2. Check out Listing 4.

Listing 4. Rows per ParentID for each child table–LEFT JOINs with COUNT (DISTINCT).

select p.ParentID
, count (distinct c1.ChildID) as Child1Counts
, count (distinct c2.ChildID) as Child2Counts
from dbo.Parent p
left join dbo.Child1 c1 on c1.ParentID = p.ParentID
left join dbo.Child2 c2 on c2.ParentID = p.ParentID
group by p.ParentID
order by p.ParentID

	Why the COUNT (DISTINCT) and not a COUNT (*) or COUNT (ChildID)? Think, for a moment, about what happens in a JOIN. For a given ParentID in Parent, you'd have n rows in Child1 and m rows in Child2, and joining both onto Parent produces m x n intermediate rows for that ParentID. Counting them without the DISTINCT would therefore give you m x n for both Child1 and Child2. That's why the DISTINCT keyword is essential here.
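Here's a toy reproduction of the double-count, again in SQLite via Python's sqlite3 module rather than SQL Server, with n = 3 rows in Child1 and m = 2 rows in Child2 for a single parent:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table Parent (ParentID int primary key);
    create table Child1 (ChildID integer primary key, ParentID int);
    create table Child2 (ChildID integer primary key, ParentID int);
    insert into Parent values (1);
    insert into Child1 (ParentID) values (1), (1), (1);  -- n = 3
    insert into Child2 (ParentID) values (1), (1);       -- m = 2
""")

# Without DISTINCT: each Child1 row pairs with each Child2 row,
# so both counts come back as m x n = 6.
wrong = con.execute("""
    select p.ParentID, count(c1.ChildID), count(c2.ChildID)
    from Parent p
    left join Child1 c1 on c1.ParentID = p.ParentID
    left join Child2 c2 on c2.ParentID = p.ParentID
    group by p.ParentID
""").fetchone()

# With DISTINCT: the duplicated ChildIDs collapse back to 3 and 2.
right = con.execute("""
    select p.ParentID, count(distinct c1.ChildID),
           count(distinct c2.ChildID)
    from Parent p
    left join Child1 c1 on c1.ParentID = p.ParentID
    left join Child2 c2 on c2.ParentID = p.ParentID
    group by p.ParentID
""").fetchone()

print(wrong)   # (1, 6, 6)
print(right)   # (1, 3, 2)
```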

	Finally, let's try one of my personal favorites–derived tables. Just as in Listing 4, you'll be using LEFT JOINs, but in a different way. Take a look at Listing 5.

Listing 5. Rows per ParentID for each child table–LEFT JOINs with derived tables.

select p.ParentID
, isnull (c1.Child1Counts, 0) as Child1Counts
, isnull (c2.Child2Counts, 0) as Child2Counts
from dbo.Parent p
left join
( select ParentID, count (*) as Child1Counts
  from dbo.Child1
  group by ParentID
) c1 on c1.ParentID = p.ParentID
left join
( select ParentID, count (*) as Child2Counts
  from dbo.Child2
  group by ParentID
) c2 on c2.ParentID = p.ParentID
order by p.ParentID

	Each derived table calculates a COUNT (*) for each ParentID, thanks to the GROUP BY, and is then LEFT JOINed onto Parent through ParentID. The ISNULL()s are necessary, since there may be no row in a derived table corresponding to a given ParentID in Parent.
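The same pattern is easy to demonstrate in miniature. This sketch uses SQLite via Python's sqlite3 module, with SQLite's ifnull() standing in for T-SQL's ISNULL():

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table Parent (ParentID int primary key);
    create table Child1 (ChildID integer primary key, ParentID int);
    insert into Parent values (1), (2);
    insert into Child1 (ParentID) values (1), (1);  -- parent 2 is childless
""")

# Aggregate first, inside the derived table; join second. The childless
# parent survives the LEFT JOIN with a NULL count, which ifnull() turns
# into the 0 the report needs.
rows = con.execute("""
    select p.ParentID, ifnull(c1.Child1Counts, 0) as Child1Counts
    from Parent p
    left join
    ( select ParentID, count(*) as Child1Counts
      from Child1
      group by ParentID
    ) c1 on c1.ParentID = p.ParentID
    order by p.ParentID
""").fetchall()
print(rows)  # [(1, 2), (2, 0)]
```

Because each child table is aggregated before the join, there's no m x n blow-up here, which is the key difference from Listing 4.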

	At the beginning of this piece, I talked about performance (as I often do). Let's see how each of these performs as you change the relative number of rows in the two child tables. I'll keep the number of rows in Parent constant at 1,000 and the maximum number of rows per ParentID in Child1 constant at 100, while varying the maximum number of rows per ParentID in Child2. I'll present the number of logical reads (captured by setting STATISTICS IO to ON), relative query costs, and query duration for each of the three solutions. Check out Tables 1-4.

Table 1. Performance with up to 10 rows per ParentID in Child2.

            Logical reads   Relative query cost (%)   Duration (ms)
Listing 3   222             3.29                      69
Listing 4   274,721         93.41                     1,981
Listing 5   222             3.29                      70

Table 2. Performance with up to 50 rows per ParentID in Child2.

            Logical reads   Relative query cost (%)   Duration (ms)
Listing 3   274             0.94                      78
Listing 4   1,264,542       98.12                     8,343
Listing 5   274             0.94                      78

Table 3. Performance with up to 100 rows per ParentID in Child2.

            Logical reads   Relative query cost (%)   Duration (ms)
Listing 3   340             0.54                      100
Listing 4   2,506,889       98.92                     16,409
Listing 5   340             0.54                      102

Table 4. Performance with up to 200 rows per ParentID in Child2.

            Logical reads   Relative query cost (%)   Duration (ms)
Listing 3   471             0.29                      144
Listing 4   4,967,119       99.42                     32,715
Listing 5   471             0.29                      143

	Each test involved 100 iterations, and the average was calculated. I cleared the data cache after each iteration, using DBCC DROPCLEANBUFFERS. Listing 4 is the clear loser: its logical reads and duration shoot up almost linearly with the number of rows in Child2, and its relative query cost is in the high 90s across the board. Another interesting result is that the numbers for Listings 3 and 5 are essentially identical.

	Looking at the graphical SHOWPLAN output, the query plans for Listings 3 and 5 are identical. Listing 4, however, is much busier, with two Table Spool/Eager Spool operations. These involve creating a work table in tempdb behind the scenes, thus giving SQL Server more work to do.

	So, coming back to my preaching at the beginning of this article, what if you were used to doing things only with JOINs and first elected to implement Listing 4? Your users got their rows and columns, right? The results were correct, right? Yeah, but the performance was glacial, right? That's why it really is worthwhile to look at STATISTICS IO (and the graphical query plan), just to see if things are performing as they should.

	Another thing to consider is the much-maligned correlated subquery. Those who tell you that JOINs are always faster than correlated subqueries need to reconsider their stance. Indeed, you've seen here that the correlated subquery outperformed one JOIN solution, while matching the performance of the other.

Download 501TOM.ZIP

To find out more about SQL Server Professional and Pinnacle Publishing, visit their Web site at http://www.pinpub.com/

Note: This is not a Microsoft Corporation Web site. Microsoft is not responsible for its content.

This article is reproduced from the January 2005 issue of SQL Server Professional. Copyright 2005, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. SQL Server Professional is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call 1-800-788-1900.