Relationships Between Sets

This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

T-SQL Black Belt
Set Members and Relationships
Use a shortcut to identify the data you need

Itzik Ben-Gan

Part of a T-SQL programmer's job is to translate application users' requests for information into queries. Requests commonly involve identifying rows or groups of rows that meet some criteria—say, items that share a certain relationship to another group of items. For example, sometimes you need to identify all orders that have the same order parts as another order. I got the idea for this article from a recent problem that SQL Server Magazine reader John Lombardo sent me.

As an introduction to the problems I discuss in this article, let's review a few concepts from Set Theory that represent relationships between sets. I use these concepts to define the criteria for the article's problems. Capital letters specify set names, numbered small letters specify set members, and curly brackets containing set members specify the sets themselves.

Set Theory describes certain relationships that can exist between sets:

  • Set U is equal to set V if all of U's members exist in V and all of V's members exist in U—for example, U = {u1, u2, u3}, V = {u1, u2, u3}.
  • U is a subset of V if all of U's members exist in V. When U is equal to V, then U is also a subset of V and V is also a subset of U.
  • When U is a subset of V but V isn't a subset of U, U is a proper subset of V—for example, U = {u1, u2, u3}, V = {u1, u2, u3, u4}.

The tasks that I discuss in this article involve identifying groups of items, or sets, that have a certain relationship to another group of items. Let's look at an example that sets up the problems.

The Orders and OrderDetails Scenario

The scenario I use involves example Orders and OrderDetails tables, which you can create and populate in tempdb by running the script that Listing 1 shows. In these tables, I included only the columns that are relevant for this discussion—that is, the orderid column in Orders and the orderid and productid columns in OrderDetails. Each order in the Orders table might have zero or more related rows in the OrderDetails table, each containing a different product. Each order is an instance of an entity in its own right, but in this article, I refer to an order as the set of order details that belong to it.

Your application's user enters a set of products representing a new order, which your code stores in the #ProdList temporary table:

  
CREATE TABLE #ProdList(productid int NOT NULL PRIMARY KEY)
INSERT INTO #ProdList VALUES(2)
INSERT INTO #ProdList VALUES(3)
INSERT INTO #ProdList VALUES(4)

You receive several tasks from the marketing department that require you to identify different relationships between the new order and existing ones. Those relationships might be important for the marketing department to identify purchase patterns, to consider discounts for certain groups of products, and so on.

Task 1: P is a subset of O. Your first task is to identify the orders that contain all products in #ProdList. In set terminology, if O represents the set of order details making up an order and P represents the set of products in #ProdList, you're looking for orders for which P is a subset of O. Your query should return orders A and B.

The following query gives you the solution for the task:

  
SELECT orderid
FROM OrderDetails
WHERE productid IN(SELECT productid FROM #ProdList)
GROUP BY orderid
HAVING COUNT(*) = (SELECT COUNT(*) FROM #ProdList)

This code queries the OrderDetails table, filtering for the rows containing only products that exist in #ProdList. The query groups the result by orderid, then filters the groups that have the same count of products as in #ProdList. The query returns only the orders that contain all the products from #ProdList.

Task 2: P is equal to O. Your second task is to identify the orders that contain all the products in #ProdList and no other product. In set terminology, you're looking for orders for which P is equal to O. In the previous query, the WHERE clause removed rows containing products that didn't exist in #ProdList, so the query didn't consider those. This time, you need to consider all rows. To solve this problem, you can first write a query that returns all rows from OrderDetails, appending to the result a column called inlist, which contains 1 for products that exist in #ProdList and -1 for products that don't:

  
SELECT *,
  CASE
    WHEN productid IN(SELECT
      productid FROM #ProdList)
      THEN 1
    ELSE -1
  END AS inlist
FROM OrderDetails

Figure 1 shows the results of this query.

The code in Listing 2 forms a derived table out of the previous query, groups its rows by orderid, and returns only those groups in which the sum of the inlist value is equal to the number of products in #ProdList. For the condition in the HAVING clause to be true, an order must contain all products in #ProdList and no other product, and that's exactly what the request specified. You should get only order B in the result.

Task 3: P is a proper subset of O. Task 3 involves identifying the orders that contain all products from #ProdList and at least one other product. In set terminology, you want all orders for which P is a proper subset of O. You can use a trick similar to the one in the previous query, in which you use a derived table that has all order details and an additional column called inlist. This time, have the inlist column contain 1 for products that exist in #ProdList and NULL for those that don't, as Listing 3 shows.

In the HAVING clause, make sure that the query returns only groups that contain all products from #ProdList by comparing the number of non-NULL inlist values—COUNT(inlist)—to the number of products in #ProdList. By checking that the total number of rows in the group is greater than the number of non-NULL inlist values, you ensure that the returned orders contain at least one product that doesn't exist in #ProdList. You should get only order A in the result.

Task 4: O is a subset of P. Task 4 reverses the roles of P and O. You need to return the orders for which all products exist in #ProdList. In set terminology, you need to return all orders for which O is a subset of P. To achieve this result, you can perform a left outer join between OrderDetails and #ProdList. The result of such a join returns all order details, regardless of whether a match is found in #ProdList, with NULLs in the columns from #ProdList signifying nonmatches. The orders you're looking for are those in which all order details have matches in #ProdList—in other words, orders that have no NULLs in the columns from #ProdList. An easy way to identify those orders is to group the result of the join by orderid and use the HAVING clause to check that the count of all rows in the group is equal to the count of non-null productid values from #ProdList. This query, which Listing 4 shows, should return orders B, C, and D.

Task 5: O is a proper subset of P. The last task is to identify the orders for which all products exist in #ProdList, while #ProdList contains at least one additional product. In set notation, you're looking for orders in which O is a proper subset of P. You need to make only a small addition to the previous query, which returns orders that are subsets (not necessarily proper subsets) of #ProdList. As the last line of Listing 5 shows, you add to the HAVING clause an expression that ensures that the number of products in the order is smaller than the number of products in #ProdList. You should get orders C and D in the result.

What's Next?

In relational algebra, the problems I discussed in this article are called relational division. This month, I showed you solutions that were based on aggregations. Next month, I'll discuss how to solve these problems by using a different approach that's based on correlated subqueries. I'll also explore problems in which you need to identify sets that meet certain collective member criteria, such as identifying the orders that contain either products 1 and 3 or products 2 and 4. Meanwhile, see if you can come up with solutions to this month's example problems by using a different approach than the one I used. For additional reading about relational division in ANSI SQL, see Joe Celko's book SQL for Smarties, 2nd Edition (Morgan Kaufman, 1999).

Bugs, comments, suggestions | Legal | Privacy | Advertising

Copyright © 2003 Penton Media, Inc. All rights reserved.