This article may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist. To maintain the flow of the article, we've left these URLs in the text, but disabled the links.

Dr. Tom's Workshop: A Small Fish in a Big Pond

Tom Moreau

In Dr. Tom Moreau's March 2003 column—"Sleeping with Elephants"—he explained how to handle situations where data is heavily skewed to one value. This month, he has a go at the other end of the spectrum—getting at the small minority.

Have you ever had a table with a flag column—one that takes on only two values, say 0 or 1? Sure you have. When you consider indexing a table, the flag column is usually overlooked because it isn't selective. What if the data in the flag column is heavily skewed? For example, what if only half a percent takes on the value 1, and the rest are 0? This might occur, for example, in a customers table with a flag to indicate whether the person is a staff member. Since it isn't likely that the majority of your customers are staff members, if you want to get a phone list for your staff members, you could be looking at a table scan.

If you find yourself frequently going after the rows with the flags set to 1, you're likely to get a performance hit unless you also include another, indexed column in the search criteria. Perhaps you are indeed looking for all rows where the flag is set. Well, there is a solution here, and it takes one of the features of SQL Server 2000 to do it.

Indexed views allow you to store not just the query, but also to materialize the data that the view represents. Even better, the optimizer is smart enough to figure out that, even if your query doesn't directly reference the view, it can still use the view in order to satisfy your query. In other words, if you run a SELECT against the base tables used by the indexed view, the optimizer can choose to use the view instead. Likely, your view is smaller than the underlying tables, since it has filter criteria (as specified in the WHERE clause), and it may also have fewer columns. Either way, that translates into less data to scan. However, the data does take up disk space in addition to the space occupied by the underlying base tables.

To demonstrate this, I put together a Customers table (see Listing 1) with a flag called IsEmployee. In the build script (included in the accompanying Download), I skewed the distribution of the values of this flag such that only half a percent took on the value of 1, with the remainder set to 0.

Listing 1. Table layout for the Customers table.

  create table Customers
(
   CustomerID int           not null
                            identity
,  Title      char (5)      not null
,  FirstName  varchar (25)  not null
,  LastName   varchar (30)  not null
,  Number     int           not null
,  Street     varchar (50)  not null
,  City       varchar (50)  not null
,  StateCode  char (2)      not null
,  Phone      char (10)     not null
,  IsEmployee bit           not null
,  Padding    char (20)     not null
              default (replicate (' ', 20))
)
go
create clustered index C_Customers
on Customers (LastName, FirstName)
go
alter table Customers
add
    constraint PK_Customers
    primary key nonclustered (CustomerID)
go
create index NC1_Customers
on Customers (StateCode)
go

I populated the table with 3.2 million rows, yielding a fairly large table 1GB in size, of which 660MB was data. A little over 16,000 rows had the IsEmployee flag set to 1. Then I ran the following query to generate the employee phone list:

  select LastName, FirstName, Phone
from dbo.Customers
where IsEmployee = cast (1 as bit)
order by LastName, FirstName

Since you'd often be going after the phone list, it makes sense to put the query into a view. This doesn't speed up data retrieval—well, at least not yet. If you create the view with the SCHEMABINDING option, you have fulfilled one of many requirements to make the view "indexable," but there are other requirements—no outer joins or subqueries, for example. Despite these limitations, you're still good to go if all you want is your phone list. Your view in Listing 2 is ready to be indexed.

Listing 2. Build script for Employees view.

  create view Employees
with schemabinding
as
select
  CustomerID
, LastName
, FirstName
, Phone
from
    dbo.Customers
where
  IsEmployee = cast (1 as bit)
go

Note that although IsEmployee is part of the WHERE clause, it isn't needed in the SELECT list—unless, of course, you actually want to return it to the user. The next step is to materialize the view by creating a unique clustered index. Yes, the clustered index has to be unique—another requirement for the indexed view. (You can add nonclustered indexes once you've built the clustered one.) In this case, the CustomerID is unique, so that's a good candidate. Since you'd presumably be sorting this on the employee name, I've created a nonclustered index on (LastName, FirstName)—see Listing 3. I ran sp_spaceused on it and got 1.1MB total, with 680KB for just the data. That would fit on a floppy!

Listing 3. Index script for Employees view.

  create unique clustered index C_Employees 
on Employees (CustomerID)
create index NC1_Employees on Employees
(LastName, FirstName)

Once you've done this, re-run the query and then look at the query plan. Here, you see that the nonclustered index of the Employees view is referenced—not the Customers table itself. The response time is a lot better, too. The indexed view solution took a mere 1,433 milliseconds (ms), whereas the base tables version took 66,133 ms. Sweet! (I freed the proc and data caches before each run to level the playing field.)

To get things right, you have to realize that if your query references a column in the base tables that doesn't appear in the view, the query will revert to the base tables. Check out what happens when you change the original query to add the StateCode:

  select LastName, FirstName, Phone, StateCode
from dbo.Customers
where IsEmployee = cast (1 as bit)
order by LastName, FirstName

Now you're back to the old base tables—together with the poor performance. (The query took about 66,486 ms on my system.) The optimizer won't use the view in conjunction with a bookmark lookup on the corresponding rows in the base table. That's why it's always essential to run a complete load test with a representative script to ferret out any queries that would miss out on using your indexed view. What if you really were interested in the StateCode for each employee? Simple—just add the StateCode to the view.

Another thing you have to be careful with is how you handle SARGs in your WHERE clause. In this example, IsEmployee is a bit datatype. The constant—1—can be taken as bit, tinyint, smallint, int, or bigint. If you want to be sure that the optimizer uses the correct datatype—and thus use the indexed view—then make sure you cast it to bit in this case. Fail to do this and you're back to base tables.

Microsoft did its homework on this. You don't have to grant permission to access the view, unless you're directly accessing that view. Here's what I mean. Let's say that you're running your employee phone list query against the Customers table. Even though the optimizer wisely decides to use the Employees view, you don't need permission to access Employees in this case. However, if you explicitly did the SELECT against Employees, then you would need SELECT permission. Very nice.

A word about design

There is a way to avoid going the indexed view route—as long as you have control over the design. In our scenario, there are far fewer employees than there are regular customers. Besides, employees really are different from customers. You should model them separately and thus keep them in separate tables. For example, employees process orders, report to managers, have pay records, and so on. Customers don't—at least as far as your enterprise is concerned.

You may be asking yourself, "What if I just put an index on IsEmployee in the Customers table?" First, the index isn't really selective, since IsEmployee can only take on the values 0 and 1, which means that the index isn't likely to be used. Second, even if the index were used, this same index now has to be maintained for all of the rows in the Customers table—not just those for your employees. In the indexed view, you're maintaining indexes just for employees. These don't get updated for the other 99.5 percent of your data—the regular customers.

However, life isn't always that simple. If you're dealing with a vendor app, you may not be allowed to change the design—but you will be able to use an indexed view to supplement what's there. This has happened to me more than once.

Bear in mind that if the connection from the SQL client doesn't have the right SET options set, you'll be dead in the water. (Check out "SET Options That Affect Results" in the Books Online [BOL] for further details on those SET options.) For example, set ANSI_NULLS to OFF and then run the phone list query. Now you're back to one-minute response time where the optimizer doesn't consider the indexed view. That's not all. If you now try to modify any data in the Customers table, you'll get a 1934 error, telling you that the update failed because the necessary option wasn't set. Be aware that the SQL Server OLE-DB provider and ODBC driver set the options for you automatically, while DB-LIB and Embedded SQL for C do not. Bottom line: Use OLE-DB, and you'll avoid these little problems completely. Remember that vendor app I mentioned earlier? In every case, they were using DB-LIB (can you believe it?), so I was scuppered [defeated, done in—Ed.] until they upgraded to OLE-DB.

If I had just one wish...

Coming back to that design where everyone—customer or employee—is in the one table, I did see something I'd put on my wish list for a future release of SQL Server. Let's face it—since the indexed view is materialized, it's really acting like a table. After all, you can index it. Well, why not let us make a foreign key reference to it? That way, you could enforce the business rule that only employees can process an order, whereas both employees and customers can place orders.

I'd also like the optimizer to be able to use the indexed view to service most of the query—and then use a bookmark lookup on the base tables when necessary to pick up the remaining data where appropriate. Remember that earlier example where adding the StateCode to the SELECT list caused the optimizer to scan the base table? This would get you around that. Okay, that was two wishes, so sue me!

Well, that's it for this month. Let me know if—and how—you're using indexed views in your enterprise.

Download 402TOM.ZIP

  • 829444 A Floating Point Exception Occurs During the Optimization of a Query
  • 322002 HOW TO: Enforce Uniqueness For All Non-Null Values (ANSI UNIQUE CONSTRAINT)
  • 311145 BUG: Update of Column That is Part of an Indexed View May Cause a Handled Access Violation
  • 310714 BUG: DBCC DBREINDEX Can Cause Error 8623 on Indexed View
  • 10624 BUG: DBCC DBREINDEX on Indexed View May Cause Handled Access Violation
  • 305333 PRB: Error "INSERT Failed" When You Update Table Referenced in an Indexed View
  • 30080 BUG: Property Changes in the Merging Changes Tab of an Indexed View Article Are Not Saved
  • 270054 PRB: Indexed Views Can Be Created on All Versions of SQL Server 2000

To find out more about SQL Server Professional and Pinnacle Publishing, visit their Web site at http://www.pinpub.com/

Note: This is not a Microsoft Corporation Web site. Microsoft is not responsible for its content.

This article is reproduced from the thismonth 2004 issue of SQL Server Professional. Copyright 2004, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. SQL Server Professional is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call 1-800-788-1900.