Using HTML Meta Properties with Microsoft Index Server

 

David Lee
Windows NT Indexing Services Team
Microsoft Corporation

April 30, 1998

Introduction

This is the fourth in a series of articles to help you understand and deploy Microsoft search solutions on your Web sites and intranets. Krishna Nareddy wrote the first three articles. The first article, "Anatomy of a Search Solution," helped you understand what to expect of a search solution to meet your site's needs. The second, "Introduction to Microsoft Index Server," introduced you to the features and capabilities of Index Server. The third, "Indexing with Microsoft Index Server," was designed to help you understand, manage, and fine-tune the indexer.

This article discusses how Index Server makes use of HTML meta properties and describes an extension that enables Index Server to leverage HTML meta property datatypes other than strings. I'll show how to add meta properties to HTML documents, how to issue content queries on meta properties, and how to store meta property values in Index Server's property cache. Property values stored in the property cache can be used in property value queries and sorting, and they can be displayed in query results.

Information contained in this article applies to Index Server 2.0, shipped with Microsoft® Windows NT® Option Pack 4.0. Most of it also applies to Index Server 3.0, which is scheduled to ship with Windows 2000. For additional details about many of the topics covered in this article, please refer to the Index Server documentation (MSDN Library, Platform SDK).

Index Server's HTML Meta Property Features

HTML meta properties are a great way to add richness and value to the content of a Web site. They give detailed information about each document, making the documents easier to locate and more self-descriptive.

Meta properties are stored in the <head> section of HTML documents. Each property has a name and a value. You can define as many as you like, and there aren't any rules about what names to use for the properties. Here's an example meta property that might be used to identify a document's author:

<meta name="DocumentAuthor" content="Krishna Nareddy">

A great way to see how people use meta tags is to go to your favorite Web site, click on View Source, and examine the <head> section. The most typical use of meta properties is to provide keywords that search engines can use to locate documents that best match a query.

An important thing to remember about meta properties is to use them consistently for all the documents on a Web site. The more consistent the property values, the more useful they become. For example, if every document has the DocumentAuthor property, it's easy to guarantee that all the documents by a given author can be found.

Web browsers don't display meta property values, so the properties can contain information about documents that might not be of interest to visitors browsing your site.

Index Server provides full-text content indexing and searching of HTML meta properties. Each meta property can be searched independently of the others.

Storing meta property values in Index Server's property cache enables additional functionality. If values are in the property cache they can be used in property value queries, be displayed in query results, or be specified as the sort order for a query.

What follows is an example of how to use Index Server's meta property features for a hypothetical Web site that contains pages about different breeds of dogs. Each page has information about one particular breed. I'll show how Index Server takes advantage of meta properties to make it easier for people to find the Web pages in which they are interested. The Index Server sample query forms query.idq, query.htx, and query.asp are modified to demonstrate the techniques.

All of the files referred to in this article are available from the sample download link at the top of the article.

Meta Property Content Queries

The most obvious use of meta properties with Index Server is to allow the user of a Web site to issue content queries over the property values. This is why meta properties are included in the HTML specification.

The first step to enable content queries over meta property values is to decide what meta properties to use in the document set. Care should be taken to choose properties that can consistently be applied to the documents. Then, update all of the documents to include values for each property.

The imaginary canine Web site uses meta properties for breed name, weight, and when and where the breed originated. Each document describes one breed of dog and has appropriate values for the meta properties. Three sample pages contain these property values:

Dog1.htm:
<meta name="breedName" content="Australian Terrier">
<meta name="breedWeight" content="8">
<meta name="breedFirstBred" content="1872">
<meta name="breedOrigin" content="Australia">
Dog2.htm:
<meta name="breedName" content="Australian Cattle Dog">
<meta name="breedWeight" content="35">
<meta name="breedFirstBred" content="1840">
<meta name="breedOrigin" content="Australia">
Dog3.htm:
<meta name="breedName" content="Belgian Sheepdog">
<meta name="breedWeight" content="30">
<meta name="breedFirstBred" content="1891">
<meta name="breedOrigin" content="Belgium">

When Index Server indexes documents containing meta properties like these, it treats all property values as strings and stores the data in its index. Values with multiple words are parsed and each word is stored independently in the index.
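Because consistency matters so much, it can help to check the pages before they are indexed. The following is a minimal sketch, not part of Index Server or the article's sample download, that uses Python's standard html.parser module to list the meta properties in a page and report any that are missing. The names MetaCollector, meta_properties, and missing_properties are hypothetical, and the sketch assumes property names are compared case-sensitively.

```python
from html.parser import HTMLParser

# Property names taken from the sample pages in this article.
REQUIRED = ("breedName", "breedWeight", "breedFirstBred", "breedOrigin")


class MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            # A missing content attribute is recorded as an empty string.
            self.properties[name] = attributes.get("content") or ""


def meta_properties(html_text):
    """Return a dict mapping meta property names to their string values."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.properties


def missing_properties(html_text, required=REQUIRED):
    """Return the required property names that the document does not define."""
    found = meta_properties(html_text)
    return [name for name in required if name not in found]


# A hypothetical page that forgot the breedOrigin property.
incomplete_page = """<html><head>
<meta name="breedName" content="Australian Terrier">
<meta name="breedWeight" content="8">
<meta name="breedFirstBred" content="1872">
</head><body>...</body></html>"""

print(meta_properties(incomplete_page)["breedName"])  # Australian Terrier
print(missing_properties(incomplete_page))            # ['breedOrigin']
```

Running a check like this over every page on the site is one way to catch documents that would otherwise be silently left out of meta property queries.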

Before queries can be issued over a meta property, Index Server needs to be given a name for the property. Adding a property definition to the [names] section of an IDQ file accomplishes this. The modified Index Server sample file query.idq contains this line, which defines the breedName property:

breedName (DBTYPE_WSTR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 breedName

Defining a property in Active Server Pages (ASP) using IXSSO queries is similar. This line is added to the sample file query.asp:

Q.DefineColumn "breedName (DBTYPE_WSTR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 breedName"

These definitions tell Index Server that the HTML meta property named breedName (the last item on the line) will be referred to in queries as breedName (the first item on the line). The property is defined as a wide (Unicode) string value, and the long string of letters and numbers is the GUID of the property set to which all HTML meta properties belong.
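To make the parts of a definition line explicit, here is a small Python sketch that splits one into its four pieces. It is only an illustration of the simple form used in this article (friendly name, datatype in parentheses, GUID, property name); it is not Index Server's own parser, and the function name parse_definition and the dictionary keys are hypothetical.

```python
import re
import uuid

# Matches the simple form used in this article:
#   friendlyName (DATATYPE) = GUID propertyName
DEFINITION = re.compile(
    r"^\s*(?P<friendly>[^\s(]+)\s*\(\s*(?P<type>\w+)\s*\)\s*=\s*"
    r"(?P<guid>[0-9A-Fa-f-]{36})\s+(?P<name>\S+)\s*$"
)


def parse_definition(line):
    """Split a property definition line into its named parts."""
    match = DEFINITION.match(line)
    if match is None:
        raise ValueError(f"not a property definition: {line!r}")
    return {
        "friendly_name": match["friendly"],       # name used in queries
        "datatype": match["type"],                # for example DBTYPE_WSTR
        "property_set": uuid.UUID(match["guid"]), # GUID shared by meta properties
        "property_name": match["name"],           # name used in the <meta> tag
    }


parsed = parse_definition(
    "breedName (DBTYPE_WSTR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 breedName"
)
print(parsed["friendly_name"], parsed["datatype"], parsed["property_name"])
# breedName DBTYPE_WSTR breedName
```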

Once the property name is defined, it can be used to issue content queries. To find documents containing the word "Australian" in the breedName meta property, issue the query @breedName Australian.

Meta property queries can be used in conjunction with more traditional file content queries. This technique is often used to narrow the query results to a subset of the pages on a Web site. For example, to find all pages containing the word terrier for breeds that originated in Australia, you can issue the query terrier and @breedOrigin Australia.

Displaying HTML Meta Property Values in Query Results

Content queries over meta properties are great, but just as useful is the ability to display meta property values in query results. For example, in the dog Web site it would be useful to display in query results the name of the country in which the breed originated.

One might think that since Index Server stores meta property values in its index it would be straightforward to make the values available in query results. However, this isn't the case. The index is organized to take a query key and produce a list of files that match, not the other way around. To display meta property values using the index would require either scanning the entire index or opening each query result file and pulling out the meta property values. Neither of these approaches is efficient.

Index Server uses a different data structure, called the property cache, to make property value retrieval efficient. Meta property values must be added to Index Server's property cache to make them available for display.
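The difference between the two structures can be pictured with a toy model in Python. This is a conceptual sketch only and says nothing about Index Server's actual storage formats; word_index and property_cache are hypothetical stand-ins, and lowercasing the words is an assumption of the sketch.

```python
from collections import defaultdict

# Sample values from the three pages described earlier.
documents = {
    "dog1.htm": {"breedName": "Australian Terrier", "breedOrigin": "Australia"},
    "dog2.htm": {"breedName": "Australian Cattle Dog", "breedOrigin": "Australia"},
    "dog3.htm": {"breedName": "Belgian Sheepdog", "breedOrigin": "Belgium"},
}

# Word index: (property, word) -> files. Answers "which files match this word?"
word_index = defaultdict(set)
# Property cache stand-in: file -> {property: value}.
# Answers "what is this file's value?"
property_cache = {}

for path, props in documents.items():
    property_cache[path] = dict(props)
    for name, value in props.items():
        for word in value.lower().split():
            word_index[(name, word)].add(path)


def content_query(name, word):
    """Look up matching files directly from the word index."""
    return sorted(word_index.get((name, word.lower()), set()))


def cached_value(path, name):
    """Look up one file's value directly from the property cache stand-in."""
    return property_cache[path].get(name)


print(content_query("breedName", "Australian"))  # ['dog1.htm', 'dog2.htm']
print(cached_value("dog3.htm", "breedOrigin"))   # Belgium
```

Answering the second question from word_index alone would mean scanning every entry for the file, which is the inefficiency described above.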

To add meta properties to the property cache, invoke the Index Server Microsoft Management Console (MMC) administration tool. Open the catalog and select Properties from the tree view. Select the property to be added and then right-click it. Then click Properties and check the Cached box. Use the datatype VT_LPWSTR and the default size of 4, which means Index Server will manage the storage. Save the property cache changes by right-clicking the Properties item in the tree pane and then clicking commit.

After the property is added to the schema of the property cache, each document is given a null value for the property. Documents must be reindexed so that the values from each document are written to the property cache, since cache values are updated when a document is indexed.

To rescan a directory, use the Index Server MMC administration tool, select the directory containing your documents, right-click it, and force a full rescan of the files. Once the index is up-to-date again, the meta property will be available in the property cache.

For additional details about the process of adding a property to the property cache, please refer to the Index Server documentation.

To retrieve the meta property value from the property cache when queries are issued, define it in the same way as described for breedName above. Then, add the property name to the list of retrieved columns.

Getting back to the canine sample Web site, here are the changes for query.idq to support retrieving the breedOrigin property in query results:

breedOrigin (DBTYPE_WSTR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 breedOrigin
CiColumns=vpath,DocTitle,write,breedOrigin

Here are the changes for query.asp:

Q.DefineColumn "breedOrigin (DBTYPE_WSTR) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 breedOrigin"
Q.Columns = "DocTitle, vpath, filename, size, write, characterization, rank, breedOrigin"

Now all that's left is to use the property in the query results. The modified sample template file query.htx refers to <%breedOrigin%> in its detail section. The sample query.asp uses RS("breedOrigin").

There are many other uses for displaying meta property values in query results. One example is to have a Web site where each page is a description of some other Web site. Each document would have a meta property whose value contains a URL for the site being described. The query result page could then point to both the local descriptions and the URLs for the remote Web sites.

In the case of the dog breed Web site, each document could have a meta property with a URL pointing to information elsewhere on the Internet about the breed. This makes it easy to leverage content on other Web sites.

Meta Property Value Queries

Property value queries are different from property content queries in that the value in a document must match a query exactly for the document to be returned in a query result. For example, the content query @breedName Australian matches both dog1.htm and dog2.htm. The property value query @breedName=Australian Terrier matches just dog1.htm. The equal sign following the property name is the way to tell Index Server to run a property value query instead of a content query.
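The two matching rules can be compared side by side with a short Python sketch. It models only the behavior described in the paragraph above; the function names are hypothetical, and the case handling (case-insensitive words, exact comparison for values) is an assumption of the sketch rather than a statement about Index Server.

```python
breed_names = {
    "dog1.htm": "Australian Terrier",
    "dog2.htm": "Australian Cattle Dog",
    "dog3.htm": "Belgian Sheepdog",
}


def content_query(word):
    """Like @breedName word: the value only has to contain the word."""
    return sorted(
        path for path, value in breed_names.items()
        if word.lower() in value.lower().split()
    )


def value_query(text):
    """Like @breedName=text: the whole value has to equal the text."""
    return sorted(path for path, value in breed_names.items() if value == text)


print(content_query("Australian"))        # ['dog1.htm', 'dog2.htm']
print(value_query("Australian Terrier"))  # ['dog1.htm']
print(value_query("Australian"))          # []
```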

In order to issue property value queries for meta properties, the value must be in the Index Server property cache. The previous section contains instructions for adding a property to the property cache.

The meta property name must also be defined in the same way as shown in the examples above for query.idq and query.asp.

Once the property values are in the property cache and the property name has been defined, meta property value queries can be issued.

Meta property content queries solve most searching needs. Enabling value queries is less common and more of an administrative burden, since property value queries require modification of the property cache. Use value queries only when content queries won't solve the problem, that is, when you need an exact match between query and meta property values.

Sorting on Meta Property Values

Another benefit of storing meta property values in the property cache is that they can be used to sort query results. Sorting on meta property values can make query results more useful.

To specify a meta property as the sort order for a query, simply include the name in the sort specification. For example, the sample query form query.idq was modified to sort on the breedOrigin property:

CiSort=breedOrigin[d]

The [d] here means a descending sort. Use [a] for an ascending sort.

The query.asp sample contains this equivalent modification:

Q.SortBy = "breedOrigin[d]"

The HTML Property Filter

As shown, Index Server enables content and value queries over meta property values, and can display and sort on the values in query results. What more do you want? Actually, it's useful to have a little more.

Index Server treats all meta property values as strings. This rules out property value queries like @breedWeight < 20, which is meant to find dogs that weigh less than twenty pounds, because strings are compared character by character: the string "100" is treated as less than "20". That's where the HTML property filter comes in.
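The effect is easy to reproduce in Python, whose string comparison also works character by character. The sketch below uses the three sample weights plus a hypothetical fourth page, dog4.htm, with a weight of 100; it illustrates the ordering problem only and is not Index Server code.

```python
# Weights as they appear in the meta tags: all strings.
# dog4.htm is a hypothetical extra page added for illustration.
weights = {"dog1.htm": "8", "dog2.htm": "35", "dog3.htm": "30", "dog4.htm": "100"}


def lighter_than_as_strings(limit):
    """Compare the raw string values, as a string-typed property would."""
    return sorted(path for path, weight in weights.items() if weight < limit)


def lighter_than_as_integers(limit):
    """Convert each value to an integer before comparing."""
    return sorted(path for path, weight in weights.items() if int(weight) < limit)


print(lighter_than_as_strings("20"))  # ['dog4.htm']
print(lighter_than_as_integers(20))   # ['dog1.htm']
```

As strings, the 100-pound dog matches and the 8-pound dog does not, which is the opposite of what the query intends.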

The HTML property filter sits on top of Index Server's built-in HTML filter and converts meta property values from strings to the data types specified in its configuration file. If the converted values are stored in the property cache, they can be used in property value queries, retrieved in query results, and used for sorting.

The following is a description of how to install the HTML property filter and configure it to support the breedWeight meta property as an integer value, instead of as a string.

Installing the HTML Property Filter

Install the HTML property filter by copying htmlprop.dll and htmlprop.ini to the system32 directory. These files are included in the sample download at the top of this article. Then, use regedt32.exe to append the complete path to htmlprop.dll (for example, c:\winnt\system32\htmlprop.dll) to the DllsToRegister registry value:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\DllsToRegister

It's important to put the htmlprop.dll path at the end of the list so it'll override Index Server's built-in HTML filter.

You'll need to stop and restart the Content Index service for this change to take effect.

The HTML property filter won't have any impact on Index Server except for converting meta property values as specified in htmlprop.ini.

To uninstall the HTML property filter, stop the Content Index service, remove the path added to the registry value, delete htmlprop.dll and htmlprop.ini, and restart the Content Index service.

Using the HTML Property Filter

The configuration file htmlprop.ini tells the HTML property filter which meta properties to convert, and to which datatypes. Meta properties that aren't defined in htmlprop.ini are not modified. The format of htmlprop.ini is identical to the [names] section of Index Server's IDQ files. The htmlprop.ini included with this article defines the breedWeight property as an unsigned integer value:

breedWeight (DBTYPE_UI4) = d1b5d3f0-c0b3-11cf-9a92-00a0c908dbf1 breedWeight

The same property name definition also exists in the modified sample files query.idq and query.asp. This enables use of the breedWeight property in queries, result display, and query sort order.

Properties that are converted by the HTML property filter must be stored in Index Server's property cache to be used. The only difference from the discussion above about adding a property to the cache is that the appropriate datatype must be chosen instead of VT_LPWSTR. For example, the breedWeight property should be defined as type VT_UI4.

Once all of the steps above have been taken, the converted meta property values can be used for property value queries, display in query results, and sort specifications.

Table 1 lists the conversion datatypes supported by the HTML property filter. These types can be used when adding entries to htmlprop.ini.

Table 1. HTML Property Filter Conversion Types

Datatype       Description
DBTYPE_UI1     Unsigned 1-byte integer
DBTYPE_I2      Signed 2-byte integer
DBTYPE_UI2     Unsigned 2-byte integer
DBTYPE_I4      Signed 4-byte integer
DBTYPE_UI4     Unsigned 4-byte integer
DBTYPE_I8      Signed 8-byte integer
DBTYPE_UI8     Unsigned 8-byte integer
DBTYPE_R4      4-byte floating point number
DBTYPE_R8      8-byte floating point number
DBTYPE_BOOL    Boolean. 1, T, or TRUE for true
VT_FILETIME    Date and time, in Index Server format

When the HTML property filter can't parse a numeric meta property value that it was asked to convert, it sets the value to 0. To find files with suspect meta property values, try doing a query for that property with a value of 0.

Date (VT_FILETIME) values must be specified in the Index Server format. The format is yyyy-mm-dd hh:mm:ss. For example, 12 noon on July 1, 1998 is expressed as 1998-07-01 12:00:00. Dates that can't be converted cause the document not to be filtered, so take care when using dates. Use the Index Server HTML administration forms to find documents that could not be filtered.
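The conversion rules described above can be approximated in a few lines of Python, which may be handy for checking meta property values before documents are indexed. This is a sketch of the documented behavior, not a port of htmlprop.cxx: the function names are hypothetical, accepting lowercase "t" or "true" is an assumption, and Python's own parsers are somewhat more lenient than the filter may be (for example, strptime also accepts months and days without a leading zero).

```python
from datetime import datetime

# Values the article lists as meaning "true" for DBTYPE_BOOL.
TRUE_STRINGS = {"1", "T", "TRUE"}


def convert_integer(text):
    """Return the integer value, or 0 when the text cannot be parsed."""
    try:
        return int(text.strip())
    except ValueError:
        return 0


def convert_bool(text):
    """Return True for 1, T, or TRUE (case folding is an assumption)."""
    return text.strip().upper() in TRUE_STRINGS


def convert_datetime(text):
    """Parse yyyy-mm-dd hh:mm:ss; raises ValueError for anything else."""
    return datetime.strptime(text.strip(), "%Y-%m-%d %H:%M:%S")


print(convert_integer("35"))                    # 35
print(convert_integer("about 35"))              # 0
print(convert_bool("T"))                        # True
print(convert_datetime("1998-07-01 12:00:00"))  # 1998-07-01 12:00:00
```

A value that makes convert_datetime raise an error here is a likely candidate for a document that the filter would fail to process.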

Additional date formats and datatypes can be added to the HTML property filter by modifying the source code, which is included with this article.

The HTML Property Filter Implementation

Index Server indexes files by using Component Object Model (COM) objects that implement the IFilter interface. Index Server comes with IFilter objects for HTML, plain text, and Microsoft Office files. (Note that the book Microsoft Internet Information Server Resource Kit contains an IFilter for C++ that's handy for indexing source code.)

One approach to writing the HTML property IFilter would be to start from scratch and write lots of code to parse HTML. This is problematic for a couple of reasons. First, Index Server's HTML filter goes to a lot of trouble to do a good job of handling all the quirks of HTML. Second, HTML is constantly evolving, so it would take a fair amount of maintenance work just to keep up with the changes. That sounds like too much work.

Why not leverage the existing HTML filter in Index Server? Fortunately, the IFilter specification makes that simple. The HTML property filter takes over the job of filtering HTML files for Index Server, and hands off the work to Index Server's HTML filter by forwarding each IFilter interface method. When the property filter sees a meta property value being returned from the HTML filter to Index Server, it checks to see whether it needs to convert the value, does so if needed, and passes the value on.

A perusal of the source code for the filter (htmlprop.cxx) will show that most of the code is to load the real HTML filter and call through to it for each of the IFilter methods. The rest of the code parses the htmlprop.ini file and converts property values as requested.
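The forwarding design can be sketched in Python as a wrapper object that delegates to an inner filter and rewrites only the values it has been configured to convert. This is a conceptual model, not the C++ in htmlprop.cxx: the class names are hypothetical, and the several IFilter methods are collapsed into a single hypothetical get_values method to keep the example short.

```python
class BaseHtmlFilter:
    """Stand-in for the built-in HTML filter: yields (name, value) string pairs."""

    def __init__(self, properties):
        self._properties = list(properties)

    def get_values(self):
        yield from self._properties


class ConvertingFilter:
    """Forwards to an inner filter, converting values named in `converters`."""

    def __init__(self, inner, converters):
        self._inner = inner
        # Maps a property name to a conversion function (the htmlprop.ini role).
        self._converters = converters

    def get_values(self):
        for name, value in self._inner.get_values():
            convert = self._converters.get(name)
            # Properties without a configured converter pass through unchanged.
            yield (name, convert(value) if convert else value)


base = BaseHtmlFilter([("breedName", "Australian Terrier"), ("breedWeight", "8")])
wrapped = ConvertingFilter(base, {"breedWeight": int})

print(list(wrapped.get_values()))
# [('breedName', 'Australian Terrier'), ('breedWeight', 8)]
```

The wrapper never parses HTML itself, which is why improvements to the inner filter are picked up without any changes to the wrapper.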

HTML meta properties open avenues for applications of Index Server. The ability to store meta property values in the property cache and configure datatypes for the values enables Index Server to solve problems that would otherwise require databases and sophisticated programming.