How to Write a Filter for Use by SharePoint Portal Server 2003 and Other Microsoft Search-Based Products

 

David Lee
Microsoft Corporation

April 2004

Applies to:
    Microsoft® Office SharePoint™ Portal Server 2003
    Microsoft SQL Server™ 2000
    Microsoft Exchange Server 2003
    Microsoft Exchange Server 2000
    Microsoft Windows Server™ 2003

Summary: Explore a sample filter implementation as you learn how Microsoft Office SharePoint Portal Server 2003 and other Microsoft Search-based products such as Microsoft SQL Server, Microsoft Exchange Server, and Microsoft Indexing Service use filters to extract the content and properties of files for inclusion in a full-text index.

Contents

Introduction
IFilter and the IPersist Family of Interfaces
Writing a Filter Using IFilter
Implementation Notes
Testing Filter DLLs
Conclusion

Introduction

Search-based products that use the Microsoft Search (MSSearch) service, such as Microsoft® Office SharePoint™ Portal Server 2003, Microsoft SQL Server, Microsoft Exchange Server, and Microsoft Indexing Service, use filters to extract the content and properties of files for inclusion in a full-text index. This article describes how to write a filter that these applications can use, and discusses the following:

  • COM interface filter support
  • Filter implementation
  • Filter installation and testing

The Microsoft Windows Platform SDK in the MSDN Library includes a text file filter called SmpFilt (simple filter) that is installed in the \mssdk\Samples\WinBase\Index\SmpFilt directory. SmpFilt filters plain text files with the .smp extension. In this article, we examine the sample filter implementation and work with a copy that you can use for reference.

IFilter and the IPersist Family of Interfaces

A filter is a COM object that implements the IFilter interface and one or more of the IPersist family of interfaces. Applications such as Indexing Service create filter objects and invoke methods on these interfaces to retrieve the text and properties of documents. Writing an IFilter is the only way you can ensure that your file format is included in the index.

Typically, filters implement four interfaces:

  • IFilter. Contains methods to retrieve text and properties from a file.
  • IPersistFile. Contains a method to load a file by the absolute path.
  • IPersistStream. Contains a method to load a file from an IStream.
  • IPersistStorage. Contains a method to load a file from an IStorage (structured storage).

For general information on these interfaces, see the Platform SDK.

Note   To maintain the focus on the four interfaces, this article does not cover methods from the standard IUnknown interface.

IFilter Interface

To write a COM filter that you can use with MSSearch-based products to extract the content and properties of files for inclusion in a full-text index, you must implement the IFilter interface. This interface exposes the methods that the search product uses to retrieve text and properties from a file.

The following table describes the methods of the IFilter interface.

Table 1. IFilter interface methods

Method Description
Init(ULONG fFlags, ULONG cAttrib, FULLPROPSPEC * aAttrib, ULONG *pFlags) Initializes a filtering session.
GetChunk(STAT_CHUNK *pStat) Positions the filter at the beginning of the first or next chunk and returns a descriptor.
GetText(ULONG *pcwcBuf, WCHAR *pwcBuf) Retrieves text from the current chunk.
GetValue(PROPVARIANT **ppProp) Retrieves property values from the current chunk.
BindRegion(FILTERREGION origPos, REFIID riid, void **ppUnk) Retrieves an interface representing the specified portion of the object. Currently reserved for internal use; do not implement. This method returns E_NOTIMPL.

IPersist Interface

The IPersist interface defines the single method GetClassID, which supplies the CLSID of an object that can be stored persistently in the system. A call to this method enables the object to specify which object handler to use in the client process, as it is used in the OLE default implementation of marshaling.

You must implement the single method of IPersist in implementing any one of the other persistence interfaces: IPersistStorage, IPersistStream, or IPersistFile. Typically, for example, you would implement IPersistStorage on an embedded object, IPersistFile on a linked object, and IPersistStream on a new moniker class, although their uses are not limited to these objects. You could implement IPersist in a situation where all that is required is to obtain the CLSID of a persistent object, as it is used in marshaling.

The IPersistFile, IPersistStream, and IPersistStorage interfaces all inherit the IPersist interface.

The following table describes the method of the IPersist interface.

Table 2. IPersist interface method

Method Description
GetClassID(CLSID * pClassID) Returns the class identifier (CLSID) for the component object. In the case of a filter, this is the file type identifier. If this is a new filter, use uuidgen.exe, available from the Platform SDK, to generate a unique CLSID. You must implement the GetClassID method.

IPersistFile Interface

If you use a filter to include stand-alone files on a file system in the full-text index, you must implement the IPersistFile interface because it contains a method to load a file by absolute path. The sample filter implements only the IFilter and IPersistFile interfaces.

MSSearch detects the interfaces that a filter implements by using the IUnknown::QueryInterface() method. If you implement no other IPersist interfaces, then you must implement the IPersistFile interface. Most versions of MSSearch create a temporary file and use IPersistFile if a filter does not implement IPersistStream or IPersistStorage.

Because filters only operate in read-only mode, the only methods in IPersistFile that you must implement are GetClassID(), Load(), and GetCurFile(). The other methods are not called and return E_NOTIMPL because filter clients do not write to the file. The Load() method takes a path of the file to filter. The sample filter saves the path so you can use it to open the file later. GetCurFile() just returns the path of the file being filtered that was passed to Load() earlier. IPersistFile inherits from IPersist.
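The read-only contract described above can be sketched in portable C++. This is a minimal model, not the sample filter's code: Status, PersistFileSketch, and the lowercase method names are hypothetical stand-ins for HRESULT values and the real COM interface, chosen so the fragment compiles without the Platform SDK headers. Returning a failure from getCurFile() before load() has been called is also an assumption of the sketch.

```cpp
#include <string>

// Hypothetical stand-ins for HRESULT values (S_OK, E_NOTIMPL, E_FAIL).
enum class Status { Ok, NotImplemented, Failed };

// Hypothetical model of the IPersistFile methods a read-only filter supports.
class PersistFileSketch {
public:
    // Models Load(): only remember the path; the file can be opened later.
    // A second call simply replaces the path of the previous file.
    Status load(const std::wstring& path) {
        if (path.empty())
            return Status::Failed;
        path_ = path;
        return Status::Ok;
    }

    // Models GetCurFile(): hand back whatever path load() received.
    Status getCurFile(std::wstring& pathOut) const {
        if (path_.empty())
            return Status::Failed;   // load() has not been called yet
        pathOut = path_;
        return Status::Ok;
    }

    // Filter clients never write, so the save-related methods do nothing.
    Status isDirty() const { return Status::NotImplemented; }
    Status save(const std::wstring&, bool) { return Status::NotImplemented; }
    Status saveCompleted(const std::wstring&) { return Status::NotImplemented; }

private:
    std::wstring path_;
};
```

A real filter would expose the same behavior through the COM method signatures listed in Table 3.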

The following table describes the methods of the IPersistFile interface.

Table 3. IPersistFile interface methods

Method Description
IsDirty(void) Checks an object for changes since it was last saved to its current file. This method returns E_NOTIMPL in filters.
Load(LPCOLESTR pszPath, DWORD dwMode) Opens the specified file and initializes an object from the file contents. You can delay opening the file until you need it.
Save(LPCOLESTR pszPath, BOOL fRemember) Saves the object into the specified file. This method returns E_NOTIMPL in filters.
SaveCompleted(LPCOLESTR pszPath) Notifies the object that it can revert from NoScribble mode to Normal mode. This method returns E_NOTIMPL in filters.
GetCurFile(LPOLESTR *ppszPath) Gets the current name of the file associated with the object. Returns the path specified in Load().

IPersistStream Interface

The IPersistStream interface is often used if the file to include in the full-text index is embedded in another document. IPersistStream contains a method to load a file from an IStream. IPersistStream is also used to filter documents stored in non-file systems such as SQL Server databases. You should implement IPersistStream for your filter for two primary reasons:

  • To ensure future compatibility and performance. You gain efficiency by including data in the full-text index from a store other than the file system if you can handle it directly as a stream, rather than by copying the data to disk and using the IPersistFile interface.
  • To help increase security. Future versions of MSSearch may only support IPersistStream, not IPersistFile or IPersistStorage. As a result, if you implement IPersistStream for your filter, your systems could be more secure because the context in which the filter runs does not need the rights to open any files on the disk or over the network.

Because filters operate in read-only mode, you must implement only the Load() and GetClassID() methods. The other methods in this interface are not called and can return E_NOTIMPL. The Load() method should save the IStream pointer for later use when IFilter methods are invoked to retrieve the content of the stream. GetClassID() is the same as described earlier for IPersistFile. IPersistStream inherits from IPersist.

The following table describes the methods of the IPersistStream interface.

Table 4. IPersistStream interface methods

Method Description
IsDirty(void) Checks the object for changes since it was last saved. This method returns E_NOTIMPL in filters.
Load(IStream *pStm) Initializes an object from the stream where it was previously saved.
Save(IStream *pStm, BOOL fClearDirty) Saves an object into the specified stream and indicates whether the object should reset its dirty flag. This method returns E_NOTIMPL in filters.
GetSizeMax(ULARGE_INTEGER *pcbSize) Returns the size in bytes of the stream needed to save the object. This method returns E_NOTIMPL in filters.

IPersistStorage Interface

If the file format is a structured storage format, implement the IPersistStorage interface. IPersistStorage is generally used for structured storage embeddings in other structured storage files. For example, the filter for Microsoft Office System files loads filters for embeddings by using IPersistStorage.

The following table describes the methods of the IPersistStorage interface.

Table 5. IPersistStorage interface methods

Method Description
IsDirty(void) Checks if there has been a change. This method returns E_NOTIMPL in filters.
InitNew(IStorage * pStg) Creates a new storage. This method returns E_NOTIMPL in filters.
Load(IStorage * pStg) Loads the object from the specified storage. Filters that support IPersistStorage must implement this method.
Save(IStorage * pStg, BOOL fSameAsLoad) Saves the object into the specified storage. This method returns E_NOTIMPL in filters.
SaveCompleted(IStorage * pStg) Reserved for internal use. This method returns E_NOTIMPL in filters.
HandsOffStorage(void) Reserved for internal use. This method returns E_NOTIMPL in filters.

Writing a Filter Using IFilter

Some versions of MSSearch use the COM CoCreateInstance() function or similar functions to create filters. Other versions traverse the registry and use functions like LoadLibrary() and GetProcAddress() to load filters.

Clients of a filter first call the Load() method on IPersistFile, IPersistStream, or IPersistStorage. Then the Init() method of the IFilter interface is invoked. GetChunk() is called, then either GetText() or GetValue() is called as many times as needed to retrieve all of the text or property values associated with the chunk. This process repeats until GetChunk() reports that there are no more chunks in the document. BindRegion() is not called, so it simply returns E_NOTIMPL.

The following example code demonstrates typical filter usage. It uses the LoadIFilter() helper function. LoadIFilter uses a means such as CoCreateInstance() to create the filter and then calls IPersistFile::Load with the name of the specified file.

Note   This example is not actual code but is used only to illustrate typical filter usage.

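IFilter *pFilt = 0;
HRESULT hr = LoadIFilter( L"c:\\file.smp", 0, (void **) &pFilt );
if ( FAILED( hr ) )
    return hr;

ULONG flags;
hr = pFilt->Init( IFILTER_INIT_APPLY_INDEX_ATTRIBUTES, 0, 0, &flags );

STAT_CHUNK chunk;
while ( SUCCEEDED( hr ) && SUCCEEDED( hr = pFilt->GetChunk( &chunk ) ) )
{
    if ( CHUNK_TEXT == chunk.flags )
    {
        WCHAR awc[100];
        ULONG cwc = 100;   // in: buffer size; out: characters returned
        while ( SUCCEEDED( hr = pFilt->GetText( &cwc, awc ) ) )
        {
            // process the cwc characters in the text buffer. . .

            cwc = 100;     // reset the buffer size before the next call
        }

        if ( FILTER_E_NO_MORE_TEXT == hr )
            hr = S_OK;     // normal end of a text chunk
    }
    else // CHUNK_VALUE
    {
        PROPVARIANT *pVar;
        while ( SUCCEEDED( hr = pFilt->GetValue( &pVar ) ) )
        {
            // process the property value. . .

            PropVariantClear( pVar );
            CoTaskMemFree( pVar );
        }

        if ( FILTER_E_NO_MORE_VALUES == hr )
            hr = S_OK;     // normal end of a value chunk
    }
}

if ( FILTER_E_END_OF_CHUNKS == hr )
    hr = S_OK;             // normal end of the document

pFilt->Release();
return hr;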

The Init() method tells a filter to get ready to emit the data in a file. Filters typically open the file specified by IPersistFile::Load() when this method is called. The arguments to Init() specify what in the document to filter and how to filter it. If an array of properties is passed to Init(), the filter should emit only those properties while the file is filtered. If the IFILTER_INIT_APPLY_INDEX_ATTRIBUTES flag is set, the filter can emit any properties and text it needs from the document being filtered. If no properties are passed and IFILTER_INIT_APPLY_INDEX_ATTRIBUTES is not set, the filter should emit only the text contents of the file. The additional flags to Init() specify how to manage paragraphs, line-breaks, and hyphens. The sample filter deals with raw text and does not use these flags.
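The three cases described above can be written as a small decision function. This is a sketch only: Emit and decideEmission are hypothetical names, the bool parameter models whether IFILTER_INIT_APPLY_INDEX_ATTRIBUTES is set, and the count models the number of properties passed to Init(). Letting the property array take precedence when both are supplied is an assumption that follows the order of the rules in the paragraph.

```cpp
#include <cstddef>

// Hypothetical summary of what a filter should emit for one Init() call.
enum class Emit { RequestedPropertiesOnly, TextAndAllProperties, TextOnly };

// attributeCount models the size of the property array passed to Init();
// applyIndexAttributes models the IFILTER_INIT_APPLY_INDEX_ATTRIBUTES flag.
Emit decideEmission(bool applyIndexAttributes, std::size_t attributeCount) {
    if (attributeCount > 0)
        return Emit::RequestedPropertiesOnly;  // caller named specific properties
    if (applyIndexAttributes)
        return Emit::TextAndAllProperties;     // filter chooses text and properties
    return Emit::TextOnly;                     // no list and no flag: contents only
}
```

A filter could compute this once in Init() and consult the result in GetChunk() to skip chunks the caller did not ask for.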

The Webhits.dll ISAPI hit-highlighting feature of Indexing Service passes attribute arguments to Init(). Filters must either implement this functionality or gracefully ignore the parameters. All other currently shipping MSSearch-based products pass no attribute arrays in Init().

Because Init() can be called multiple times on an IFilter object, you must prepare a filter to reset any existing state and rewind the file being filtered back to the starting position. Init() is generally called once per file while filtering and twice per file by an application using the hit-highlighter feature (Webhits.dll) that shows query hits in a document. The first pass locates query matches in the file, and the second pass renders the file in HTML.
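The reset-and-rewind requirement can be illustrated with a portable model. RewindableFilterSketch and its methods are hypothetical; a real filter would seek its file handle or stream back to the start instead of resetting an index into an in-memory string.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical model of a filter whose Init() may be called more than once.
class RewindableFilterSketch {
public:
    explicit RewindableFilterSketch(std::wstring content)
        : content_(std::move(content)) {}

    // Models IFilter::Init(): drop per-pass state and rewind to the start.
    void init() { position_ = 0; }

    // Models one filtering step: returns up to maxChars characters starting
    // at the current position and advances past them.
    std::wstring read(std::size_t maxChars) {
        std::wstring part = content_.substr(position_, maxChars);
        position_ += part.size();
        return part;
    }

private:
    std::wstring content_;
    std::size_t position_ = 0;
};
```

With this shape, a second pass such as the hit-highlighter's rendering pass sees the same content from the beginning as the first pass did.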

The Init() implementation of the sample filter opens the file you specified earlier in the IPersistFile::Load() method. Init() also returns the IFILTER_FLAGS_OLE_PROPERTIES flag to the caller, indicating that the caller must enumerate and filter the OLE structured storage properties on the file. Otherwise, the filter itself would have to use the OLE IPropertySetStorage interface to retrieve the OLE properties on a file. Returning this flag makes it easier to write a filter. The NTFS file system in Microsoft Windows® 2000 and later versions enables all files (not just OLE structured storage files) to have arbitrary properties. To ensure that the properties get filtered by the calling process, filters must return the IFILTER_FLAGS_OLE_PROPERTIES flag. If a filter should expose only a subset of the properties available to IPropertySetStorage, it should not return IFILTER_FLAGS_OLE_PROPERTIES. Rather, it should emit the property chunks and values itself as needed.

The GetChunk() method retrieves information about the first or next logical block of information from the file being filtered. A chunk is either a text chunk or a property value chunk. GetChunk() does not return the text or property value itself. Rather, subsequent calls to GetText() and GetValue() retrieve the body of the chunk. Text chunks are Unicode strings. Property value chunks are PROPVARIANT values that can contain a rich set of data types.

GetChunk() returns information about the current chunk in a STAT_CHUNK structure, including a monotonically increasing chunk ID, status information about how the chunk relates to the previous chunk, a flag indicating whether the chunk contains text or a value, the chunk's locale, and the chunk's property specification. The property specification consists of a CLSID and either an integer or string property identifier.

The chunk locale identifier is used to choose an appropriate word breaker, and it is very important that you correctly identify it. If the filter cannot determine the locale of the text, it should assume the default system locale, available by using GetSystemDefaultLCID(). If you control the file format and it currently does not contain locale information, you should add a user feature to enable proper locale identification. Using a mismatched word breaker can lead to a bad query experience for the user.

The GetChunk() method of the sample filter reads a fixed portion of the file into a buffer, converts it to Unicode, and keeps the text for subsequent calls to GetText().
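The fixed-portion approach can be sketched as follows. This is an illustrative model rather than the sample filter's code: ChunkInfo is a hypothetical, trimmed-down counterpart of STAT_CHUNK, Status stands in for the HRESULT values, and the document is assumed to be Unicode already, so the conversion step is omitted.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-ins for S_OK and FILTER_E_END_OF_CHUNKS.
enum class Status { Ok, EndOfChunks };

// Hypothetical, trimmed-down counterpart of STAT_CHUNK.
struct ChunkInfo {
    unsigned long id = 0;   // increases by one for every chunk returned
};

class ChunkReaderSketch {
public:
    ChunkReaderSketch(std::wstring document, std::size_t chunkSize)
        : document_(std::move(document)), chunkSize_(chunkSize) {}

    // Models GetChunk(): buffer the next fixed-size portion and describe it.
    Status getChunk(ChunkInfo& info) {
        if (chunkSize_ == 0 || offset_ >= document_.size())
            return Status::EndOfChunks;
        const std::size_t n = std::min(chunkSize_, document_.size() - offset_);
        buffer_ = document_.substr(offset_, n);
        offset_ += n;
        info.id = ++lastId_;
        return Status::Ok;
    }

    // Text kept for the GetText() calls that follow this chunk.
    const std::wstring& buffered() const { return buffer_; }

private:
    std::wstring document_;
    std::size_t chunkSize_;
    std::size_t offset_ = 0;
    unsigned long lastId_ = 0;
    std::wstring buffer_;
};
```

Note that the chunk descriptor carries no text; the buffered portion is only handed out later, which mirrors the split between GetChunk() and GetText().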

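GetText() returns text from the current CHUNK_TEXT chunk. Callers should expect three return codes from this method during normal operation (two success codes and one expected failure code that marks the end of the text):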

  • If the entire text from a chunk cannot fit in the buffer provided to GetText(), the method returns as much text as possible and sets the return code to S_OK. GetText() is called as many times as needed to retrieve all the text in the chunk.
  • If all remaining text fits into the buffer, the method returns FILTER_S_LAST_TEXT.
  • If no more text is available, the method returns FILTER_E_NO_MORE_TEXT.

Use of FILTER_S_LAST_TEXT is an optional optimization; filters can get by with just S_OK and FILTER_E_NO_MORE_TEXT.

The GetText() method of the sample filter copies as much of the file buffer as it can into the output buffer and tracks how much was copied so it is prepared for the next call to GetText().
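The copy-and-track behavior can be modeled in portable C++. The sketch below uses only the two required return codes; TextChunkSketch, drain, and Status are hypothetical names, and the in/out count parameter mirrors the role of pcwcBuf in GetText().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-ins for S_OK and FILTER_E_NO_MORE_TEXT.
enum class Status { Ok, NoMoreText };

class TextChunkSketch {
public:
    explicit TextChunkSketch(std::wstring text) : text_(std::move(text)) {}

    // Models GetText(): on entry count is the buffer capacity in characters;
    // on return it is the number of characters actually copied.
    Status getText(std::size_t& count, wchar_t* buffer) {
        if (copied_ >= text_.size()) {
            count = 0;
            return Status::NoMoreText;
        }
        const std::size_t n = std::min(count, text_.size() - copied_);
        std::copy_n(text_.data() + copied_, n, buffer);
        copied_ += n;   // remember where the next call resumes
        count = n;
        return Status::Ok;
    }

private:
    std::wstring text_;
    std::size_t copied_ = 0;
};

// Caller side: keep calling until the chunk reports that it is exhausted.
std::wstring drain(TextChunkSketch& chunk, std::size_t capacity) {
    assert(capacity > 0);   // a zero-length buffer could never make progress
    std::wstring all;
    std::wstring buffer(capacity, L'\0');
    std::size_t count = capacity;
    while (chunk.getText(count, &buffer[0]) == Status::Ok) {
        all.append(buffer, 0, count);
        count = capacity;   // reset the in/out parameter before the next call
    }
    return all;
}
```

For example, draining a chunk that holds "hello world" through a four-character buffer takes three successful calls followed by one call that reports the end of the text.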

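GetValue() returns property values for the current CHUNK_VALUE chunk. The method sets the value and returns S_OK. If the value for the current chunk has already been retrieved, the method returns FILTER_E_NO_MORE_VALUES. If the current chunk is not a CHUNK_VALUE chunk, the method returns FILTER_E_NO_VALUES.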

Property values are retrieved in PROPVARIANT structures, which can hold a wide variety of data types. GetValue() must allocate the PROPVARIANT structure itself using CoTaskMemAlloc(). The caller of GetValue() is responsible for freeing memory pointed to by the PROPVARIANT using PropVariantClear(), and for freeing the structure itself with CoTaskMemFree().

The sample filter does not emit property values, so it just returns FILTER_E_NO_VALUES from GetValue().

Filters can return property values that are not stored as OLE structured storage properties but that are derived implicitly from the file format. For example, the HTML filter emits metatag values as properties. A filter for C++ source files might report property values for the class and method names it finds. If your filter has unique properties to return, you must assign them a property specification to return in the STAT_CHUNK parameter of GetChunk(). Use uuidgen.exe to generate a GUID for your new property set, then define either PROPID or string identifiers for each property in the set. If you are using these properties for queries, you must also define the properties as described in the documentation for the appropriate product.

MSSearch indexes text returned by GetText() to enable content queries. Properties returned by GetValue() are included in the full-text index as well. Text property values (with a PROPVARIANT type such as VT_LPWSTR) can also be used in content queries. All property types can be used in property value queries if the property value is made available at the time the query is executed. The most convenient way to make such properties available is to store them in the property cache.

Implementation Notes

The sample filter defines the class CSmpFilter that inherits from and implements IFilter and IPersistFile. Each instance of this class corresponds to a separate filter object. For example, if the Microsoft Indexing Service Webhits.dll is hit-highlighting .smp documents for three users at once, it creates three IFilter objects, which under the covers creates three instances of CSmpFilter.

Similar to any COM object, filter DLLs need a class factory to create instances of the filter. The sample filter class factory (CSmpFilterCF) creates instances of CSmpFilter when CreateInstance() is called.

The sample filter's implementation of DllGetClassObject(), DllCanUnloadNow(), and DllMain() are typical of COM DLLs and contain nothing filter-specific.

Advanced Filter Issues

Filters must be multi-thread aware. If a filter uses global data, it must be protected with synchronization primitives such as critical sections. Multiple instances of filter objects are created and used at once if a filter is marked free-threaded or apartment. Filtering and hit-highlighting are much more efficient when a filter is free-threaded or apartment, enabling multiple instances to run at once. The "Installing Filters" section, later in this article, discusses how to specify the threading model of your filter in the registry.

If you cannot make your filter free-threaded, be sure to mark it Single in the ThreadingModel to prevent crashes. For performance reasons, you should do this only as a last resort. If you do not specify a threading model in the registry then it is assumed to be Single.

MSSearch products do not have multiple threads that use a single filter instance at the same time; each thread has its own instance of a filter.

To make a filter multi-threaded, you can put all global variables in the filter object class to ensure that there is no contention over the state and each instance works independently.

MSSearch-based products load filters in a special process called a filter daemon. To debug your filter, hook up a debugger to one of the processes listed in the following table.

Table 6. Filter daemon processes for specific applications

Application Filter daemon process to use
SharePoint Portal Server, Exchange Server, or SQL Server Mssdmn.exe
Indexing Service Cidaemon.exe

Ordinarily, if a filter daemon process is non-responsive for a few minutes, it is terminated, with the assumption that a filter bug caused the problem. When a filter daemon process is being debugged, the idle check is automatically disabled. Some older versions of Indexing Service do not have this debugger check, so you must set the registry value HKLM\System\CurrentControlSet\Control\ContentIndex\FilterIdleTimeout (REG_DWORD) to 6000000 (decimal), which extends the timeout to one hundred minutes.
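As a quick check of the arithmetic, the following sketch converts the quoted value to minutes. It assumes the timeout is expressed in milliseconds, which is what the stated equivalence of 6000000 and one hundred minutes implies; kFilterIdleTimeout and timeoutMinutes are hypothetical names.

```cpp
#include <chrono>

// The value quoted above, assumed to be in milliseconds.
constexpr std::chrono::milliseconds kFilterIdleTimeout{6000000};

// Whole minutes represented by a millisecond timeout.
constexpr long long timeoutMinutes(std::chrono::milliseconds timeout) {
    return std::chrono::duration_cast<std::chrono::minutes>(timeout).count();
}

// 6000000 ms / 60000 ms per minute = 100 minutes.
static_assert(timeoutMinutes(kFilterIdleTimeout) == 100,
              "6000000 ms should equal one hundred minutes");
```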

To determine definitively which process has your filter loaded, you can use the debugger tool tlist as follows: tlist -m yourfilter.dll.

Filters must be prepared to have their final Release() method called at any time. Do not assume that the entire contents of a file are consumed before the filter is destroyed. Release() may be called early in error conditions, and also in a variety of other conditions, such as when the application shuts down or when large documents are included in the full-text index.

Filters must return valid HRESULTs that reflect proper error handling. If the filter encounters an error, such as an out-of-memory error, the method in which the failure occurred must return the error. Failure to be diligent will result in user confusion when the user assumes the index is fully up-to-date, yet queries do not return the anticipated results. Be careful not to mask true error conditions by mapping the error code into a generic error such as E_FAIL.
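One way to keep error codes specific is to translate each failure at the method boundary instead of collapsing everything into a generic code. The sketch below models that idea in portable C++: Status, AccessDeniedError, and runFilterStep are hypothetical names, and the status values stand in for HRESULTs such as E_OUTOFMEMORY, E_ACCESSDENIED, and E_FAIL. The access-denied case is included only as an illustrative second error.

```cpp
#include <functional>
#include <new>
#include <stdexcept>

// Hypothetical stand-ins for S_OK, E_OUTOFMEMORY, E_ACCESSDENIED, E_FAIL.
enum class Status { Ok, OutOfMemory, AccessDenied, Unexpected };

// Hypothetical exception type used by this sketch for a denied file open.
struct AccessDeniedError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs one unit of filter work and reports the most specific status it can.
Status runFilterStep(const std::function<void()>& step) {
    try {
        step();
        return Status::Ok;
    } catch (const std::bad_alloc&) {
        return Status::OutOfMemory;    // keep the real cause visible
    } catch (const AccessDeniedError&) {
        return Status::AccessDenied;
    } catch (...) {
        return Status::Unexpected;     // generic code only as a last resort
    }
}
```

The ordering matters: specific handlers come first so that the generic status is used only when nothing more precise is known.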

Filters must never open files for write access. A filter that calls CreateFile() requesting write access will cause a deadlock in the filter daemon because the filter daemon takes an opportunistic lock (oplock) on files before invoking the filter. If some other application attempts to open the file for write access, the oplock breaks, causing the filter daemon to close its handles and reschedule indexing of the file to some later time. The deadlock happens if a filter opens a file for write access, because it blocks in that call until the oplock break is serviced. However, it cannot be serviced because the thread is busy waiting for the open to occur. If your filter works well in a stand-alone program, such as filtdump.exe, but halts in the filter daemon, this is likely the cause.

Filters for files that can contain embedded documents are responsible for loading and calling filter DLLs for the embedded documents. The BindIFilterFromStream() function, documented in the Platform SDK, makes this straightforward. With respect to queries, all text from an embedded document is treated as if it was in the hosting document.

Filter APIs that are documented in the Platform SDK such as LoadIFilter() and BindIFilterFromStream() work across all versions of MSSearch. Products such as SharePoint Portal Server override the default implementation of these functions, but this is invisible to filter writers.

The IPersist family of Load() methods can be called multiple times on a single instance of a filter. Filters should be prepared for this by freeing any existing allocated resources for the previous file being filtered and getting ready to filter the next file.

Design your filters to handle low memory conditions gracefully. MSSearch products may consume significant amounts of memory causing the process to run potentially short of memory. Your filter should anticipate this and return appropriate error codes. If a document fails to index because there is insufficient memory, the administrator must have a way to be aware of this fact so appropriate action can be taken.

Note   Consider digitally signing your filter DLL. Future versions of MSSearch-based products may have security features that make this a requirement by default. Products are more secure if they only load signed binaries that are trusted by the system administrator. You are ready for any future considerations if you sign your DLLs now.

Installing Filters

Beyond implementing the interfaces described earlier, filters should implement self-registration for installation. COM dictates that DLLs export the functions DllRegisterServer and DllUnregisterServer for purposes of installing and uninstalling. You register filters similarly to how you register a COM object, and do a little extra work to associate file types with the filter. All installation information is written to the system registry.

The registration information described here is for Indexing Service. Other MSSearch-based products have their own registration model for filters. Most of these products, however, fall back on the Indexing Service registration for file extensions not covered in their configuration. Some of these products may have to be configured to do so. If your filter is targeting just one MSSearch-based product, consult the product's documentation to see how it should be registered. Otherwise, register your filter for Indexing Service so all applications can use it.

Note   SharePoint Portal Server 2003 does not use the same filters as Indexing Service. SharePoint Portal Server installs its own versions of the filters provided with Windows in the registry at:

HKLM\software\Microsoft\SPSSearch\ContentIndexCommon\Filters

SharePoint Portal Server looks at this location in the registry first, then falls back to look at the Indexing Service 3.0 registry model. This enables SharePoint Portal Server to load filters written by ISVs for Indexing Service. The layout of the registry for SharePoint Portal Server is documented in the SharePoint Products and Technologies 2003 SDK and is not duplicated in this article because ISVs can simply code to the Indexing Service 3.0 model.

Filters operate over specific types of files. File types are defined primarily by file extension. When an application such as Indexing Service is ready to filter a file, it looks in the registry under the file's extension to determine which filter to load. It then follows a series of registry links to find the name of the filter DLL. Installation of a filter involves writing to the registry the association between file type and filter DLL.

There are three CLSIDs for every filter (use uuidgen.exe to generate them for your filter):

  • The first is the CLSID of your file type, returned by IPersistFile::GetClassID as discussed earlier. In the sample filter, this is CLSID_CSmpFilter, which is {8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4}.

  • The second CLSID identifies the persistent handler for the file type. The sample filter uses {8B0E5E73-3C30-11d1-8C0D-00AA00C26CD4}. The persistent handler key contains a list of persistent handlers registered for the class. Persistent handlers allow an object to be created to operate on persistent data without loading a full application. For example, if you were filtering a Microsoft Office Word document, it would be unacceptable performance if Word had to start. Instead, you just need to load a small DLL that understands the file structure of Word documents. IID_IFilter is the particular persistent handler that is important here, which is {89BCB740-6119-101A-BCB7-00DD010655AF}. This CLSID is constant for all filters, because they all implement IFilter.

  • The value of the IID_IFilter key is the third unique CLSID for a filter. This CLSID is the object that implements IFilter for the file type, in this case {8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4}. This key contains an InprocServer32 value that specifies the DLL name and threading model. If the filter is in the system path, like the system32 directory, a file name is sufficient. If the filter is not in the system path, this value should have a full path specification.

    Note   In future versions of Microsoft operating systems it may get more difficult to install files in the system32 directory, so it is best to install them under Program Files and include a full path to the filter in the registry. For security reasons, it is also good to specify a full path to your DLL in the registry. If there is no full path, it is possible a "Trojan horse" version of your DLL could be loaded if it happens to be in the process path before your version.

Index Server 2.0 Filter Registration Model (Deprecated)

When an application loads a filter for an .smp file, it looks up the .smp key's value and finds SmpFilt.Document. It goes to this key and loads the CLSID value, then opens that key to find the PersistentHandler value. It opens this key and scans the values under PersistentAddinsRegistered looking for IID_IFilter. If found, it opens the value from this key and looks for InprocServer32 to find the name and threading model of the filter DLL. Your filter should only register in this manner to support Microsoft Windows NT® 4.0. New filters should either register only in the newer way described in this article, or choose the registration format dynamically based on the Windows version number.

Relative to HKEY_CLASSES_ROOT, the sample filter's registry entries look like this on Windows NT 4.0:

.smp <No Name>: REG_SZ: SmpFilt.Document
SmpFilt.Document <No Name>: REG_SZ: Sample Filter Document
   CLSID <No Name>: REG_SZ: {8B0E5E72-3C30-11d1-8C0D-00AA00C26CD4}
CLSID
   {8B0E5E72-3C30-11d1-8C0D-00AA00C26CD4}
      PersistentHandler <No Name>: REG_SZ: {8B0E5E73-3C30-11d1-8C0D-00AA00C26CD4}
   {8B0E5E73-3C30-11d1-8C0D-00AA00C26CD4} <No Name>: REG_SZ: SmpFilt Persistent Handler
      PersistentAddinsRegistered
         {89BCB740-6119-101A-BCB7-00DD010655AF} <No Name>: REG_SZ: {8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4}
   {8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4} <No Name>: REG_SZ: Sample Filter
      InprocServer32 <No Name>: REG_SZ: smpfilt.dll
                     ThreadingModel: REG_SZ: Both

Indexing Service 3.0 and later and Other MSSearch-based Products Filter Registration Model

The registration of filters is slightly different for Indexing Service 3.0, which is built into Microsoft Windows 2000. All current MSSearch-based products adhere to this new format. Filters in this scheme require only two unique CLSIDs.

Relative to HKEY_CLASSES_ROOT, the sample filter's registry entries look like the following when registered on Microsoft Windows 2000.

.smp
   PersistentHandler <No Name>: REG_SZ: {8B0E5E73-3C30-11d1-8C0D-00AA00C26CD4}
CLSID
   {8B0E5E73-3C30-11d1-8C0D-00AA00C26CD4} <No Name>: REG_SZ: SmpFilt Persistent Handler
      PersistentAddinsRegistered
         {89BCB740-6119-101A-BCB7-00DD010655AF} <No Name>: REG_SZ: {8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4}
   {8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4} <No Name>: REG_SZ: Sample Filter
      InprocServer32 <No Name>: REG_SZ: smpfilt.dll
                     ThreadingModel: REG_SZ: Both

This format is similar to the older Index Server 2.0 format, but is more direct. You do not need the file format CLSID in the new layout, though it is still supported for backward compatibility if it exists. When loading an IFilter for a file with the .smp extension, Indexing Service goes to the .smp key and looks at the PersistentHandler value. From this value, it traverses the PersistentAddinsRegistered looking for IID_IFilter. If it finds IID_IFilter, it opens the value from this key and looks for InprocServer32 to find the name and threading model of the filter DLL.
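The lookup chain described above can be traced with a small in-memory model. This is a sketch for illustration only: RegistrySketch is a hypothetical std::map standing in for HKEY_CLASSES_ROOT (key path mapped to its default value), it does not call the Windows registry API, and unlike the real registry it compares key names case-sensitively. The entries in sampleRegistry() are the sample filter's values from the listing above.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for HKEY_CLASSES_ROOT: key path -> default value.
using RegistrySketch = std::map<std::string, std::string>;

// IID_IFilter, as quoted in this article.
const char* const kIidIFilter = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

// Returns the default value of a key, or an empty string if it is absent.
std::string defaultValue(const RegistrySketch& reg, const std::string& key) {
    const auto it = reg.find(key);
    return it == reg.end() ? std::string() : it->second;
}

// Follows extension -> persistent handler -> IID_IFilter -> filter CLSID
// -> InprocServer32, returning the DLL name or an empty string.
std::string resolveFilterDll(const RegistrySketch& reg, const std::string& ext) {
    const std::string handler = defaultValue(reg, ext + "\\PersistentHandler");
    if (handler.empty())
        return {};
    const std::string filterClsid = defaultValue(
        reg, "CLSID\\" + handler + "\\PersistentAddinsRegistered\\" + kIidIFilter);
    if (filterClsid.empty())
        return {};
    return defaultValue(reg, "CLSID\\" + filterClsid + "\\InprocServer32");
}

// The sample filter's entries, flattened into key paths.
RegistrySketch sampleRegistry() {
    const std::string handler = "{8B0E5E73-3C30-11d1-8C0D-00AA00C26CD4}";
    const std::string filter = "{8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4}";
    return {
        {".smp\\PersistentHandler", handler},
        {"CLSID\\" + handler, "SmpFilt Persistent Handler"},
        {"CLSID\\" + handler + "\\PersistentAddinsRegistered\\" + kIidIFilter, filter},
        {"CLSID\\" + filter, "Sample Filter"},
        {"CLSID\\" + filter + "\\InprocServer32", "smpfilt.dll"},
    };
}
```

Resolving ".smp" against these entries yields smpfilt.dll, while an extension with no PersistentHandler value resolves to nothing, which is the point at which a product would fall back to another registration model or a default filter.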

The sample filter includes a C macro called DEFINE_DLLREGISTERFILTER in the file filtreg.hxx that makes registering filters easy. The macro generates the functions DllRegisterServer and DllUnregisterServer. It also detects the operating system version on which the filter is running so that it installs appropriately, in either the old or the new format as described earlier.

The following shows how the sample filter uses the macro.

SClassEntry const asmpClasses[] =
{
  { L".smp",
   L"SmpFilt.Document",
   L"Sample Filter Document",
   L"{8B0E5E72-3C30-11d1-8C0D-00AA00C26CD4}",
   L"Sample Filter Document"
  }
};

SHandlerEntry const smpHandler =
{
  L"{8B0E5E73-3C30-11d1-8C0D-00AA00C26CD4}",
  L"SmpFilt Persistent Handler",
  L"{8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4}"
};

SFilterEntry const smpFilter =
{
  L"{8B0E5E70-3C30-11d1-8C0D-00AA00C26CD4}",
  L"Sample Filter",
  L"smpfilt.dll",
  L"Both"
};

DEFINE_DLLREGISTERFILTER( smpHandler, smpFilter, asmpClasses )

To use the macro in your filter, replace the names and CLSIDs from this sample with those that match your filter. The "Both" entry in the SFilterEntry structure specifies the threading model for the filter. It means that the filter can be used in both apartment and free-threaded models.

Note   If your filter is not thread-safe, be sure to specify "Apartment". Otherwise, the filter can cause the search indexing process to crash. However, apartment model filters are not loaded on Microsoft Windows XP, and with products like SQL Server, using apartment model filters can cause crawling to take orders of magnitude longer. Making your filters thread-safe and marking them free-threaded is strongly recommended for system robustness and performance.

Use the command regsvr32 myfilter.dll to register a filter DLL and regsvr32 -u myfilter.dll to unregister it. You should test your filter registration and unregistration in a debugger, especially if you added any custom code to the registration. Regsvr32 does not always report an error when registration fails.

Each time the Indexing Service process starts, it reads the registry value HKLM\System\CurrentControlSet\Control\ContentIndex\DllsToRegister. This is a multi-string (REG_MULTI_SZ) value that lists the DLLs for which DllRegisterServer is invoked. This feature was added to ensure that filters and word breakers remain registered. Some applications, during their own installation, inadvertently remove the IFilter registrations for the file types they work with. This is generally a problem only with file extensions that are shared by multiple applications. Adding your filter DLL to this list when you install it helps ensure that your filter stays registered.
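Because DllsToRegister is a multi-string value, an installer has to preserve the existing entries when it adds its own. The following is a minimal, portable sketch of that buffer manipulation only, assuming the usual layout of null-separated strings ended by an extra null. ParseMultiSz, BuildMultiSz, and AddDllToList are hypothetical helper names, the comparison is a simple exact match, and the actual registry read and write (for example, with RegQueryValueEx and RegSetValueEx) is left out.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Splits a multi-string buffer (strings separated by L'\0', list ended by an
// extra L'\0') into individual strings.
std::vector<std::wstring> ParseMultiSz(const std::vector<wchar_t>& buf)
{
    std::vector<std::wstring> items;
    std::wstring current;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] != L'\0') {
            current += buf[i];
            continue;
        }
        if (current.empty()) break;  // empty string marks the end of the list
        items.push_back(current);
        current.clear();
    }
    if (!current.empty()) items.push_back(current);  // tolerate a missing terminator
    return items;
}

// Serializes strings back into the multi-string layout.
std::vector<wchar_t> BuildMultiSz(const std::vector<std::wstring>& items)
{
    std::vector<wchar_t> buf;
    for (std::size_t i = 0; i < items.size(); ++i) {
        buf.insert(buf.end(), items[i].begin(), items[i].end());
        buf.push_back(L'\0');
    }
    buf.push_back(L'\0');  // final terminator
    return buf;
}

// Returns a new buffer that contains dll exactly once, keeping existing entries.
// (Assumption: exact string match; a real installer may want to ignore case.)
std::vector<wchar_t> AddDllToList(const std::vector<wchar_t>& existing,
                                  const std::wstring& dll)
{
    std::vector<std::wstring> items = ParseMultiSz(existing);
    if (std::find(items.begin(), items.end(), dll) == items.end())
        items.push_back(dll);
    return BuildMultiSz(items);
}
```

Calling AddDllToList a second time with the same DLL name returns the buffer unchanged, so the installer can run repeatedly without growing the list.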

Resources

The Microsoft Platform SDK contains overview and reference material for IFilter plus two sample filters. The first is the simple text filter described in this article.

The second sample is an HTML Meta tag property filter. This filter layers itself on top of the HTML filter that ships in MSSearch-based products and converts Meta properties from strings to configurable data types, enabling advanced searching, sorting, and property retrieval. This filter is described in detail in the MSDN article Using HTML Meta Properties with Microsoft Index Server. This sample filter is especially useful for getting a better handle on how to use the IFilter::GetValue() method, which does not do as much in the simple text filter described in this article.

Testing Filter DLLs

The Platform SDK includes three tools that you can use to help test filter DLLs.

  • ifilttst.exe helps validate a filter by calling IFilter methods and checking the results for compliance with the specification. For example, the test checks for unique and increasing chunk IDs, for consistent IFilter behavior after re-initialization, and that IFilter method calls with invalid parameters return the expected error codes. Documentation for ifilttst.exe is included in the Platform SDK.
  • filtdump.exe takes as an argument the name of a file to filter, loads the appropriate filter, and prints the filter's output. It uses the LoadIFilter() API documented in the Platform SDK to load the filter for the file specified on the command line. For example, the command filtdump myfile.smp instructs filtdump.exe to load smpfilt.dll, retrieve all text and properties from the filter, and print the results.
  • Filtreg.exe inspects filter installation information in the registry. Filtreg.exe enumerates all file extensions that have filters associated with them and prints the file extension and DLL name of the filter. This is a good way to verify that you have installed the filter correctly.
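
One of the checks listed for ifilttst.exe, unique and increasing chunk IDs, is easy to approximate in a test harness of your own. The following is a minimal sketch and not the tool's implementation. FirstBadChunkId is a hypothetical helper, and it assumes you have already collected the idChunk values returned by successive GetChunk calls into a vector.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first chunk ID that is not strictly greater than
// the one before it, or ids.size() if every ID is unique and increasing.
std::size_t FirstBadChunkId(const std::vector<unsigned long>& ids)
{
    for (std::size_t i = 1; i < ids.size(); ++i) {
        // A repeated or smaller ID breaks the "unique and increasing" rule.
        if (ids[i] <= ids[i - 1])
            return i;
    }
    return ids.size();
}
```

Checking for strictly increasing values covers uniqueness as well, so a single pass is enough. Returning the offending index rather than a Boolean makes it easier to report which chunk broke the rule.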

Because filters are used in a variety of MSSearch products, you should try to test with more than one.

Suggested IFilter Tests

  • Make sure ifilttst.exe does not stop responding by checking the processor usage while it runs. This can help you detect deadlocks in the filter.
  • Make sure output chunks are marked with the right LCID if the filter supports multiple languages. Test all languages supported by the filter.
  • Test the filtering of very large documents to ensure the filter works as expected.
  • Test getting output from every property and content type that the file type supports. For example, Word documents have headers, notes, text boxes, and so on.
  • Use big property values, for example, use a large metatag in HTML documents.
  • Check that the filter does not leak file handles by editing the file after getting output from the filter, or by using a tool like oh.exe before and after filtering a file. For more information on oh.exe, see Windows 2000 Resource Kit Tool: Open Handles (oh.exe).
  • If the file type supports embedded documents, test getting output from them. Consider creating a document with different types of nested embeddings.
  • Test all file types associated with the filter. For example, check that the HTML filter works with both .htm and .html file types. You can use filtreg.exe in the Indexing Service SDK to print a complete list of file extensions that have associated filters.
  • Test with corrupted files. Filters should fail gracefully.
  • If an application supports encryption, test that the filter does not output encrypted text.
  • Use multiple special Unicode characters in the file contents and test for their output. The following figure provides a sample of Unicode characters to test.

Figure 1. Unicode characters to test

Setup Tests

  • Installation must recover from failed installations, for example, from canceling and then restarting setup.
  • Uninstall must delete all files associated with the filter.
  • Uninstall must not delete files other than the ones associated with the filter installation.
  • Registry keys associated with the filter must be removed when uninstalled.
  • Uninstall must work even if files are deleted from the installation directory.

Additional References for Testing

For information about testing the IFilter interface, see the IFilter topic in the Windows Platform SDK.

For information about testing filters, see Testing Filters in the Platform SDK.

Conclusion

As the amount of information in a site grows, people rely heavily on search to find the documents they need quickly and easily. In this article, we examined the sample filter implementation and worked with a copy that you can use for reference.

A filter is a COM object that implements the IFilter interface and one or more of the IPersist family of interfaces. Search applications create filter objects and invoke methods on these interfaces to retrieve the text and properties of documents. Writing an IFilter is the only way you can ensure that your file format is included in the full-text index. You have learned how to write a filter that search applications can use, including COM interface filter support, filter implementation, and filter installation and testing.

Writing an IFilter object for your particular file format is the best way to ensure that your customers can find their data using Microsoft Search (MSSearch)-based products such as Microsoft Office SharePoint Portal Server 2003, Microsoft SQL Server, Microsoft Exchange Server, and Microsoft Indexing Service. These products use filters to extract the content and properties of files for inclusion in a full-text index.