Introduction to Protocol Handlers

SharePoint Portal Server allows external content sources to be added to the workspace and crawled. Protocol handlers are software components of the Filter Daemon that implement the protocol for accessing a content source in its native format, which exposes the content source to crawling by the Search service. Figure 12 illustrates the protocol handler architecture and the data flow during the crawl process.

Figure 12: Protocol Handler Architecture

Crawls are initiated within the Gatherer process for a SharePoint Portal Server workspace. The Gatherer receives a URL for content that must be crawled. The URL can be the start address for a content source, a link stored from a previous crawl, or a notification from a SharePoint Portal Server workspace. The Gatherer checks the URL against the crawl restrictions set for this workspace.
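The restriction check can be pictured with a minimal Python sketch. The rule format used here (a set of allowed hosts and a list of excluded path prefixes) and the function name is_allowed are assumptions made for illustration. They do not describe how SharePoint Portal Server actually stores or evaluates its crawl restrictions.

```python
from urllib.parse import urlsplit


def is_allowed(url, allowed_hosts, excluded_prefixes):
    """Return True if the URL passes the illustrative crawl restrictions.

    allowed_hosts and excluded_prefixes are hypothetical rule sets,
    not the product's real restriction format.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in allowed_hosts:
        return False
    # Reject any path that falls under an excluded prefix.
    return not any(parts.path.startswith(p) for p in excluded_prefixes)
```

A URL that fails the check would simply not be handed on for filtering.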

When crawling of a content source starts, a crawler (robot) thread in the Gatherer passes the crawl request to the Filter Daemon. The robot thread allocates a Filter object from a pool. When the Filter object is allocated, it is associated with a Filter thread object. Each document being filtered corresponds to one Filter thread in the Filter Daemon. The Filter Daemon runs in a separate process from the Gatherer so that the Filter Daemon process can be terminated if it encounters errors or enters a loop. The Filter Daemon and the Gatherer communicate by using pipes of shared memory.

The Filter thread receives the URL of the content to be filtered, along with the time the content was last crawled. The Filter thread determines and invokes the appropriate protocol handler for the URL item. The protocol handler creates a UrlAccessor object that will control the filtering of this item.

URL items passed to the Filter thread can be either the start address for a content source or the URL of an item inside the content source. In process 1, illustrated in the preceding figure, the Gatherer provides the Filter thread with the start address for a content source. The protocol handler produces an enumeration of items inside the content source. The Gatherer acts on this enumeration of items and determines which items need to be filtered. It associates a URL with each item, and queues the items for further examination by the Filter Daemon.
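Process 1 can be sketched in Python as follows. The text does not say how the Gatherer decides which items need filtering, so this sketch assumes a simple rule: an item is queued if it has never been crawled or if it was modified after its last crawl. The function name queue_items, the (name, modified time) pairs, and the numeric timestamps are all hypothetical.

```python
from collections import deque


def queue_items(start_url, enumerated, last_crawled):
    """Build the queue of item URLs that need filtering.

    enumerated   -- iterable of (item_name, modified_time) pairs, standing
                    in for the enumeration a protocol handler produces
    last_crawled -- dict mapping item URL to the time it was last crawled
    """
    queue = deque()
    base = start_url.rstrip("/")
    for name, modified in enumerated:
        url = f"{base}/{name}"  # associate a URL with each item
        previous = last_crawled.get(url)
        # Assumed rule: queue new items and items changed since last crawl.
        if previous is None or modified > previous:
            queue.append(url)
    return queue
```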

If the URL points to a specific item within the content source, data from the UrlAccessor object can follow one of two paths from the protocol handler to the Filter thread. These paths are labeled 2a and 2b in the preceding figure.

In process 2a in the preceding figure, the Filter thread was issued a URL for an item in the content source. The protocol handler can return the data in one of two ways. In the first, it passes the contents of the item in a stream that it opens for the Filter thread in the Filter Daemon. This happens through the BindToStream method on the UrlAccessor, and the Filter thread then invokes an appropriate IFilter on the Stream object created for the document. In the second, the protocol handler returns the file name of the item that the URL points to. The Filter thread uses the file name to access the file directly and chooses an appropriate IFilter.
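The two alternatives in process 2a can be illustrated with a small Python sketch. The classes StreamAccessor and FileAccessor and their methods are illustrative stand-ins, not the COM UrlAccessor interface. Only the name BindToStream comes from the text; get_file_name is a hypothetical name for the second alternative.

```python
import io


class StreamAccessor:
    """Hypothetical accessor that exposes item content as a stream."""

    def __init__(self, data):
        self._data = data

    def bind_to_stream(self):
        return io.BytesIO(self._data)

    def get_file_name(self):
        return None


class FileAccessor:
    """Hypothetical accessor that only reports where the file lives."""

    def __init__(self, path):
        self._path = path

    def bind_to_stream(self):
        return None

    def get_file_name(self):
        return self._path


def read_item(accessor):
    """Read item content from a stream if offered, else from the file."""
    stream = accessor.bind_to_stream()
    if stream is not None:
        with stream:
            return stream.read()
    # No stream available: open the file directly by name.
    with open(accessor.get_file_name(), "rb") as handle:
        return handle.read()
```

In both cases the caller ends up with the raw content, which is the point at which an IFilter would be applied.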

In process 2b, the UrlAccessor uses the ProtocolHandlerSite object to query the Filter Daemon for the appropriate IFilter to use for the URL item. The choice of IFilter is based on the file extension, on a Class ID that identifies the file's content in the registry, or on the MIME Content Type. The UrlAccessor object applies this IFilter to the URL item and returns the filtered data to the Filter thread.
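The selection of an IFilter from these three pieces of information might look like the following Python sketch. The lookup tables stand in for registry data, the filter names and the Class ID value are invented for illustration, and the order of precedence (extension, then Class ID, then MIME type) is an assumption, since the text does not specify one.

```python
import os

# Hypothetical lookup tables standing in for registry data.
BY_EXTENSION = {".htm": "HtmlFilter", ".txt": "TextFilter"}
BY_CLASS_ID = {"{00000000-0000-0000-0000-000000000001}": "ExampleDocFilter"}
BY_MIME_TYPE = {"text/html": "HtmlFilter", "text/plain": "TextFilter"}


def choose_filter(file_name=None, class_id=None, mime_type=None):
    """Return the name of a filter for the item, or None if none matches."""
    if file_name:
        ext = os.path.splitext(file_name)[1].lower()
        if ext in BY_EXTENSION:
            return BY_EXTENSION[ext]
    if class_id in BY_CLASS_ID:
        return BY_CLASS_ID[class_id]
    if mime_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        key = mime_type.split(";")[0].strip().lower()
        return BY_MIME_TYPE.get(key)
    return None
```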

After the Filter thread has established a connection to the IFilter for the item being accessed, the filtering process is the same for both protocol handler data paths. The IFilter process enters a loop of reading from the URL item and producing filtered data that is returned to the Gatherer process. The IFilter first extracts metadata that corresponds to properties that are marked retrievable in the SharePoint Portal Server schema, such as title, file size, and last modified date. Then it breaks the item content into chunks of text.
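The order of output described above, metadata first and then chunks of text, can be sketched as a Python generator. This is a simplified model, not the IFilter interface: the function name filter_item, the record tuples, and the fixed-size chunking are assumptions chosen to keep the example short.

```python
def filter_item(properties, text, retrievable, chunk_size=16):
    """Yield property records first, then the text broken into chunks.

    properties  -- dict of all metadata found on the item
    retrievable -- names of properties marked retrievable (hypothetical list)
    """
    # Emit only the metadata that is marked retrievable.
    for name in retrievable:
        if name in properties:
            yield ("property", name, properties[name])
    # Then emit the content as fixed-size chunks of text.
    for start in range(0, len(text), chunk_size):
        yield ("text", text[start:start + chunk_size])
```

Using a generator mirrors the loop in the text: the consumer pulls one record at a time until the item is exhausted.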

All errors that occur during this process are flagged in the Gatherer Log. Upon error, the filtering process may be terminated. The items that were being filtered when the Filter Daemon was terminated are re-queued by the Gatherer for later filtering.

Protocol Handler Initialization

The Filter Daemon starts and initializes all registered protocol handlers. After a protocol handler is CoCreated, initialization is performed by a call to the Init method of the SearchProtocol object.

Protocol Handler Selection

When a crawl for a content source is initiated, the Gatherer determines the URL for the start address of the content source and passes this URL to the Filter Daemon. The Filter Daemon determines the appropriate protocol handler for the content source. Content source types are distinguished by their URL prefixes, such as the protocol name of the URL. For example, if the URL is of the form http://www.microsoft.com, the Filter Daemon uses the protocol handler associated with the HTTP protocol.
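Selection by URL prefix amounts to a lookup keyed on the protocol name. The Python sketch below shows the idea. The handler classes and the HANDLERS table are hypothetical placeholders and say nothing about how protocol handlers are actually registered with the Filter Daemon.

```python
from urllib.parse import urlsplit


class HttpHandler:
    """Placeholder for a handler associated with the HTTP protocol."""


class FileHandler:
    """Placeholder for a handler associated with file shares."""


# Hypothetical registration table: protocol name -> handler class.
HANDLERS = {"http": HttpHandler, "file": FileHandler}


def select_handler(url):
    """Pick the handler class registered for the URL's protocol name."""
    scheme = urlsplit(url).scheme.lower()
    try:
        return HANDLERS[scheme]
    except KeyError:
        raise LookupError(f"no protocol handler registered for {scheme!r}")
```

With this table, select_handler("http://www.microsoft.com") returns HttpHandler, and a URL with an unregistered protocol name raises LookupError.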

Protocol Handler Security

Security for the Gatherer is implemented through Microsoft Windows NT® Security Descriptors. Protocol handlers must, therefore, use domain groups and the Search service must exist in a trusted domain.

Protocol handlers receive security credentials for the content source they are accessing from the Filter Daemon. These credentials, in addition to the authentication method, are specified when the content source is created and configured on the workspace.

If no access account is set, the protocol handler runs as a service under the local system account.

Types of Protocol Handlers

SharePoint Portal Server provides support for hierarchical and link-based protocol handlers. Hierarchical protocol handlers work with structured content sources, such as file shares, whose directories or folders must be traversed. Link-based protocol handlers work with content sources such as Web sites, where links within the content indicate how the source will be traversed.
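The difference between the two traversal styles can be shown with a short Python sketch over in-memory data. Nested dictionaries stand in for folders, and a dictionary of page-to-links lists stands in for a Web site. Both data shapes and both function names are assumptions for illustration, and the breadth-first order used for links is a choice made here, not something the text prescribes.

```python
from collections import deque


def walk_hierarchy(folder, prefix=""):
    """Hierarchical traversal: descend through nested folders (dicts)."""
    for name, child in folder.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            yield from walk_hierarchy(child, path)
        else:
            yield path


def follow_links(start, links):
    """Link-based traversal: visit pages by following links found in them."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # avoid revisiting pages and cycles
                seen.add(target)
                queue.append(target)
    return order
```

The hierarchical walk needs no record of visited items because a folder tree has no cycles, whereas the link-based walk must track visited pages because links can point back to pages already seen.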