Defining Crawl Rules and File Types

In Enterprise Search in Microsoft Office SharePoint Server 2007, you use crawl rules and extensions to define how a particular set of content from a content source should be crawled.

Crawl Rules

Crawl rules let you control how the Enterprise Search index engine behaves when it crawls content from a particular path. By using these rules, you can:

  • Prevent content within a particular path from being crawled.

    For example, suppose a content source points to a URL path such as https://www.microsoft.com/, but you want to prevent content in the "downloads" subdirectory, https://www.microsoft.com/downloads/, from being crawled. You would set up a rule for that URL, with the behavior set to exclude content from that subdirectory.

  • Indicate that a particular path that would otherwise be excluded from the crawl should be crawled.

    Using the previous scenario, if the downloads directory contained a directory called "content" that should be included in the crawl, you would create a crawl rule for the URL https://www.microsoft.com/downloads/content, with the behavior set to include the "content" subdirectory.

Note

This only applies to HTTP content.

  • Specify authentication credentials.

    Use this type of rule when the content being accessed requires credentials different from those specified for the default content access account.

You can use the asterisk (*) as a wildcard character in crawl rules, for example:

http://*.microsoft.com/*.html
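
The wildcard behavior can be illustrated with a minimal Python sketch. This is not the Enterprise Search implementation; the helper names (wildcard_to_regex, matches) are hypothetical, and the sketch assumes that the asterisk matches any run of characters and that a rule must cover the whole URL.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    # Escape every literal piece, then let each '*' match any run of characters.
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(".*".join(pieces))

def matches(pattern: str, url: str) -> bool:
    # Assumption: the rule must match the entire URL, not just a prefix.
    return wildcard_to_regex(pattern).fullmatch(url) is not None

rule = "http://*.microsoft.com/*.html"
print(matches(rule, "http://www.microsoft.com/index.html"))    # True
print(matches(rule, "http://www.microsoft.com/default.aspx"))  # False
```

Under these assumptions, the rule above covers .html pages on any subdomain of microsoft.com reached over HTTP, and nothing else.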

Note

Do not use rules as another way of defining content sources or providing scope. Instead, use rules to specify more details about how to handle a particular set of content from a content source.

Crawl Rule Order

Rule order is important, because the first rule that matches a particular set of content is the one that is applied. For example, suppose a rule that excludes .aspx pages is listed before a rule that includes content within http://hostname. Any time the crawler encounters an .aspx page within http://hostname, the page is excluded, even though it matches both rules, because no other rules are applied after the first match.
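
The first-match behavior can be sketched in Python as follows. This is a conceptual model only, not the crawler's code: the Rule type, the two rule paths, and the matching helper are hypothetical, and the sketch assumes that the asterisk matches any run of characters.

```python
import re
from typing import List, NamedTuple, Optional

class Rule(NamedTuple):
    pattern: str   # path, optionally containing '*' wildcards (hypothetical format)
    include: bool  # True = include matching content, False = exclude it

def matches(pattern: str, url: str) -> bool:
    # Assumption: '*' matches any run of characters; the whole URL must match.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.fullmatch(regex, url) is not None

def first_match(rules: List[Rule], url: str) -> Optional[Rule]:
    # Walk the rules in order and stop at the first one that matches.
    for rule in rules:
        if matches(rule.pattern, url):
            return rule
    return None

rules = [
    Rule("http://hostname/*.aspx", include=False),  # listed first, so it wins
    Rule("http://hostname/*", include=True),
]
applied = first_match(rules, "http://hostname/default.aspx")
print(applied.include)  # False
```

Reversing the two entries in the list would make the include rule win for the same URL, which is why the order matters.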

Crawl Rule Object Model

Individual crawl rules are represented by the CrawlRule class. The full set of crawl rules is contained in the CrawlRuleCollection class. By using the CrawlRuleCollection class, you can add new crawl rules with the Create method, set the priority of an existing crawl rule with the SetPriority method, and use the Test method to check a URL or path against all the crawl rules to determine which one applies.

To update or test an individual crawl rule, use the CrawlRule object. You also use this object to specify the content access credentials for content that matches the rule, or to delete the rule.
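
To show how creating, prioritizing, and testing rules interact, here is a toy Python model. It is not the SharePoint object model and does not reproduce its signatures: ToyRule, ToyRuleCollection, and their methods are hypothetical names that only mirror the operations described above, and the sketch assumes that new rules are appended at the end and that priority is a zero-based position.

```python
import re
from dataclasses import dataclass

@dataclass
class ToyRule:
    path: str      # path, optionally with '*' wildcards
    include: bool  # True = include, False = exclude

class ToyRuleCollection:
    """Toy stand-in for an ordered rule collection; not the SharePoint API."""

    def __init__(self):
        self._rules = []

    def create(self, path, include):
        rule = ToyRule(path, include)
        self._rules.append(rule)  # assumption: new rules go to the end
        return rule

    def set_priority(self, rule, priority):
        # Move an existing rule to the given zero-based position.
        self._rules.remove(rule)
        self._rules.insert(priority, rule)

    def test(self, url):
        # Return the first rule whose path matches the URL, or None.
        for rule in self._rules:
            regex = ".*".join(re.escape(p) for p in rule.path.split("*"))
            if re.fullmatch(regex, url):
                return rule
        return None

rules = ToyRuleCollection()
exclude_downloads = rules.create("https://www.microsoft.com/downloads/*", include=False)
include_content = rules.create("https://www.microsoft.com/downloads/content/*", include=True)

url = "https://www.microsoft.com/downloads/content/page.html"
before = rules.test(url)                # the exclude rule matches first
rules.set_priority(include_content, 0)  # move the include rule to the top
after = rules.test(url)                 # now the include rule applies
print(before.include, after.include)    # False True
```

In this model, testing a URL before and after a priority change is a quick way to confirm that the more specific rule is the one that takes effect.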

File Types

The file type inclusions/exclusions list contains the file name extensions that identify which file types the crawler should include in, or exclude from, the index. For the crawler to extract the contents and properties of a particular type of file, a filter for that file type must be installed on the server on which the index service is running.

You can also use the list to exclude a particular file type, even if there is an installed filter associated with that file type.
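
The relationship between the extension list and installed filters can be sketched in Python. This is an illustrative model only: the extension sets and the crawl_decision helper are hypothetical, and the sketch assumes the list is treated as an inclusion list.

```python
import os

# Hypothetical configuration for illustration.
included_extensions = {"doc", "html", "pdf", "txt"}
installed_filters = {"doc", "html", "txt", "zip"}  # no pdf filter; a zip filter exists

def extension_of(file_name: str) -> str:
    # "Report.PDF" -> "pdf"; names without an extension yield "".
    return os.path.splitext(file_name)[1].lstrip(".").lower()

def crawl_decision(file_name: str) -> dict:
    ext = extension_of(file_name)
    listed = ext in included_extensions
    return {
        # The file type is considered only if its extension is on the list.
        "crawl": listed,
        # Contents and properties can be extracted only if a filter is installed.
        "extract_contents": listed and ext in installed_filters,
    }

print(crawl_decision("notes.txt"))    # {'crawl': True, 'extract_contents': True}
print(crawl_decision("report.pdf"))   # {'crawl': True, 'extract_contents': False}
print(crawl_decision("archive.zip"))  # {'crawl': False, 'extract_contents': False}
```

The zip case corresponds to excluding a file type even though a filter for it is installed.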

File Type Object Model

Individual file name extensions are represented by the Extension class. You can use this object to remove a file name extension. Extensions are grouped within an ExtensionCollection object. Use the collection's Create method to add a new file name extension.

See Also

Tasks

How to: Return the Search Context for the Search Service Provider

Reference

Microsoft.Office.Server.Search.Administration.CrawlRule
Microsoft.Office.Server.Search.Administration.Extension

Concepts

Managing Content
Getting Started with the Enterprise Search Administration Object Model