How to Configure the Website Connector Using Web.config

Setting "EnableSourceSystemAPILogging" to "True" causes the connector to crash, because the connector does not support this feature.
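
The check described above can be automated before deployment. The following is a minimal sketch in Python, not part of the connector. It assumes the settings live in the standard <appSettings> section as <add key="..." value="..." /> entries, which this article does not confirm; the sample fragment and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical web.config fragment. The <appSettings>/<add key value> layout
# is an assumption; check it against the actual file shipped with the connector.
SAMPLE_WEB_CONFIG = """<configuration>
  <appSettings>
    <add key="EnableSourceSystemAPILogging" value="True" />
    <add key="MaxCrawlDepth" value="0" />
  </appSettings>
</configuration>
"""


def read_app_settings(xml_text):
    """Return the appSettings key/value pairs as a dict of strings."""
    root = ET.fromstring(xml_text)
    settings = {}
    for node in root.findall("./appSettings/add"):
        settings[node.get("key")] = node.get("value", "")
    return settings


def logging_flag_is_enabled(settings):
    """True if EnableSourceSystemAPILogging is set to "True" (any letter case)."""
    value = settings.get("EnableSourceSystemAPILogging", "False")
    return value.strip().lower() == "true"


settings = read_app_settings(SAMPLE_WEB_CONFIG)
if logging_flag_is_enabled(settings):
    print("EnableSourceSystemAPILogging is True; set it to False for this connector.")
```

To check a real file, read its text and pass it to read_app_settings in place of the sample string.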


The following settings can be changed in the web.config file. The default value is shown after each setting name:

  • MaxCrawlDepth: 0 (a value of 0 means the maximum depth is ignored)
  • Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

  • UserAgent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.

  • AcceptEncoding: gzip,deflate,sdch

  • CrawlRequestTimeoutInMinutes: 30
    • Sets the crawl engine's processing timeout, in minutes

  • AdditionalRequestTimeInMinutesOnRetries: 5
    • The number of minutes added to the original CrawlRequestTimeoutInMinutes on each successive retry of an item that timed out. Example: if the first request times out after 30 minutes, the 1st retry waits 35 minutes, the 2nd retry waits 40 minutes, and so on.

  • MaximumNumberOfCrawlRequestsToCreatePerBatch: 1000

  • MaximumNumberOfCrawlThread: 10
    • Sets the number of threads the crawl engine uses when processing hyperlinks

  • HttpWebRequestRetries: 5

  • OverrideDatabaseConfigurationWithXmlConfiguration: True
    • When set to true, the crawl engine is configured from the arachnode configuration files instead of the configuration options in the arachnode tables

  • MemoryCacheTimeLimitHours: 8
    • Specifies how long an item remains in the memory cache if not removed by the Connector.

  • InsertArachnodeCrawlDataIntoDatabase: False
    • If set to true, stores and uses crawl information in the arachnode database.
    • Important: This impacts crawl performance

  • HyperLinkRegex:
    • Holds the default hyperlink regex pattern, which is used if the pattern is removed from the Connection Configuration page.
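
The retry arithmetic described for AdditionalRequestTimeInMinutesOnRetries can be worked through in a few lines. This is an illustrative sketch, not connector code: the function name is hypothetical, and it assumes the timeout keeps growing by the same amount on every retry, as the 30, 35, 40 example suggests. The retry count of 5 is only borrowed from the HttpWebRequestRetries default for illustration; this article does not state that the two settings are linked.

```python
def retry_timeouts(base_minutes, additional_minutes, retries):
    """Return the timeout, in minutes, for the first request and each retry.

    Index 0 is the original request; index n is the n-th retry.
    """
    return [
        base_minutes + attempt * additional_minutes
        for attempt in range(retries + 1)
    ]


# Defaults from this article: CrawlRequestTimeoutInMinutes = 30,
# AdditionalRequestTimeInMinutesOnRetries = 5.
schedule = retry_timeouts(30, 5, 5)
print(schedule)  # [30, 35, 40, 45, 50, 55]
```

Under these assumptions, an item that times out on every attempt would occupy a crawl thread for the sum of the listed values, which is worth keeping in mind when raising either setting.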

Additional Filters and Crawl Engine Settings

The crawl engine behaviors listed below can be modified in the arachnode configuration files. Preserve the XML structure of each file and follow the editing rules specified for it.

  • Changing web page filters
  • Adding or removing file extensions
  • Adding new content types
  • Honoring robots.txt

The following files can be found and modified in the Connector installation directory under the /Arachnode folder:

  • cfg.ContentTypes.xml:
    • This file is used to add supported MIME types
    • ID and Name values must be unique inside the file
    • ContentTypeTypeID can have the following numbers as values:
      • 1 - Application
      • 2 - Audio
      • 4 - Images
      • 5 - Messages
      • 6 - Models
      • 7 - Multipart messages
      • 8 - Text
      • 9 - Videos

  • cfg.AllowedDataTypes.xml:
    • An XML file that matches supported MIME types with file extensions
    • ID must be unique inside the file
    • ContentTypeID must be a valid MIME type ID found in the cfg.ContentTypes.xml file
    • FullTextIndexType must contain the file extension (.docx, .pdf)
      • Multiple extensions can be associated with a single MIME type by using a comma (,) as the separator
      • An extension can be associated with multiple MIME types by adding multiple entries, each one referencing a separate ID from the cfg.ContentTypes.xml file
    • DiscoveryTypeID can have one or more of the following numbers as values:
      • 0, 1, 2, 3, 5 - Assigned automatically depending on the results of the web request
      • 4 - File
      • 6 - Image
      • 7 - Web page

  • cfg.DisallowedWords.xml:
    • Web pages that contain words from this file (in header, URL, or content) will be ignored

  • cfg.CrawlRules.xml:
    • A list of rules that can be applied at different stages of web request processing to decide whether a URL is allowed
    • It can be used to enable rules based on robots.txt, named anchors, query strings, and web request frequency.

  • cfg.AllowedSchemes.xml
  • cfg.DisallowedFileExtensions.xml
  • cfg.DisallowedExtensions.xml
  • cfg.DisallowedDomains.xml
  • cfg.DisallowedSchemes.xml
  • cfg.DisallowedHosts.xml
  • dbo.DisallowedAbsoluteUris.xml
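
The editing rules given above for cfg.ContentTypes.xml and cfg.AllowedDataTypes.xml can be checked mechanically before the files are deployed. The sketch below is illustrative only. Because this article does not show the XML layout of those files, the entries are represented as plain Python dictionaries that are assumed to have been loaded already; the sample IDs and MIME type names are hypothetical, and the requirement that each extension starts with a dot is inferred from the ".docx, .pdf" example rather than stated.

```python
# ContentTypeTypeID values listed in this article.
VALID_CONTENT_TYPE_TYPE_IDS = {1, 2, 4, 5, 6, 7, 8, 9}


def find_problems(content_types, allowed_data_types):
    """Return a list of rule violations; an empty list means no problems found."""
    problems = []

    # cfg.ContentTypes.xml: ID and Name must be unique.
    ids = [row["ID"] for row in content_types]
    names = [row["Name"] for row in content_types]
    if len(ids) != len(set(ids)):
        problems.append("cfg.ContentTypes.xml: duplicate ID")
    if len(names) != len(set(names)):
        problems.append("cfg.ContentTypes.xml: duplicate Name")
    for row in content_types:
        if row["ContentTypeTypeID"] not in VALID_CONTENT_TYPE_TYPE_IDS:
            problems.append(
                f"cfg.ContentTypes.xml: ID {row['ID']} has an unknown ContentTypeTypeID"
            )

    # cfg.AllowedDataTypes.xml: ID unique, ContentTypeID must exist,
    # FullTextIndexType holds one or more comma-separated extensions.
    allowed_ids = [row["ID"] for row in allowed_data_types]
    if len(allowed_ids) != len(set(allowed_ids)):
        problems.append("cfg.AllowedDataTypes.xml: duplicate ID")
    known_ids = set(ids)
    for row in allowed_data_types:
        if row["ContentTypeID"] not in known_ids:
            problems.append(
                f"cfg.AllowedDataTypes.xml: ID {row['ID']} references "
                f"missing ContentTypeID {row['ContentTypeID']}"
            )
        extensions = [ext.strip() for ext in row["FullTextIndexType"].split(",")]
        # Assumption: every extension is written with a leading dot.
        if any(len(ext) < 2 or not ext.startswith(".") for ext in extensions):
            problems.append(
                f"cfg.AllowedDataTypes.xml: ID {row['ID']} has a malformed extension list"
            )
    return problems


# Hypothetical sample entries.
CONTENT_TYPES = [
    {"ID": 1, "Name": "application/pdf", "ContentTypeTypeID": 1},
    {"ID": 2, "Name": "text/html", "ContentTypeTypeID": 8},
]
ALLOWED_DATA_TYPES = [
    {"ID": 1, "ContentTypeID": 1, "FullTextIndexType": ".pdf"},
    {"ID": 2, "ContentTypeID": 3, "FullTextIndexType": ".docx,.doc"},
]

for problem in find_problems(CONTENT_TYPES, ALLOWED_DATA_TYPES):
    print(problem)
```

In this sample, the second cfg.AllowedDataTypes.xml entry is reported because it references a ContentTypeID that has no matching entry in cfg.ContentTypes.xml.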