How to Configure the Website Connector Using Web.config
Setting "EnableSourceSystemAPILogging" to "True" will cause the connector to crash as it does not support this feature
The following settings can be changed in the web.config file; the default value for each is shown. A sample web.config excerpt appears after the list.
- MaxCrawlDepth: 0 (a value of 0 ignores max depth)
  - Indicates the depth at which the crawler stops following hyperlinks, starting from the specified website URL.
  - For example, a value of 2 means the connector crawls the root pages configured in the connection settings, plus any page linked from those root pages, and then stops; no further links are crawled.
  - Entering a value of 0 can result in the connector retrieving all web pages found across the site, which may result in longer crawl times.
  - MaxCrawlDepth does not override other filters.
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- UserAgent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24
- AcceptEncoding: gzip,deflate,sdch
- CrawlRequestTimeoutInMinutes: 30
  - Sets the crawl engine's request-processing timeout, in minutes.
- AdditionalRequestTimeInMinutesOnRetries: 5
  - Additional minutes added to the original CrawlRequestTimeoutInMinutes on each successive retry of an item that timed out. For example, if the first request times out after 30 minutes, the first retry waits 35 minutes, the second 40, and so on.
- MaximumNumberOfCrawlRequestsToCreatePerBatch: 1000
- MaximumNumberOfCrawlThread: 10
  - Sets the number of threads the crawl engine uses when processing hyperlinks.
- HttpWebRequestRetries: 5
- OverrideDatabaseConfigurationWithXmlConfiguration: True
  - When set to true, the crawl engine is configured from the arachnode configuration files instead of the configuration options in the arachnode tables.
- MemoryCacheTimeLimitHours: 8
  - Specifies how long an item remains in the memory cache if not removed by the connector.
- InsertArachnodeCrawlDataIntoDatabase: False
  - If set to true, stores and uses crawl information in the arachnode database.
  - Important: Enabling this setting impacts crawl performance.
- HyperLinkRegex:
  - Contains the default value of the hyperlink regex pattern, which is used if the pattern is removed from the Connection Configuration page.
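Taken together, a minimal sketch of how these keys might appear in the connector's web.config is shown below. It assumes the settings live in the standard <appSettings> section; the key names and default values are the ones listed above, but the surrounding structure may differ in your installation.

```xml
<configuration>
  <appSettings>
    <!-- 0 removes the depth limit; the crawler may retrieve every page it finds -->
    <add key="MaxCrawlDepth" value="0" />
    <!-- HTTP request headers sent by the crawl engine -->
    <add key="Accept" value="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" />
    <add key="UserAgent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24" />
    <add key="AcceptEncoding" value="gzip,deflate,sdch" />
    <!-- Base timeout per request; each retry adds 5 more minutes (30, 35, 40, ...) -->
    <add key="CrawlRequestTimeoutInMinutes" value="30" />
    <add key="AdditionalRequestTimeInMinutesOnRetries" value="5" />
    <add key="MaximumNumberOfCrawlRequestsToCreatePerBatch" value="1000" />
    <add key="MaximumNumberOfCrawlThread" value="10" />
    <add key="HttpWebRequestRetries" value="5" />
    <!-- True: configure the crawl engine from the arachnode XML files rather than the arachnode tables -->
    <add key="OverrideDatabaseConfigurationWithXmlConfiguration" value="True" />
    <add key="MemoryCacheTimeLimitHours" value="8" />
    <!-- Leave False unless you need crawl data persisted; True impacts crawl performance -->
    <add key="InsertArachnodeCrawlDataIntoDatabase" value="False" />
  </appSettings>
</configuration>
```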
Additional Filters and Crawl Engine Settings
The crawl engine behaviors listed below can be modified in the arachnode configuration files. The XML structure of each file, and the rules specified for editing it, must be respected.
- Changing web page filters
- Adding or removing file extensions
- Adding new content types
- Honoring robots.txt
The following files can be found and modified in the Connector installation directory under the /Arachnode folder:
- cfg.ContentTypes.xml:
  - This file is used to add supported MIME types.
  - ID and Name values must be unique inside the file.
  - ContentTypeTypeID can have the following numbers as values:
    - 1 - Application
    - 2 - Audio
    - 4 - Images
    - 5 - Messages
    - 6 - Models
    - 7 - Multipart messages
    - 8 - Text
    - 9 - Videos
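For illustration, an entry in cfg.ContentTypes.xml might look like the sketch below. The ID, Name, and ContentTypeTypeID fields are the ones described above; the element names and values shown are hypothetical, so check an existing entry in your file for the exact layout.

```xml
<!-- Hypothetical entry: registers the PDF MIME type as an Application (ContentTypeTypeID 1) -->
<ContentType>
  <ID>25</ID>
  <Name>application/pdf</Name>
  <ContentTypeTypeID>1</ContentTypeTypeID>
</ContentType>
```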
- cfg.AllowedDataTypes.xml:
  - An XML file that matches supported MIME types with file extensions.
  - ID must be unique inside the file.
  - ContentTypeID must be a valid MIME type ID found in the cfg.ContentTypes.xml file.
  - FullTextIndexType must contain the file extension (.docx, .pdf).
  - Multiple extensions can be associated with a single MIME type by using a comma (,) as the separator character.
  - An extension can be associated with multiple MIME types by adding multiple entries, each one referencing a separate ID from the cfg.ContentTypes.xml file.
  - DiscoveryTypeID: values are set as one or more of the following numbers:
    - Values 0, 1, 2, 3, and 5 are assigned automatically depending on the results of the web request.
    - 4 - File
    - 6 - Image
    - 7 - Web page
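A matching entry in cfg.AllowedDataTypes.xml could then tie that MIME type to its extension, as in the hypothetical sketch below. Only the ID, ContentTypeID, FullTextIndexType, and DiscoveryTypeID fields come from the description above; the element names are illustrative.

```xml
<!-- Hypothetical entry: maps the ContentType with ID 25 (application/pdf)
     to the .pdf extension, discovered as a File (DiscoveryTypeID 4) -->
<AllowedDataType>
  <ID>40</ID>
  <ContentTypeID>25</ContentTypeID>
  <FullTextIndexType>.pdf</FullTextIndexType>
  <DiscoveryTypeID>4</DiscoveryTypeID>
</AllowedDataType>
```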
- cfg.DisallowedWords.xml:
  - Web pages that contain words from this file (in header, URL, or content) will be ignored.
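The internal layout of this file is not documented here; as a purely hypothetical sketch, each disallowed word would be its own entry, for example:

```xml
<!-- Hypothetical entry: any page whose header, URL, or content contains this word is ignored -->
<DisallowedWord>
  <Word>confidential</Word>
</DisallowedWord>
```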
- cfg.CrawlRules.xml:
  - A list of rules that can be applied at different points during web request processing to specify whether a URL is allowed.
  - It can be used to enable rules based on robots.txt, named anchors, query strings, and web request frequency.
- cfg.AllowedSchemes.xml
- cfg.DisallowedFileExtensions.xml
- cfg.DisallowedExtensions.xml
- cfg.DisallowedDomains.xml
- cfg.DisallowedSchemes.xml
- cfg.DisallowedHosts.xml
- dbo.DisallowedAbsoluteUris.xml