How to Configure the Website Connector Using Web.config
Setting "EnableSourceSystemAPILogging" to "True" will cause the connector to crash as it does not support this feature
The following settings can be changed in the web.config file; the default value for each is shown. A sample web.config excerpt appears after the list.
- MaxCrawlDepth: 0 (a value of 0 ignores max depth)
  - Indicates the depth at which the crawler stops following hyperlinks, starting from the specified website URL.
  - For example, a value of 2 means the connector crawls the root pages configured in the connection settings, plus any page linked from those root pages, and then stops; no further links are crawled.
  - Entering a value of 0 can result in the connector retrieving all web pages found across the site, which may result in longer crawl times.
  - MaxCrawlDepth does not override other filters.
- Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- UserAgent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24
- AcceptEncoding: gzip,deflate,sdch
- CrawlRequestTimeoutInMinutes: 30
  - Sets the crawl engine's request-processing timeout, in minutes.
- AdditionalRequestTimeInMinutesOnRetries: 5
  - Additional minutes added to the original CrawlRequestTimeoutInMinutes on each successive retry of an item that timed out. For example, if the first request times out after 30 minutes, the first retry waits 35 minutes, the second 40, and so on.
- MaximumNumberOfCrawlRequestsToCreatePerBatch: 1000
- MaximumNumberOfCrawlThread: 10
  - Sets the number of threads the crawl engine uses when processing hyperlinks.
- HttpWebRequestRetries: 5
- OverrideDatabaseConfigurationWithXmlConfiguration: True
  - When set to true, the crawl engine is configured from the arachnode configuration files instead of the configuration options in the arachnode tables.
- MemoryCacheTimeLimitHours: 8
  - Specifies how long an item remains in the memory cache if not removed by the connector.
- InsertArachnodeCrawlDataIntoDatabase: False
  - If set to true, stores and uses crawl information in the arachnode database.
  - Important: Enabling this setting impacts crawl performance.
- HyperLinkRegex:
  - Contains the default value of the hyperlink regex pattern, which is used if the pattern is removed from the Connection Configuration page.
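Taken together, a minimal sketch of how these keys might appear in the connector's web.config is shown below. It assumes the settings live in the standard <appSettings> section; the key names and default values are the ones listed above, but the surrounding structure may differ in your installation.

```xml
<configuration>
  <appSettings>
    <!-- 0 removes the depth limit; the crawler may retrieve every page it finds -->
    <add key="MaxCrawlDepth" value="0" />
    <!-- HTTP request headers sent by the crawl engine -->
    <add key="Accept" value="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" />
    <add key="UserAgent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24" />
    <add key="AcceptEncoding" value="gzip,deflate,sdch" />
    <!-- Base timeout per request; each retry adds 5 more minutes (30, 35, 40, ...) -->
    <add key="CrawlRequestTimeoutInMinutes" value="30" />
    <add key="AdditionalRequestTimeInMinutesOnRetries" value="5" />
    <add key="MaximumNumberOfCrawlRequestsToCreatePerBatch" value="1000" />
    <add key="MaximumNumberOfCrawlThread" value="10" />
    <add key="HttpWebRequestRetries" value="5" />
    <!-- True: configure the crawl engine from the arachnode XML files rather than the arachnode tables -->
    <add key="OverrideDatabaseConfigurationWithXmlConfiguration" value="True" />
    <add key="MemoryCacheTimeLimitHours" value="8" />
    <!-- Leave False unless you need crawl data persisted; True impacts crawl performance -->
    <add key="InsertArachnodeCrawlDataIntoDatabase" value="False" />
  </appSettings>
</configuration>
```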
Additional Filters and Crawl Engine Settings
The crawl engine behaviors listed below can be modified in the arachnode configuration files. The XML structure of each file, and the rules specified for editing it, must be respected.
- Changing web page filters
- Adding or removing file extensions
- Adding new content types
- Honoring robots.txt
The following files can be found and modified in the Connector installation directory under the /Arachnode folder:
- cfg.ContentTypes.xml:
  - This file is used to add supported MIME types.
  - ID and Name values must be unique inside the file.
  - ContentTypeTypeID can have the following numbers as values:
    - 1 - Application
    - 2 - Audio
    - 4 - Images
    - 5 - Messages
    - 6 - Models
    - 7 - Multipart messages
    - 8 - Text
    - 9 - Videos
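For illustration, an entry in cfg.ContentTypes.xml might look like the sketch below. The ID, Name, and ContentTypeTypeID fields are the ones described above; the element names and values shown are hypothetical, so check an existing entry in your file for the exact layout.

```xml
<!-- Hypothetical entry: registers the PDF MIME type as an Application (ContentTypeTypeID 1) -->
<ContentType>
  <ID>25</ID>
  <Name>application/pdf</Name>
  <ContentTypeTypeID>1</ContentTypeTypeID>
</ContentType>
```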
- cfg.AllowedDataTypes.xml:
  - An XML file that matches supported MIME types with file extensions.
  - ID must be unique inside the file.
  - ContentTypeID must be a valid MIME type ID found in the cfg.ContentTypes.xml file.
  - FullTextIndexType must contain the file extension (.docx, .pdf).
  - Multiple extensions can be associated with a single MIME type by using a comma (,) as the separator character.
  - An extension can be associated with multiple MIME types by adding multiple entries, each one referencing a separate ID from the cfg.ContentTypes.xml file.
  - DiscoveryTypeID: values are set as one or more of the following numbers:
    - Values 0, 1, 2, 3, and 5 are assigned automatically depending on the results of the web request.
    - 4 - File
    - 6 - Image
    - 7 - Web page
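A matching entry in cfg.AllowedDataTypes.xml could then tie that MIME type to its extension, as in the hypothetical sketch below. Only the ID, ContentTypeID, FullTextIndexType, and DiscoveryTypeID fields come from the description above; the element names are illustrative.

```xml
<!-- Hypothetical entry: maps the ContentType with ID 25 (application/pdf)
     to the .pdf extension, discovered as a File (DiscoveryTypeID 4) -->
<AllowedDataType>
  <ID>40</ID>
  <ContentTypeID>25</ContentTypeID>
  <FullTextIndexType>.pdf</FullTextIndexType>
  <DiscoveryTypeID>4</DiscoveryTypeID>
</AllowedDataType>
```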
- cfg.DisallowedWords.xml:
  - Web pages that contain words from this file (in header, URL, or content) will be ignored.
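The internal layout of this file is not documented here; as a purely hypothetical sketch, each disallowed word would be its own entry, for example:

```xml
<!-- Hypothetical entry: any page whose header, URL, or content contains this word is ignored -->
<DisallowedWord>
  <Word>confidential</Word>
</DisallowedWord>
```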
- cfg.CrawlRules.xml:
  - A list of rules that can be applied at different points during web request processing to specify whether a URL is allowed.
  - It can be used to enable rules based on robots.txt, named anchors, query strings, and web request frequency.
- cfg.AllowedSchemes.xml
- cfg.DisallowedFileExtensions.xml
- cfg.DisallowedExtensions.xml
- cfg.DisallowedDomains.xml
- cfg.DisallowedSchemes.xml
- cfg.DisallowedHosts.xml
- dbo.DisallowedAbsoluteUris.xml