Website Connector Troubleshooting

Logging

The Website Connector has a custom log in the Logs folder under the installation target location. By default, the logging level is set to INFO. To troubleshoot, set the log level to DEBUG in the Logging.config file in the installation folder.

To separate Connector logging from the crawling engine logging, edit the Logger.config file and add the following tag:

Logger.config
<logger name="WebCrawlerLoggerManager">
    <level value="<LogLevel>"/>
    <appender-ref ref="BufferingForwarder" />
</logger>

A separate log file for the engine can also be added by adding a new appender for the crawling engine logger.

To determine whether a file is restricted from the crawl, check the logs for IsDisallowedReason. Files may be excluded from the crawl because of restrictions applied to extensions, content type, content, scheme, domains, or hosts.
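The log check for IsDisallowedReason can be scripted. The following is a minimal Python sketch and is not part of the connector. It assumes the logs are plain-text files and that each restricted item is recorded on a line containing the string IsDisallowedReason; the exact log line format is not documented here. The function names find_disallowed and scan_log_file are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def find_disallowed(
    lines: Iterable[str], marker: str = "IsDisallowedReason"
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) for every line that mentions the marker."""
    for number, line in enumerate(lines, start=1):
        if marker in line:
            yield number, line.rstrip("\n")


def scan_log_file(path: Path) -> List[Tuple[int, str]]:
    """Scan one log file (assumed to be readable as UTF-8 text)."""
    with path.open(encoding="utf-8", errors="replace") as handle:
        return list(find_disallowed(handle))
```

As a usage example, looping over Path("Logs").glob("*.log") and calling scan_log_file on each file lists the matching lines per log file. The Logs path and the .log extension are assumptions; adjust them to the actual installation.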

Common troubleshooting remedies

100% CPU

If CPU usage reaches 100%, make sure there is a 100-millisecond delay between crawls by adding the following to the settings JSON: "crawlDelay": 100,

Out of Memory

If server memory consumption is between 90% and 100%, reduce the number of parallel crawls by 1 as follows:

  1. In the connector settings JSON, set the "ParallelCrawlsPerSite" setting to 1 less than its previous value.

  2. On the Web Service Connection page, set the "max concurrent requests" setting to 1 less than its previous value.

Once the above settings are applied, check whether the system load is acceptable, then increase the number of parallel crawls step by step until resource usage is at the desired level.
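The step-down in the connector settings JSON can be sketched in Python as follows. This is an illustration only: it assumes "ParallelCrawlsPerSite" is a top-level key in the settings JSON, and it adds a floor of 1 that the steps above do not mention. The function name step_down_parallel_crawls and the sample settings fragment are hypothetical. The "max concurrent requests" value is changed on the Web Service Connection page and is not covered here.

```python
import json


def step_down_parallel_crawls(settings: dict, key: str = "ParallelCrawlsPerSite") -> dict:
    """Return a copy of the settings with the parallel crawl count reduced by 1.

    The floor of 1 is an assumption of this sketch, not a documented rule.
    """
    updated = dict(settings)
    updated[key] = max(1, int(settings[key]) - 1)
    return updated


# Hypothetical settings fragment; a real settings file contains more keys.
settings = json.loads('{"crawlDelay": 100, "ParallelCrawlsPerSite": 4}')
print(json.dumps(step_down_parallel_crawls(settings), indent=2))
```

The function returns a new dictionary rather than modifying the input, so the previous value stays available when stepping the setting back up later.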

Worker Calculation

Use the following numbers to calculate the optimum number of workers:

  • Each worker can take 256 to 1000 MB of Java memory, plus 30 to 80 MB for NodeJS, plus 30 to 40 MB for Chromium.

  • On average, normal usage is 300 MB for java.exe + 70 MB for NodeJS + 30 MB for Chromium = 400 MB of cumulative memory per worker.
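The per-worker figures above can be turned into a rough estimate. The Python sketch below assumes the worker count is obtained by dividing the memory available to the connector by the per-worker footprint; this formula is not stated in the documentation. The function name worker_estimates and the 8192 MB budget are hypothetical.

```python
# Per-worker memory in MB, taken from the figures above.
MIN_MB = 256 + 30 + 30      # lightest case: 316 MB
AVG_MB = 300 + 70 + 30      # average normal usage: 400 MB
MAX_MB = 1000 + 80 + 40     # heaviest case: 1120 MB


def worker_estimates(available_mb: int) -> dict:
    """Estimate how many workers fit into the given memory budget (in MB)."""
    if available_mb < 0:
        raise ValueError("available_mb must be non-negative")
    return {
        "conservative": available_mb // MAX_MB,  # every worker at its heaviest
        "typical": available_mb // AVG_MB,       # average normal usage
        "upper_bound": available_mb // MIN_MB,   # every worker at its lightest
    }


print(worker_estimates(8192))
# prints {'conservative': 7, 'typical': 20, 'upper_bound': 25}
```

The budget passed in should be the memory left for the connector after the operating system and other services are accounted for, not the total server memory.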