Website Connector Troubleshooting

Logging

The Website Connector writes its own log to the Logs folder under the installation directory.

By default, the logging level is set to INFO.

If you want to troubleshoot, set the log level to DEBUG in the Logging.config file in the installation folder.
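As a minimal sketch of that change, assuming Logging.config uses log4net-style XML (which the logger snippet below suggests; the exact element names may differ in your version):

```xml
<!-- Logging.config: raise the logging level from INFO to DEBUG -->
<root>
  <level value="DEBUG" />
</root>
```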

To separate Connector logging from the crawling engine logging, edit the Logger.config file and add the following tag:

Logger.config
<logger name="WebCrawlerLoggerManager">
    <level value="<LogLevel>"/>
    <appender-ref ref="BufferingForwarder" />
</logger>

You can also write the engine output to a separate log file by defining a dedicated appender for the crawling engine logger.
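A sketch of what that could look like, assuming log4net-style configuration; the appender name, file path, and pattern below are illustrative, not shipped defaults:

```xml
<!-- Illustrative: route the crawling engine logger to its own rolling log file -->
<appender name="EngineFileAppender" type="log4net.Appender.RollingFileAppender">
  <file value="Logs\CrawlingEngine.log" />
  <appendToFile value="true" />
  <layout type="log4net.Layout.PatternLayout">
    <conversionPattern value="%date [%thread] %-5level %logger - %message%newline" />
  </layout>
</appender>

<logger name="WebCrawlerLoggerManager">
  <level value="DEBUG" />
  <appender-ref ref="EngineFileAppender" />
</logger>
```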

To determine whether a file is restricted from the crawl, check the logs for IsDisallowedReason.

Files may be excluded from the crawl due to restrictions on file extensions, content types, content, schemes, domains, and hosts.
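A quick way to surface those entries is to filter the log for IsDisallowedReason. This is a sketch, not part of the connector: the log path and line format are illustrative and should be adjusted to your installation.

```python
# Sketch: list log lines explaining why URLs were excluded from the crawl.
from pathlib import Path

def find_disallowed(log_path):
    """Return all log lines that mention IsDisallowedReason."""
    lines = Path(log_path).read_text(encoding="utf-8").splitlines()
    return [line for line in lines if "IsDisallowedReason" in line]
```

Running this against the connector log gives you every exclusion decision in one place, which makes it easier to spot a restriction (for example, on an extension or host) that is filtering out more content than intended.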

100% CPU

If 100% CPU usage is observed, add a 100-millisecond delay between crawl requests by adding the following to the settings JSON:

 "crawlDelay": 100,

Out of Memory

If server memory consumption reaches 90–100%, reduce the number of parallel crawls to 1:

  1. In the connector settings JSON, set: "ParallelCrawlsPerSite": 1

  2. On the Web Service Connection page set the "max concurrent requests" setting to 1.

Once the above settings are applied, verify that the system load has dropped, then gradually increase the number of parallel crawls until resource usage is at the desired level.
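Taken together, the file-based settings from the CPU and memory sections can be sketched in the settings JSON like this (other keys are omitted, and the exact placement may vary by version):

```json
{
  "crawlDelay": 100,
  "ParallelCrawlsPerSite": 1
}
```

Note that the "max concurrent requests" setting is configured on the Web Service Connection page, not in this file.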

Worker Calculation

Use these numbers to calculate the optimal number of workers:

  • Each worker can take 256–1000 MB of Java memory, plus 30–80 MB for NodeJS and 30–40 MB for Chromium.

  • On average, normal usage is 300 MB for java.exe + 70 MB for NodeJS + 30 MB for Chromium = 400 MB of cumulative memory per worker.
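Using the average figure above (about 400 MB per worker), a back-of-the-envelope sizing calculation can be sketched as follows; this is an illustration of the arithmetic, not an official sizing tool:

```python
# Sketch: estimate how many crawler workers fit in the memory you can spare.
# Figures are the averages above: 300 MB Java + 70 MB NodeJS + 30 MB Chromium.
JAVA_MB, NODE_MB, CHROMIUM_MB = 300, 70, 30
PER_WORKER_MB = JAVA_MB + NODE_MB + CHROMIUM_MB  # 400 MB per worker on average

def max_workers(available_mb, per_worker_mb=PER_WORKER_MB):
    """Number of workers that fit in the given memory budget (at least 1)."""
    return max(1, available_mb // per_worker_mb)
```

For example, with 8 GB (8192 MB) set aside for the connector, max_workers(8192) gives 20 workers. Since per-worker usage can range well above the average (up to roughly 1 GB of Java memory alone), treat the result as an upper bound and leave headroom.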