About the Website Connector

The Website connector is used to crawl web pages and documents from any given website.

  • Starting from a given URL, the connector crawls the page and recursively indexes the URLs it finds there.
  • The Website connector supports several authentication mechanisms for accessing websites.
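
The recursive crawl described above can be pictured as a breadth-first traversal. The sketch below is only an illustration and not the connector's actual implementation: the SITE dictionary, the crawl function, and the site_links helper are hypothetical names, and a real crawl would fetch and render each page instead of reading links from a dictionary.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links found on that page.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}


def crawl(start_url, fetch_links):
    """Return URLs in the order they are visited, starting from start_url."""
    visited = []
    seen = {start_url}          # URLs already queued, so each page is visited once
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        visited.append(url)     # a real crawler would index the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def site_links(url):
    """Stand-in for fetching a page and extracting its links (assumption)."""
    return SITE.get(url, [])


order = crawl("https://example.com/", site_links)
print(order)
```

The seen set is what keeps the traversal from looping when pages link back to each other, as the start page and page "a" do here.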

Authentication

The Website connector accesses sites using the following authentication methods:

Capabilities and Limitations

  • The connector honors the robots.txt and sitemap files, if found.
  • The connector renders the pages crawled and executes any JavaScript found.
    • As a result, the time taken to crawl each page can be significant, and the crawl speed varies greatly depending on the complexity of the pages to index.
    • The connector decides that a page load is complete based on the networkidle0 property, which considers navigation finished when there have been no network connections for at least 500 ms.
  • The connector does not crawl links that contain a hash mark ( # ), also known as a number sign or pound sign.
  • Links are collected by calling querySelectorAll('a[href]') on a page.
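
The link-handling rules above can be approximated with a short sketch. It is not the connector's code: it uses Python's standard HTML and robots.txt parsers in place of a headless browser, the names AnchorCollector and crawlable_links are hypothetical, and it assumes that any href containing "#" is skipped entirely and that robots.txt rules are applied to each collected link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class AnchorCollector(HTMLParser):
    """Collects href values from <a> tags, similar to querySelectorAll('a[href]')."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def crawlable_links(page_url, html, robots, user_agent="*"):
    """Return absolute links that have no '#' and are allowed by robots.txt."""
    collector = AnchorCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        if "#" in href:          # assumption: hash links are dropped, not trimmed
            continue
        absolute = urljoin(page_url, href)
        if robots.can_fetch(user_agent, absolute):
            links.append(absolute)
    return links


# Parse robots.txt rules from memory so the example needs no network access.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

html = """
<a href="/docs/intro">Intro</a>
<a href="/docs/intro#setup">Setup</a>
<a href="#top">Top</a>
<a href="/private/report">Report</a>
<a name="anchor-only">No href</a>
<button onclick="deleteItem()">Delete</button>
"""
links = crawlable_links("https://example.com/index.html", html, robots)
print(links)
```

Only the first anchor survives: two links contain a hash mark, one is disallowed by the sample robots.txt rules, one anchor has no href, and the button is not an anchor at all.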

Web Applications

When indexing web applications with the connector, make sure the account the connector uses to crawl does NOT have write permission.
Because all pages are rendered in a headless browser before indexing, the connector may detect links that trigger actions such as add, edit, or delete, and accidentally trigger them.
Alternatively, implement any add, edit, or delete action via JavaScript click events rather than HTML A tags.