About the Website Connector
The Website connector is used to crawl web pages and documents from any given website.
- Starting from a given URL, the connector crawls the page and recursively indexes the URLs it finds on it.
- The Website Connector supports various authentication mechanisms when accessing websites.
Authentication
The Website connector accesses sites using the following authentication methods:
- Public access
- Basic login
- Trusted certificate authentication
- OAuth-based authentication (OAuth specifies a process for resource owners to authorize third-party access to their server resources without sharing credentials)
- NTLM authentication
Capabilities and Limitations
- The connector honors the robots.txt and sitemap files, if found.
- The connector renders the pages crawled and executes any JavaScript found.
- As a result, the time taken to crawl each page can be significant, and the crawl speed varies greatly depending on the complexity of the pages to index.
- Page load is considered complete based on the networkidle0 property, which tells the connector to consider navigation finished when there are no more than 0 network connections for at least 500 ms.
- The connector does not crawl links that contain a hash mark ( # ), also known as a number sign or pound sign.
- Links are collected by calling querySelectorAll('a[href]') on a page.
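The two link-handling rules above can be combined into a short sketch. This is a hypothetical illustration, not the connector's actual code: it assumes the connector gathers every href matched by querySelectorAll('a[href]') and then skips any link containing a hash mark. The DOM call is replaced by a plain array so the sketch runs outside a browser.

```javascript
// Hypothetical sketch of the link-collection step. In a real page the
// hrefs would come from document.querySelectorAll('a[href]').
function collectCrawlableLinks(hrefs) {
  // The connector skips links containing a hash mark ( # ).
  return hrefs.filter((href) => !href.includes('#'));
}

const found = [
  '/docs/start',
  '/docs/start#install',    // skipped: contains a hash mark
  'https://example.com/a',
];
collectCrawlableLinks(found); // → ['/docs/start', 'https://example.com/a']
```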
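The networkidle0 heuristic described above can also be sketched in isolation. The function below is a simplified stand-in for the browser's own tracking (the names makeIdleTracker, requestStarted, and requestFinished are invented for this illustration): navigation is considered finished once the count of in-flight network connections has been zero continuously for at least 500 ms.

```javascript
// Hypothetical model of the networkidle0 heuristic: "finished" means
// zero in-flight connections for at least idleThresholdMs.
function makeIdleTracker(idleThresholdMs = 500) {
  let inflight = 0;
  let zeroSince = null; // timestamp when inflight last dropped to 0

  return {
    requestStarted() {
      inflight += 1;
      zeroSince = null; // any new connection resets the idle window
    },
    requestFinished(now) {
      inflight -= 1;
      if (inflight === 0) zeroSince = now;
    },
    isNavigationFinished(now) {
      return (
        inflight === 0 &&
        zeroSince !== null &&
        now - zeroSince >= idleThresholdMs
      );
    },
  };
}
```

This models why crawl time depends on page complexity: every late-loading resource resets the 500 ms idle window, so pages that keep opening connections delay the "navigation finished" decision.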
Web Applications
When indexing web applications with the connector, make sure the account the connector uses to crawl has no write permissions.
Since all pages are rendered in a headless browser before indexing, any link that triggers an action such as add, edit, or delete may be discovered by the connector and accidentally executed.
To prevent this, implement any add/edit/delete action via JavaScript click events rather than HTML A tags.
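The reasoning follows from how links are collected: the connector only gathers elements matched by querySelectorAll('a[href]'), so an action bound to a click event on a non-link element is never followed during a crawl. The check below mirrors that selector on a plain object description of an element (a stand-in for a real DOM, so the sketch runs outside a browser); the element shapes and the isCrawlableLink name are invented for this illustration.

```javascript
// Hypothetical mirror of the a[href] selector: an element is collected
// by the crawler only if it is an anchor tag with an href attribute.
function isCrawlableLink(el) {
  return el.tag === 'a' && el.attrs != null && 'href' in el.attrs;
}

// An anchor exposing a destructive action as a URL is collected:
isCrawlableLink({ tag: 'a', attrs: { href: '/items/42/delete' } }); // true

// A button wired to a JavaScript click handler is not:
isCrawlableLink({ tag: 'button', attrs: { 'data-action': 'delete' } }); // false
```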