About the Website Connector
The Website connector is used to crawl web pages and documents from any given website.
- Starting from a given URL, the connector crawls the page and recursively indexes the URLs it finds on it.
- The Website Connector supports various authentication mechanisms when accessing websites.
Authentication
The Website connector accesses sites using the following authentication methods:
- Public access
- Basic login
- Trusted certificate authentication
- OAuth-based authentication (OAuth specifies a process for resource owners to authorize third-party access to their server resources without sharing credentials)
- NTLM authentication
Capabilities and Limitations
- The connector honors the robots.txt and sitemap files, if found.
- The connector renders the pages crawled and executes any JavaScript found.
- As a result, the time taken to crawl each page can be significant, and the crawl speed varies greatly depending on the complexity of the pages to index.
- Page load is considered complete based on the networkidle0 property, which tells the connector to consider navigation finished when there are no more than 0 network connections for at least 500 ms.
- The connector does not crawl links that contain a hash mark ( # ), also known as a number sign or pound sign.
- Links are collected by calling querySelectorAll('a[href]') on a page.
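The two link-handling rules above can be combined into a short sketch. This is a hypothetical illustration, not the connector's actual code: it assumes the connector gathers every href matched by querySelectorAll('a[href]') and then skips any link containing a hash mark. The DOM call is replaced by a plain array so the sketch runs outside a browser.

```javascript
// Hypothetical sketch of the link-collection step. In a real page the
// hrefs would come from document.querySelectorAll('a[href]').
function collectCrawlableLinks(hrefs) {
  // The connector skips links containing a hash mark ( # ).
  return hrefs.filter((href) => !href.includes('#'));
}

const found = [
  '/docs/start',
  '/docs/start#install',    // skipped: contains a hash mark
  'https://example.com/a',
];
collectCrawlableLinks(found); // → ['/docs/start', 'https://example.com/a']
```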
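The networkidle0 heuristic described above can also be sketched in isolation. The function below is a simplified stand-in for the browser's own tracking (the names makeIdleTracker, requestStarted, and requestFinished are invented for this illustration): navigation is considered finished once the count of in-flight network connections has been zero continuously for at least 500 ms.

```javascript
// Hypothetical model of the networkidle0 heuristic: "finished" means
// zero in-flight connections for at least idleThresholdMs.
function makeIdleTracker(idleThresholdMs = 500) {
  let inflight = 0;
  let zeroSince = null; // timestamp when inflight last dropped to 0

  return {
    requestStarted() {
      inflight += 1;
      zeroSince = null; // any new connection resets the idle window
    },
    requestFinished(now) {
      inflight -= 1;
      if (inflight === 0) zeroSince = now;
    },
    isNavigationFinished(now) {
      return (
        inflight === 0 &&
        zeroSince !== null &&
        now - zeroSince >= idleThresholdMs
      );
    },
  };
}
```

This models why crawl time depends on page complexity: every late-loading resource resets the 500 ms idle window, so pages that keep opening connections delay the "navigation finished" decision.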
Web Applications
When indexing web applications with the connector, make sure the account the connector uses to crawl has no write permissions.
Since all pages are rendered in a headless browser before indexing, any link that triggers an action such as add, edit, or delete may be discovered by the connector and accidentally executed.
To prevent this, implement any add/edit/delete action via JavaScript click events rather than HTML A tags.
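The reasoning follows from how links are collected: the connector only gathers elements matched by querySelectorAll('a[href]'), so an action bound to a click event on a non-link element is never followed during a crawl. The check below mirrors that selector on a plain object description of an element (a stand-in for a real DOM, so the sketch runs outside a browser); the element shapes and the isCrawlableLink name are invented for this illustration.

```javascript
// Hypothetical mirror of the a[href] selector: an element is collected
// by the crawler only if it is an anchor tag with an href attribute.
function isCrawlableLink(el) {
  return el.tag === 'a' && el.attrs != null && 'href' in el.attrs;
}

// An anchor exposing a destructive action as a URL is collected:
isCrawlableLink({ tag: 'a', attrs: { href: '/items/42/delete' } }); // true

// A button wired to a JavaScript click handler is not:
isCrawlableLink({ tag: 'button', attrs: { 'data-action': 'delete' } }); // false
```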