Website Connector Prerequisites

Connector Features and Requirements

Feature | Supported | Additional Information
Searchable content types | Yes

All content types.

Meta tags found in HTML documents can be extracted via Connectivity Hub.

See the Connectivity Hub documentation for how to configure your content sources for this.

Content Update | Full and Incremental

Because websites do not expose APIs that report changes, an incremental crawl rescans the entire website and requests every page. However, only updated pages are processed further.

Updated pages are identified by the ETag and Last-Modified HTTP headers. If either of these values has changed since the previous crawl, the page is considered updated.
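The header comparison described above can be sketched as follows. This is a minimal illustration, not the connector's actual implementation: the function names and the idea of keeping the previous crawl's headers in a dictionary are assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple


def _validators(headers: Mapping[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return the (ETag, Last-Modified) pair, matching header names case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("etag"), lowered.get("last-modified")


def page_is_updated(previous: Optional[Mapping[str, str]],
                    current: Mapping[str, str]) -> bool:
    """Decide whether a page needs further processing (hypothetical helper)."""
    if previous is None:
        # The page was never crawled before, so it must be processed.
        return True
    # A change in either header marks the page as updated.
    # Assumption: a page that sends neither header on both crawls is
    # reported as unchanged; the source does not state what the connector does.
    return _validators(previous) != _validators(current)
```

For example, a page whose ETag changes from "abc" to "xyz" is reported as updated even when its Last-Modified value stays the same.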

Permission Types | No

All content is indexed as public.

If you wish to assign security, you can do so via the ACL Script in the Content > Advanced tab.

Required Software | .NET Framework v4.7.2

Hardware

Rendering HTML web pages requires a large amount of CPU resources and memory.

BA Insight recommends the following hardware:

  • A server with a minimum of 5 GB of RAM and 8 CPU cores available to the connector, so that it can process sites correctly.

Authentication Protocols

The following authentication protocols are supported:

Authentication Protocol | Description | Prerequisite
Anonymous Access

The connector does not pass any authentication information to the web server.

None

HTTP Basic Authentication

The connector passes the username and password for authentication via the standard HTTP headers.

The username and password of the account to use for authentication.
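As an illustration of what HTTP Basic Authentication puts on the wire, the sketch below builds the Authorization header from a username and password as defined in RFC 7617. The function name is hypothetical, and the code only shows the standard header format, not how the connector itself is implemented.

```python
import base64
from typing import Dict


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build the standard HTTP Basic Authorization header (RFC 7617)."""
    # The credentials are joined with a colon and Base64-encoded.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Base64 is an encoding, not encryption, so the account's credentials are only protected in transit when the website is served over HTTPS.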

Azure AD Application

The connector interacts with Azure Active Directory to obtain a token and passes it in the HTTP Authorization header.

The website must be secured via Azure AD.

The connector requires:

  • ID of the Azure tenant where the website is deployed
  • Client ID of the application in Azure AD
  • A certificate used to obtain an Azure AD token. The certificate must be uploaded to the certificate store on the computer where the connector is installed.

oAuth Authentication

The connector interacts with the identity provider to obtain refresh, access and ID tokens to use for authentication.

The access and ID tokens are provided to a bootstrapping page on the website for initialization.

The application used by the website must be configured as follows:

  • Allow the authorization code flow with PKCE
  • Provide refresh, access and ID tokens
  • Add the Connector oAuth Redirect URL to the list of authorized redirect URLs. Typically: http://localhost:2406/oauthresult.aspx
    Note that the redirect URL is case-sensitive and must exactly match the URL through which the connector is accessed. The /oauthresult.aspx part of the URL is always in lower case.
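For readers unfamiliar with PKCE, the sketch below shows how a client derives the code verifier and code challenge defined in RFC 7636 (S256 method). It is a generic illustration of the flow the application must allow; the function names are hypothetical and the source does not describe how the connector generates these values.

```python
import base64
import hashlib
import secrets


def new_code_verifier() -> str:
    """Create a random URL-safe code verifier (43 characters, the RFC 7636 minimum)."""
    raw = secrets.token_bytes(32)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def code_challenge(verifier: str) -> str:
    """Derive the S256 code challenge: BASE64URL(SHA256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The client sends the challenge with the authorization request and the verifier with the token request, which lets the identity provider confirm that both requests come from the same client.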

The website must be modified to add an extra page that initializes the application for the purpose of crawling.

When the website is crawled, this page is called with the ID or access tokens passed via the URL. The page is then responsible for storing the necessary token in the right location, so that the crawling account is considered successfully authenticated and the browser does not prompt for authentication.
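The first step of such a bootstrapping page, reading the tokens from the incoming URL, might look like the sketch below. Everything specific here is an assumption for illustration: the source does not say whether the tokens arrive in the query string or elsewhere, nor what the parameter names are, so `id_token` and `access_token` are placeholders, as is the function name. Where the page then stores the token depends entirely on how the website handles authentication.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit

# Assumed parameter names; check how your connector actually passes the tokens.
TOKEN_PARAMETERS = ("id_token", "access_token")


def tokens_from_url(url: str) -> Dict[str, str]:
    """Extract the token parameters from the query string of the bootstrapping URL."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs returns a list per parameter; keep the first value of each token.
    return {name: query[name][0] for name in TOKEN_PARAMETERS if name in query}
```

A real page would typically do the equivalent in the website's own language, then place the token wherever the site expects it (for example a cookie or browser storage) before the crawl continues.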

Additionally, make sure you have the following information before starting the installation and configuration:

  • Client ID of the application used by the website
  • Authentication endpoint of the identity server