How to Set Up and Configure the Website Connector
All BA Insight connectors can be downloaded from the Upland Right Answers Portal under Knowledge > BA Insight > Product Downloads > Connectors. This connector is installed with the same generic steps as any BA Insight connector. You must satisfy the Prerequisites for your connector before installing. The configuration specifics are detailed below.
Choosing the Right Security Model for the Connector
Procedure:
- Select Connections from the top navigation menu to open the Connections page.
- Select New > Web service connection.
- The Connect to web service dialog appears.
- Enter your Website connector URL into the Web service URL field.
- You can retrieve your website service URL from IIS.
- After you enter your Web service URL, click the Connect button.
- The Services field appears. See the graphic below.
- WebSite Connector for public sites: Select this option to use no authentication.
- WebSite Connector for sites with basic login: Select this option to use basic authentication.
- WebSite Connector for sites with Trusted Certificate Authentication: Select this option to authenticate using an Azure AD application.
- WebSite Connector for sites with OAuth authentication: Select this option to authenticate using OAuth access or ID tokens.
- Select your Authentication mode.
- Note that your Service account should be used unless you require special considerations.
- This account DOES NOT REQUIRE you to enter a Login or Password.
- The following accounts must be granted access to the web service:
- The account used to run the Job service.
- The account running the ConnectivityHub Admin site.
- Click the Connect button at the bottom of the dialog box.
- Note that the Web service URL field in the Connection Info tab is now populated to reflect your authentication mode.
- Before proceeding, enter a name for your connection in the Title field.
Connection Configuration Specifics
Depending on the service you selected when you set up your connection, you must complete the applicable configuration steps:
- General Settings WebSite Connector for public sites
- General Settings WebSite Connector for sites with basic login
- General Settings WebSite Connector for sites with Trusted Certificate Authentication
- WebSite Connector for sites with OAuth authentication configuration
General Settings WebSite Connector for public sites
Procedure:
- Select the General Settings tab.
- Note that the fields you see are based on the Authentication mode you selected earlier.
- The General Settings tab shown below reflects the service "WebSite Connector for public sites."
- Max concurrent requests: Specify the maximum number of item data requests that can be processed in parallel by the connector. This field must be a positive number or be empty.
- Sites: These are the site URLs that will be crawled. The URLs must be in absolute form including scheme and domain.
- Use only user filters: Enable this checkbox to ignore content filters from Robots.txt.
- Allow sub domains crawl: Enable this checkbox to crawl pages from site subdomains.
Settings: The Website connector supports additional settings that control non-standard crawl behavior:
{
    "crawlDelay": 100,
    "timeout": 300000,
    "retryCount": 3,
    "retryDelay": 10000,
    "headers": {
        "X-API-Key": "abcdef12345"
    },
    "filters": {
        "allow": [
            "/"
        ],
        "disallow": [
            "*?utm_source"
        ]
    },
    "supportFileExtensions": [
        "pdf","doc","docx","rtf","txt","xls","xlsx","ppt","pptx"
    ],
    "parallelCrawlsPerSite": 2
}
- crawlDelay: This adds a delay, in milliseconds, before any call to the target site. The default value is 0.
- timeout: This is the single page crawl timeout in milliseconds. The default value is 5 minutes (300000 ms).
- retryCount: This specifies how many times the page will be retrieved in case of failure. The default value is 3.
- retryDelay: This specifies how long to wait, in milliseconds, between retries in case of failure. The default value is 10000 ms.
- headers: This allows you to specify additional headers in "key":"value" form.
- filters: This allows you to specify user-defined filters in standard Allow/Disallow Robots.txt format. For an example, refer to the robots.txt Wikipedia page. The paths are case sensitive and should be partial (not including the domain).
- supportFileExtensions: This specifies the supported file extensions for extracting files from the HTML source by URL.
- parallelCrawlsPerSite: This is the number of concurrent crawl operations per site. The default value is 1.
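Because each setting above has a documented default, the Settings JSON only needs the fields you want to override. For example, a minimal value that only slows the crawl and restricts file extraction to PDFs (hypothetical values) could be:
{
    "crawlDelay": 500,
    "supportFileExtensions": [
        "pdf"
    ]
}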
General Settings WebSite Connector for sites with basic login
- Select the General Settings tab.
- Note that the fields you see are based on the Authentication mode you selected earlier.
- The General Settings tab shown below reflects the service "WebSite Connector for sites with basic login."
- Username: This is the username for site basic authentication.
- Password: This is the password for site basic authentication.
- Sites: These are the site URLs that will be crawled. The URLs must be in absolute form, including scheme and domain.
- Use only user filters: Enable this checkbox to ignore content filters from Robots.txt.
- Allow sub domains crawl: Enable this checkbox to crawl pages from site subdomains.
Settings: The Website connector supports additional settings that control non-standard crawl behavior:
{
    "crawlDelay": 100,
    "timeout": 300000,
    "retryCount": 3,
    "retryDelay": 10000,
    "headers": {
        "X-API-Key": "abcdef12345"
    },
    "filters": {
        "allow": [
            "/"
        ],
        "disallow": [
            "*?utm_source"
        ]
    },
    "supportFileExtensions": [
        "pdf","doc","docx","rtf","txt","xls","xlsx","ppt","pptx"
    ],
    "parallelCrawlsPerSite": 2
}
- crawlDelay: This adds a delay, in milliseconds, before any call to the target site. The default value is 0.
- timeout: This is the single page crawl timeout in milliseconds. The default value is 5 minutes or 300000 milliseconds.
- retryCount: This specifies how many times the page will be retrieved in case of failure. The default value is 3.
- retryDelay: This specifies how long, in milliseconds, the connector will wait between retries in case of failure. The default value is 10000 milliseconds.
- headers: This specifies additional headers in a "key":"value" form.
- filters: These are user-defined filters in standard Allow/Disallow Robots.txt format. For an example, refer to the robots.txt Wikipedia page. The paths are case sensitive and should be partial (not including the domain).
- supportFileExtensions: This specifies the supported file extensions for extracting files from the HTML source by URL.
- parallelCrawlsPerSite: This is the number of concurrent crawl operations per site. The default value is 1.
General Settings WebSite Connector for sites with Trusted Certificate Authentication
- Select the General Settings tab.
- Note that the fields you see are based on the Authentication mode you selected earlier.
- The General Settings tab shown below reflects the service "WebSite Connector for sites with Trusted Certificate Authentication."
- Max concurrent requests: Specify the maximum number of item data requests that can be processed in parallel by the connector. This field must be a positive number or be empty.
- Sites: These are the site URLs that will be crawled. The URLs must be in absolute form including scheme and domain.
- Certificate, Client ID, and Tenant ID: All 3 of these settings must be added on new lines, in the following order (see the example below):
- Certificate: This is the distinguished name of the certificate used to authenticate to the websites.
IMPORTANT!
The user account running the Website Connector application pool must have READ access to the Trusted Root Certificate store on the local computer.
This user account cannot be "Network Service".
- Client ID: This is the Application ID of the client service.
- Tenant ID: This is the Tenant ID. For more information, see https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app
- Use only user filters: Enable this checkbox to ignore content filters from Robots.txt.
- Allow sub domains crawl: Enable this checkbox to crawl pages from site subdomains.
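For example, with a hypothetical certificate distinguished name and placeholder Azure AD IDs, the Certificate, Client ID, and Tenant ID field would contain three lines in this order:
CN=WebsiteCrawlerCert
00000000-0000-0000-0000-000000000000
11111111-1111-1111-1111-111111111111
Replace these with your actual certificate distinguished name, Application (client) ID, and Tenant ID.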
Settings: The Website connector supports additional settings that control non-standard crawl behavior:
{
    "crawlDelay": 100,
    "timeout": 300000,
    "retryCount": 3,
    "retryDelay": 10000,
    "headers": {
        "X-API-Key": "abcdef12345"
    },
    "filters": {
        "allow": [
            "/"
        ],
        "disallow": [
            "*?utm_source"
        ]
    },
    "parallelCrawlsPerSite": 2
}
- crawlDelay: This adds a delay, in milliseconds, before any call to the target site. The default value is 0.
- timeout: This is the single page crawl timeout in milliseconds. The default value is 5 minutes or 300000 milliseconds.
- retryCount: This specifies how many times the page will be retrieved in case of failure. The default value is 3.
- retryDelay: This specifies how long, in milliseconds, the connector will wait between retries in case of failure. The default value is 10000 milliseconds.
- headers: This specifies additional headers in a "key":"value" form.
- filters: These are user-defined filters in standard Allow/Disallow Robots.txt format. For an example, refer to the robots.txt Wikipedia page. The paths are case sensitive and should be partial (not including the domain).
- parallelCrawlsPerSite: This is the number of concurrent crawl operations per site. The default value is 1.
WebSite Connector for sites with OAuth authentication configuration
Obtain a valid OAuth token for crawling
When using the OAuth mode for the connector, you must first obtain a valid OAuth token by following the steps below:
- Browse to http://<connectorUrl>/oauth.aspx, where http://<connectorUrl> is the URL of the site where the connector is installed.
- Typically: http://localhost:2406
- Specify the OAuth client ID and the URL of the authentication endpoint to be used.
- Click Authorize. You are redirected to the authentication server to authenticate, as if you were logging on to the website you wish to crawl. After successfully authenticating, you are redirected back to the connector website so that it can capture the access and refresh tokens required for successful crawling.
Configure the Connection
- Once the tokens are obtained, you are ready to configure the connection.
- The same general configuration options as above apply to the OAuth authentication module. However, this module has two additional settings to specify:
- ClientID: Enter the same client ID that you specified when you obtained the OAuth token.
- Settings:
- The JSON contains the same settings as above.
- You must add an oAuthInitializationPage field to the JSON settings (see the example after this list).
- The value must be the URL of the page that initializes the website with the access or ID token provided by the connector.
- The initialization page must save the ID and/or refresh token in the appropriate place in the browser (local storage, cookie, etc.) so that both:
- The authentication completes.
- Any requests are considered valid by the server hosting the application.
- The syntax for this URL is as follows:
- http://<url of the website to crawl>/<custom page used to authenticate based on the token received>?access={AccessToken}&idToken={IdToken}
- Example: http://localhost:3000/oAuthInitialization?access={AccessToken}&idToken={IdToken}
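Putting this together, a Settings value for the OAuth module might look like the following sketch, using the hypothetical localhost example above; the other fields are optional and behave as described earlier:
{
    "crawlDelay": 100,
    "timeout": 300000,
    "oAuthInitializationPage": "http://localhost:3000/oAuthInitialization?access={AccessToken}&idToken={IdToken}"
}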
See the Okta configuration example: oAuth Setup Example
Content Configuration Specifics
The connector does not extract the meta tags from the HTML pages crawled.
- Instead, configure Connectivity Hub to extract the tags.
- Refer to the Connectivity Hub documentation on how to extract metadata from indexed documents.
Mime Types Configuration
This list of mappings is used to request specific mime types for supported file extensions (General Settings > Settings > supportFileExtensions). The default mime type configuration can be found in the MimeTypes section of the Web.config file in the installation folder. The mappings can be changed according to the following rules:
- The mimeType field is unique to the list, so each mapping must have a distinct mime type.
- The extensions field supports multiple values separated by a comma (for example: extensions="xls,xlt,xla"). These values do not need to be unique in the list and can be used to map multiple mime types.
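For illustration only, a mapping that follows these rules might look like the sketch below. The exact element names are assumptions based on the mimeType and extensions fields described above, so check the MimeTypes section of your installed Web.config for the actual markup:
<MimeTypes>
    <add mimeType="application/pdf" extensions="pdf" />
    <add mimeType="application/vnd.ms-excel" extensions="xls,xlt,xla" />
    <add mimeType="application/msword" extensions="doc,dot" />
</MimeTypes>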