How to Extract Plain Text from Documents

Some search engines, such as Azure Cognitive Search, accept only plain text when indexing content.

  • For these search engines, ConnectivityHub can use the Tika Text Extraction Service to extract the plain text from each document and submit that text to the search engine instead of the original document.
  • For installation instructions, see: Install the BA Insight Tika Text Extraction service.

The instructions below describe how to use the service:

How to Connect Your Content Source to the Tika Text Extraction Service

  1. Edit your content source and go to the Advanced page.

  2. Edit the "Enrichment pipeline integration" field and select Enrichment web service.

  3. Specify the URL of the Tika extraction service in the Service URL text box.
    The address is typically:
       http://localhost:7890/TikaCews.svc

  4. Set the "Properties returned" field to: body,Text,false
    Your configuration settings should resemble the graphic shown here:


  5. Click Save to apply your changes.
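
Before saving, it can help to sanity-check the two values entered in steps 3 and 4. The following is a minimal Python sketch, not part of ConnectivityHub or the Tika Text Extraction Service. The function names are hypothetical, the URL is the typical address given above and may differ in your installation, and the sketch assumes only that the "Properties returned" value is a three-part, comma-separated entry, as in the example; it does not interpret what each part means.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

# Values taken from steps 3 and 4 above; adjust them to your installation.
SERVICE_URL = "http://localhost:7890/TikaCews.svc"
PROPERTIES_RETURNED = "body,Text,false"


def describe_service_url(url):
    """Split the Service URL into scheme, host, port, and path (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an HTTP(S) URL: {url!r}")
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
    }


def split_properties_returned(value):
    """Split a 'Properties returned' entry into its three comma-separated parts.

    Assumption: the entry always has exactly three non-empty parts,
    as in the example "body,Text,false".
    """
    fields = [field.strip() for field in value.split(",")]
    if len(fields) != 3 or not all(fields):
        raise ValueError(f"expected three comma-separated parts: {value!r}")
    return tuple(fields)


def is_reachable(url, timeout=5.0):
    """Return True if something answers HTTP at the URL, whatever the status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, even though it returned an error status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, host not found, timeout, and similar failures.
        return False
```

Running is_reachable(SERVICE_URL) on the machine that hosts ConnectivityHub indicates only whether something is listening at that address; it does not confirm that the listener is the Tika Text Extraction Service.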

Create a New Metadata Property to Receive the Extracted Plain Text

  1. Go to the Content Sources page in Connectivity Hub.

  2. In the context menu for your content source, click Metadata...



  3. On the Metadata page, click New > Text Metadata.

  4. Enter the Metadata name (typically body).

  5. For the Value field, select "The value is calculated by an enrichment pipeline", then select body in the drop-down list that follows.



  6. Click Save.