How to Extract Plain Text from Documents
Some search engines such as Azure Cognitive Search only accept plain text data when indexing content.
- For these search engines, Connectivity Hub can use the Tika Text Extraction Service to extract the plain text and submit it to the search engine instead of the original document.
- To install the service, see "Install the BA Insight Tika Text Extraction service".
The instructions below describe how to use the service:
How to Connect Your Content Source to the Tika Text Extraction Service
- Edit your content source and go to the Advanced page.
- Edit the "Enrichment pipeline integration" field and select Enrichment web service.
- Specify the URL of the Tika extraction service in the Service URL text box.
The address is typically:
http://localhost:7890/TikaCews.svc
- Specify the "Properties returned" as follows:
body,Text,false
Your configuration settings should resemble the graphic shown here:
- Click Save to apply your changes.
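Before saving, it can help to sanity-check the two values entered in the steps above. The following Python sketch is not part of Connectivity Hub. It only checks that the Service URL is well formed, splits the "Properties returned" entry into its comma-separated fields, and optionally tests whether anything answers at the URL. The function names are hypothetical, and the reading of the three fields as (property name, type, flag) is an assumption, since the steps show only the literal value.

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urlsplit

SERVICE_URL = "http://localhost:7890/TikaCews.svc"  # typical address from the steps above
PROPERTIES_RETURNED = "body,Text,false"             # value from the steps above


def check_service_url(url):
    """Return (host, port, path) if the URL is a well-formed http(s) address."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not a usable service URL: {url!r}")
    return parts.hostname, parts.port, parts.path


def split_properties_returned(entry):
    """Split one 'Properties returned' entry into three comma-separated fields.

    Assumption: the fields are (property name, type, flag). The steps above
    show only the literal value, not what each field means.
    """
    fields = [field.strip() for field in entry.split(",")]
    if len(fields) != 3 or not all(fields):
        raise ValueError(f"expected three non-empty fields, got: {entry!r}")
    return tuple(fields)


def is_reachable(url, timeout=5.0):
    """Best-effort check that something answers at the URL.

    Any HTTP response counts as reachable, including error statuses,
    because this does not speak the service's actual protocol.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with an error status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False
```

Calling `check_service_url(SERVICE_URL)` and `split_properties_returned(PROPERTIES_RETURNED)` needs no network access. `is_reachable(SERVICE_URL)` is only meaningful when run on the machine that hosts Connectivity Hub, because the typical address points at `localhost`.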
Create a New Metadata Property to Receive the Extracted Plain Text
- Go to the Content Sources page in Connectivity Hub.
- In the context menu for your content source, click Metadata...
- On the Metadata page, click New > Text Metadata.
- Enter the Metadata name (typically body).
- For the Value field, select "The value is calculated by an enrichment pipeline" and select body in the subsequent drop-down list.
- Click Save.