AutoClassifier Components

You can add AutoClassifier components to your pipelines to enable specific functionality when the pipeline is triggered. The following table lists all of the AutoClassifier pipeline components that are currently available.

Component Description
Amazon Comprehend Medical This component analyzes medical text for entity extraction, sentiment analysis, and language detection based on AWS Comprehend services.
Amazon Comprehend NLP This component analyzes text for entity extraction, sentiment analysis, and language detection based on AWS Comprehend services.
Azure Document Intelligence This component allows you to leverage models created in Azure Document Intelligence Studio to extract relevant data from your documents.
Chunk Processor This component allows you to independently process the chunks that were created in a document chunker stage. Each chunk can be processed as a separate document and additional pipeline functionality can be applied to each chunk.
Content Enrichment This component allows you to integrate SharePoint Content Enrichment Web Services into your pipeline.
Custom Entity Extraction This component extracts entities from any text based on a list of entity names.
Custom Vision AI This component analyzes images and extracts detected tags based on trainable services from Custom Vision API.
Document Chunker This component allows you to break down your indexed content into smaller, more manageable segments so that search results are more accurate and relevant to the user's search query.
Duplicates Detection This component detects the degree of similarity between documents and can be used to group documents based on the similarity.
Email Processing This component automatically extracts email-related metadata from .msg and .eml files, including the following properties: To, From, Subject, Sent date, and Received date.
HTML Markup Cleaner This component removes all HTML markup tags from the configured metadata properties and returns plain text.
Image Extractor This component extracts images inside PDFs and Open XML documents (docx, xlsx, pptx).
Image Processor Amazon Rekognition This component analyzes text from images based on AWS Rekognition services.
Image Processor MS Computer Vision This component analyzes images and extracts detected text and concepts based on Microsoft Computer Vision API.
Item Sorter This component saves the metadata of processed items to disk and sorts the items by common values of a specific metadata property.
Language Detector This component detects multiple languages in documents using the NTextCat library.
LexisNexis This component calls the Lexis Search Advantage Classification Engine to extract legal metadata from content.
MeSH Tagger This component returns related PubMed articles and applies Medical Subject Headings (MeSH) term tags.
Metadata Filtering This component filters the metadata that is received and only allows the configured ones to be returned as output.
Metadata Name Sanitizer This component removes special characters from metadata names.
Metadata Singularization This component singularizes received metadata.
Metadata Values Capture This component captures and exports metadata values usage across the processed items.
Microsoft Computer Vision OCR This component analyzes images and extracts detected text and concepts based on Microsoft Computer Vision READ API.
Microsoft Text Analytics This component analyzes text for entity extraction, sentiment analysis, and language detection based on Microsoft Cortana Intelligence Suite services.
NLQ Metadata Capturer This component captures and processes metadata values to be used in NLQ processing.
Offline Processing This component allows for offline processing of documents and metadata. Use it for time-consuming algorithms that would otherwise impact crawl performance. Results are picked up on the next incremental crawl.
PACER Metadata Extractor This component extracts metadata specific to PACER court documents.
Recorder This component records the content and metadata of documents during a crawl. This can be used to collect content from source systems or to play back documents for testing and troubleshooting.
Regex Extractor This component provides a standardized approach for using regular expressions.
Rules Engine This component is used by BA Insight AutoClassifier to provide a rule-based way to automatically tag documents during crawling using content and metadata.
SciSpacy NER This component analyzes text with ScispaCy models and extracts named entities and their entity links.
Script This component allows you to define a script in C# or VB.NET to process document content and metadata. For example, you can add a script component to store information, extract additional metadata, or use external data sources such as databases.
Section Headers Extractor This component extracts section headers from documents based on regex patterns.
Section Information Extractor This component is used to identify document sections and extract specific information from each section using NLP.
Slide Title Extractor This component extracts the titles from PowerPoint documents (pptx).
Smart Previews This component is needed by BA Insight Smart Previews to generate crawl-time document previews. Smart Previews provides high-level preview functionality for search results in SharePoint.
SmartHub Best Bets Feeder This component feeds the SmartHub Best Bets engine with data at crawl time.
SmartHub QnA Document Feeder This component feeds the SmartHub Q&A provider index with data at crawl time.
SmartHub QnA Feeder This component feeds the SmartHub Q&A engine with data at crawl time.
Spacy NER This component analyzes text with spaCy models and extracts named entities, key phrases, and sentences.
Summary Generator This component generates a text summary based on important words or entities that are either provided or calculated.
Tag Threshold Limits This component applies thresholds to properties to limit the number of values returned.
Tika Extractor This component extracts the body and metadata from raw binary files. You can use this component when the calling source system is unable to extract the body and metadata, as classification stages require this data.
Video Processor Microsoft Video Indexer This component analyzes videos and extracts detected text and concepts based on Azure Video Indexer API.
West KM Metadata V6 Elastic This component calls the West KM Version 6 Legal Knowledge Management Classification Elastic Engine to extract legal metadata from content.
West KM Transactional V5 This component calls the West KM Legal Knowledge Management Classification Engine to extract legal metadata from content.
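
To make one of the rows above more concrete, the following is a minimal Python sketch of the kind of transformation the HTML Markup Cleaner entry describes: markup is removed from the configured metadata properties and plain text is returned, while other properties pass through unchanged. It is an illustration only and not the component's implementation. The function names (strip_markup, clean_properties) and the sample property names are hypothetical, and the sketch relies only on the Python standard library.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment, discarding tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(value):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(value)
    collector.close()  # flush any text still buffered by the parser
    # Collapse the whitespace runs that removed tags can leave behind.
    return " ".join("".join(collector.parts).split())


def clean_properties(metadata, configured):
    """Return a copy of metadata with markup removed from the configured properties only."""
    return {
        name: strip_markup(value)
        if name in configured and isinstance(value, str)
        else value
        for name, value in metadata.items()
    }


if __name__ == "__main__":
    # Hypothetical item metadata; only "Title" is configured for cleaning.
    item = {"Title": "<h1>Quarterly <em>Report</em></h1>", "Body": "<p>Draft</p>"}
    print(clean_properties(item, {"Title"}))
```

In this sketch only the properties named in the configured set are cleaned, which mirrors the "configured metadata properties" wording in the table. How the actual component is configured is not covered here.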