Configure the document chunker component

The document chunker component lets you split a document by specific parameters, breaking down and combining content into smaller, more manageable segments for improved processing and analysis. AutoClassifier provides configuration settings that let you specify how these document chunks are created. To configure your document chunker component, do the following:

If you do not want to use the document chunker component to create document chunks, you can create them in a custom scripting stage instead.

In the AutoClassifier administration portal, add a new component to a new or existing pipeline.
When adding your component, select Document Chunker from the New Component list.
In the Configuration section, specify the following settings:
1. In the File Extension Property Name field, specify a property name for the file extension.
2. In the Chunk Body Property Name field, specify a property name for the chunk body.
3. In the Max Text To Extract field, specify the maximum amount of text, in megabytes, that you want to extract from a document. The default value for this field is 2 MB. If the field is cleared and no value is specified, all of the text is extracted from the document.
  
  If this field is left empty, you may experience timeout issues in Connectivity Hub.
4. In the Properties To Include (Parent to Child) field, specify the parent property names that you want to include in child items in the document chunker response. You can specify multiple properties, separated by a semicolon (;). For example, Body;_RawData;etc.
5. In the Chunk Size field, specify the number of characters that each chunk will contain. The default value for this field is 4000 characters.
  
  If this field is configured to be less than 4000 characters, you may experience timeout issues in Connectivity Hub.
6. In the Chunk Overlap Size field, specify the percentage of characters that can overlap between two adjacent chunks. This percentage is based on the value configured in the Chunk Size field. The default value for this field is 10%.
7. In the Chunk Separators field, specify the characters that you want to use to separate chunks. When specified, a new chunk begins when one of these separators is identified after the Chunk Size has been reached. Chunk separators are delimited by a #. By default, this value is "\n# #.#", which separates chunks at a new line (\n), a space, or a period. This separator pattern ensures that chunks are not split in the middle of a word. The order in which the separators are written is important because it affects the results.
8. In the Microsoft Office Documents section, you can enable the following fields to add additional chunking capabilities.
  
  This section is only valid for Microsoft Office documents and will not apply to any other document types.
  1. Split By Chapter/Table of Contents: When this is enabled, Microsoft Word documents will be chunked by chapters or table of contents.
  2. Split By Slides: When this is enabled, Microsoft PowerPoint documents will be chunked by slides.
9. In the Excel Chunking Settings section, you can enable the following fields to add additional chunking capabilities for Microsoft Excel files.
  1. Enable the Split By Sheets field to chunk Microsoft Excel documents by sheets.
  2. In the No. of rows to extract as Headers field, you can specify the number of rows that you expect to be headers in your Excel sheet. These rows will be repeated in every chunk. By default, this field is set to 3.
  3. In the Excel Chunk Size field, you can specify the size of the chunks (in number of characters) for your Excel sheets. By default, this field is set to 50000 characters. For Excel files, this field overrides the value specified in the Max Text To Extract field.
  4. In the Extracted Rows format field, you can specify whether you want the extracted rows from your Excel sheets to be in HTML or tab-delimited format. By default, HTML is selected.
  Note the following when chunking Microsoft Office documents:
  - Microsoft Word:
    - Documents can only be chunked by up to three levels in the table of contents. If a document has items in the table of contents that exceed three levels, those items will be included in the third-level chunk.
    - Text from images or videos is not extracted.
    - Text from comments is not extracted.
    - Text from tracked changes is not extracted.
    - If you want your metadata to include the page number of the original document from which a chunk originated, you must configure the document intelligence component to extract the plain text from your documents before the document chunker is called. If you choose to include page numbers, you must disable the Split By Chapter/Table of Contents setting. For more information, see Integrate with document chunking capabilities in the document intelligence component documentation.
  - Microsoft PowerPoint:
    - Text from images is not extracted.
    - Text from slide notes or comments is not extracted.
  - Microsoft Excel:
    - Text from complex graphs or images is not extracted.
10. To extract metadata from the chunked document, enable the Extract Metadata field in the Metadata Settings section.
11. To extract metadata as JSON from the chunked document, enable the Extract Metadata field in the Metadata Settings section.
12. Click Save.
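
The way Chunk Size, Chunk Overlap Size, and Chunk Separators interact can be illustrated with a minimal Python sketch. This is not AutoClassifier's implementation. The function names parse_separators and chunk_text are hypothetical, and the sketch assumes that separators are tried in the order listed, that a chunk ends at the first match found at or after the Chunk Size boundary, and that the next chunk starts the overlap amount before that cut.

```python
def parse_separators(raw: str) -> list[str]:
    """Split a Chunk Separators value on '#'.

    Assumption: the literal two-character text '\\n' stands for a newline,
    and empty entries (such as one after a trailing '#') are ignored.
    """
    return [part.replace("\\n", "\n") for part in raw.split("#") if part != ""]


def chunk_text(text, chunk_size=4000, overlap_percent=10,
               separators=("\n", " ", ".")):
    """Hypothetical chunker: cut at a separator once chunk_size is reached."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Overlap is a percentage of the chunk size, as described for the field.
    overlap = chunk_size * overlap_percent // 100
    chunks = []
    start = 0
    while start < len(text):
        boundary = start + chunk_size
        if boundary >= len(text):
            # The remaining text fits in one chunk.
            chunks.append(text[start:])
            break
        # Try separators in their configured order; the first one that
        # occurs at or after the boundary decides where the chunk ends.
        cut = len(text)
        for sep in separators:
            pos = text.find(sep, boundary)
            if pos != -1:
                cut = pos + len(sep)
                break
        chunks.append(text[start:cut])
        if cut >= len(text):
            break
        # Step back by the overlap, but always move forward overall.
        start = max(cut - overlap, start + 1)
    return chunks


if __name__ == "__main__":
    seps = parse_separators("\\n# #.#")
    for chunk in chunk_text("aaaa bbbb cccc dddd", chunk_size=6,
                            overlap_percent=50, separators=seps):
        print(repr(chunk))
```

Under these assumptions, listing the new line first means that a chunk runs to the next line break whenever one exists after the boundary, which is one way the order of separators can change the result.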

Once you have configured the document chunker component, you can add additional components and scripts to enhance your chunked data. For more information, see Configure the chunk processor component and Alter chunks with a script component.

After you have chunked your data, you can configure your Connectivity Hub instance to crawl the chunked items in the enrichment pipeline settings for your content source. For more information, see How to configure your Azure Cognitive Search target in the Connectivity Hub documentation.

Output metadata properties

The Document Chunker returns the following metadata properties and values:

Property	Description
<ChunkBodyProperty>	This property displays the full text of the specific document chunk. The name of this property depends on the value that you configured in the Chunk Body Property Name field in step 2 of the Configuration settings.
ChunkBodyPropertyName	This property displays the name of the chunk body property that you specified in the Chunk Body Property Name field in step 2 of the Configuration settings.
ChunkPageNumber	This property displays the page number of the original document that the chunk was extracted from. Note the following: For Microsoft Word documents, this displays -1 because page numbers are not extracted. For Microsoft Excel documents, this displays the corresponding Excel sheet number. For Microsoft PowerPoint documents, this displays the corresponding slide number. If a document is not one of the types listed above, or if the corresponding Split By ... field is not enabled, the ChunkPageNumber value is -1, indicating that no page number is applicable. If you have also integrated Document Intelligence in your pipeline, the page number is retrieved by Document Intelligence even if the Split By ... fields are turned off.
ChunkWorkbookName	This property displays the name of your Microsoft Excel workbook. This output property is only available when the Split By Sheets checkbox is enabled.
ChunkWorkSheetName	This property displays the name of your Microsoft Excel sheet. This output property is only available when the Split By Sheets checkbox is enabled.
ChunkSlideTitle	This property displays the title of the PowerPoint slide. This output property is only available when the Split By Slides checkbox is enabled.
ChunkSlideSectionName	This property displays the section name of the PowerPoint slide. This output property is only available when the Split By Slides checkbox is enabled.
ChunkBodyForVectorization	This property provides the plain text of the ChunkBody property with the HTML tags removed. If the ChunkBody does not contain a valid HTML string, the same content is copied to ChunkBodyForVectorization unchanged.
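
The relationship between the chunk body and ChunkBodyForVectorization can be sketched in Python using the standard library's html.parser module. This is an illustration only, not the component's actual logic. The function name body_for_vectorization is hypothetical, and "does not contain a valid HTML string" is approximated here as "the parser saw no tags".

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and records whether any tag was seen."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.saw_tag = False

    def handle_starttag(self, tag, attrs):
        self.saw_tag = True

    def handle_data(self, data):
        self.parts.append(data)


def body_for_vectorization(chunk_body: str) -> str:
    """Hypothetical helper: return chunk_body with HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(chunk_body)
    parser.close()
    if not parser.saw_tag:
        # Assumption: no tags means the body is not HTML, so copy it as-is.
        return chunk_body
    # Join text nodes with spaces so adjacent table cells stay separated,
    # then collapse runs of whitespace. This simplification also inserts a
    # space where a tag splits a single word.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    html_rows = "<table><tr><td>Name</td><td>Qty</td></tr></table>"
    print(body_for_vectorization(html_rows))
    print(body_for_vectorization("plain text, no tags"))
```

A chunk produced from Excel rows in HTML format would follow the first path, while a tab-delimited or plain-text chunk would be copied unchanged.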