Configure the document chunker component

The document chunker component allows you to separate a document by specific parameters, breaking down and combining content into smaller, more manageable segments for improved processing and analysis. AutoClassifier provides configuration settings that allow you to specify how these document chunks are created.

If you do not want to use the document chunker component to create document chunks, you can create them in a custom scripting stage instead.

To configure your document chunker component, do the following:

  1. In the AutoClassifier administration portal, add a new component to a new or existing pipeline.

  2. When adding your component, select Document Chunker from the New Component list.

  3. In the Trigger section, select the triggers to determine when the component is called. The following triggers are enabled for the document chunker component by default:

    • Connector External Content Service

    • Connector Test Bench

    • External Access Point

    • Recommind

    • Suggestions

    • Connector Crawler

    • Connector Target

    • Custom

    • Playback or Testing

  4. In the Configuration section, specify the following settings:

    1. In the File Extension Property Name field, specify a property name for the file extension.

    2. In the Chunk Body Property Name field, specify a property name for the chunk body.

    3. In the Max Text To Extract field, specify the maximum amount of text, in megabytes, that you want to extract from a document. If no value is specified, all of the text is extracted from the document.

    4. In the Chunk Size field, specify the number of characters that each chunk will contain. The default value for this field is 4000 characters.

    5. In the Chunk Overlap Size field, specify the percentage of characters that can overlap between two adjacent chunks. This percentage is based on the value of the Chunk Size field configured in the previous step. The default value for this field is 10%.

    6. In the Chunk Separators field, specify the characters that you want to use to separate chunks. When separators are specified, a new chunk begins when one of them is found after the Chunk Size has been reached. Separators are delimited by a # character. The default value is "\n# #.#", which separates chunks at a new line (\n), a space, or a period. This separator pattern ensures that chunks are not split in the middle of a word. The order in which the separators are written is important because it affects the results.

    7. In the Microsoft Office Documents section, you can enable the following fields for additional chunking capabilities.

      This section is only valid for Microsoft Office documents and will not apply to any other document types.
      1. Split By Chapter/Table of Contents: When this is enabled, Microsoft Word documents will be chunked by chapters or table of contents.

      2. Split By Slides: When this is enabled, Microsoft PowerPoint documents will be chunked by slides.

      3. Split By Sheets: When this is enabled, Microsoft Excel documents will be chunked by sheets.

      Note the following when chunking Microsoft Office documents:
      • Microsoft Word:
        • Documents can only be chunked by up to three levels in the table of contents. If a document has items in the table of contents that exceed three levels, those items will be included in the third-level chunk.
        • Text from images or videos is not extracted.
        • Text from comments is not extracted.
        • Text from tracked changes is not extracted.
        • To include in your metadata the page number of the original document that each chunk came from, you must configure the document intelligence component to extract the plain text from your documents before the document chunker is called. If you choose to include page numbers, you must disable the Split By Chapter/Table of Contents setting. For more information, see Integrate with document chunking capabilities in the document intelligence component documentation.

      • Microsoft PowerPoint:
        • Text from images is not extracted.
        • Text from slide notes or comments is not extracted.
      • Microsoft Excel:
        • Text from complex graphs or images is not extracted.
        • Only the first chunk includes the header from the Excel file. Subsequent chunks do not include the header.
    8. To extract metadata from the chunked document, enable the Extract Metadata field in the Metadata Settings section.

    9. To extract metadata as JSON from the chunked document, enable the Extract Metadata field in the Metadata Settings section.

    10. Click Save.
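
To make the interaction between the Chunk Size, Chunk Overlap Size, and Chunk Separators fields easier to picture, the following Python sketch imitates the behavior described in step 4. It is only an illustration under stated assumptions, not the component's actual implementation: the function names are hypothetical, the overlap is assumed to be calculated as a percentage of the chunk size (so the defaults of 4000 characters and 10% would give 400 overlapping characters), and the sketch does not model how the order of the separators affects the results.

```python
def parse_separators(raw: str) -> list[str]:
    """Split a '#'-delimited Chunk Separators value into individual separators.

    Assumption: empty entries (such as the one after a trailing '#') are
    ignored, and the two characters backslash + n stand for a new line.
    """
    return [part.replace("\\n", "\n") for part in raw.split("#") if part]


def chunk_text(
    text: str,
    chunk_size: int = 4000,          # Chunk Size default from step 4
    overlap_percent: float = 10.0,   # Chunk Overlap Size default from step 4
    raw_separators: str = "\\n# #.#",
) -> list[str]:
    """Hypothetical chunker: cut at the first separator found once the
    chunk size has been reached, then start the next chunk slightly earlier
    so that adjacent chunks overlap."""
    separators = parse_separators(raw_separators)
    overlap = int(chunk_size * overlap_percent / 100)
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Chunk size reached: look for the nearest separator at or after `end`.
            hits = []
            for sep in separators:
                pos = text.find(sep, end)
                if pos != -1:
                    hits.append((pos, len(sep)))
            if hits:
                pos, sep_len = min(hits)
                end = pos + sep_len      # keep the separator in this chunk
            else:
                end = len(text)          # no separator left: take the rest
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by the overlap, but always move forward by at least one character.
        start = max(end - overlap, start + 1)
    return chunks
```

In this sketch only the end of each chunk is aligned to a separator. The overlapping start of the next chunk is a plain character offset, so it can fall inside a word; the product may handle this differently.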

Once you have configured the document chunker component, you can add additional components and scripts to enhance your chunked data. For more information, see Configure the chunk processor component and Alter chunks with a script component.

After you have chunked your data, you can configure your Connectivity Hub instance to crawl the chunked items in the enrichment pipeline settings for your content source. For more information, see How to configure your Azure Cognitive Search target in the Connectivity Hub documentation.

Output metadata properties

The Document Chunker returns the following metadata properties and values:

Property | Description
<ChunkBodyProperty> | The full text of the specific document chunk. The name of this property depends on the value that you configured in the Chunk Body Property Name field (step 4, substep 2).
ChunkBodyPropertyName | The name of the chunk body property that you specified in the Chunk Body Property Name field (step 4, substep 2).
ChunkPageNumber | The page number of the original document that the chunk was extracted from.
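
The relationship between these properties can be illustrated with a small Python sketch. It is a hypothetical representation only: the function name, the example property name "ChunkBody", the use of a dictionary, and the value types are all assumptions made for illustration, and the real output format of the component may differ.

```python
def build_chunk_metadata(body_property_name: str, chunk_text: str, page_number: int) -> dict:
    """Hypothetical helper that mirrors the three output properties.

    The chunk text is stored under whatever name was configured in the
    Chunk Body Property Name field, and that name is itself recorded in
    ChunkBodyPropertyName so that a later stage can find the text.
    """
    return {
        body_property_name: chunk_text,             # <ChunkBodyProperty>
        "ChunkBodyPropertyName": body_property_name,
        "ChunkPageNumber": page_number,
    }


# Example: "ChunkBody" is an assumed value for the Chunk Body Property Name field.
metadata = build_chunk_metadata("ChunkBody", "First paragraph of the document.", 1)

# A later stage can look up the chunk text without hard-coding the property name.
body = metadata[metadata["ChunkBodyPropertyName"]]
```

Reading the property name from ChunkBodyPropertyName first, as in the last line, keeps downstream logic working even if the configured name changes.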