Extracting the table of contents for metadata appendation
The Section Annotator component extracts section titles from a document’s table of contents and attaches them as metadata to each generated document chunk. This provides clear structural context for every chunk, improving downstream relevance, traceability, and explainability in search, retrieval, and AI-driven experiences.
Prerequisites
-
If you are using Azure OpenAI, you must create and deploy an Azure OpenAI service resource.
-
If you are using OpenAI, you must know the OpenAI model and the API key for your OpenAI resource. The model you use must be gpt-4o or later.
-
You must have the Document Chunker component configured, and the Section Annotator must follow it in the pipeline stage list.
Configure the component
To configure your component, do the following:
-
In the AutoClassifier administration portal, add a new component to a new or existing pipeline.
-
When adding your component, select Section Annotator from the New Component list and provide a Name for your component.
-
In the Select AI Provider field, select the AI provider you are using. The possible options are Azure OpenAI and OpenAI.
-
If you selected Azure OpenAI, complete the following:
-
In the Endpoint URL field, enter the endpoint URL for your Azure OpenAI service resource.
-
In the API Key field, enter the API key for your Azure OpenAI instance.
-
If you selected OpenAI, complete the following:
-
In the Model field, enter the model name for your OpenAI model.
-
In the API Key field, enter the API key for your OpenAI resource.
-
In the TOC Extraction Prompt field, enter the prompt that is sent to your AI resource to extract the table of contents from your documents. You may use the default prompt, or provide a custom prompt to better fit your organizational needs.
-
In the Section Lookup Pattern field, specify a regular expression to identify section headings in the document’s table of contents. The pattern determines how section names are detected and associated with document chunks so the relevant section title can be added as metadata. You may use the default pattern, or provide a custom pattern to better fit your organizational needs.
-
In the TOC Lookup Location field, use the drop-down menu to specify where in the document the table of contents is located. The possible values are Beginning, End, and Both.
-
In the TOC Character Count field, specify a value to represent the number of characters from the lookup location that the AI resource will consider for table of contents extraction. By default, this value is set to 15000.
-
In the Section Text Separator field, specify a string that will be inserted between the extracted section metadata and the chunk body text when both are combined into the ChunkBody. This separator helps clearly distinguish section context from the main content for downstream processing. If nothing is specified, the section name is appended directly to the chunk text without a separator.
-
In the Select the section format to apply in chunk body field, select an option to specify how extracted section metadata is written into the ChunkBody. You can select Flat Sections to include only the matched section name, or Structured Sections to preserve the full section hierarchy for additional context. For example, Flat Sections writes Access Control, whereas Structured Sections writes Security > Authentication > Access Control.
-
In the Structured Sections Format field, you can define a custom output format for structured section metadata when Structured Sections is selected. Use this to control how section hierarchy levels are combined and displayed in the ChunkBody.
-
In the Structured Section Path Separator field, you can specify the character or string used to separate section names in the ChunkBody when Structured Sections is selected.
-
In the TOC Property Name field, you can configure the property name that displays the table of contents value in your output metadata. By default, this is set to TableOfContents.
-
In the Chunk Section Property Name field, you can configure the property name that displays the section value in your output metadata. By default, this is set to ChunkSections.
-
In the Chunk Body Property Name field, enter the value of your chunk body property as specified when you configured your Document Chunker component.
-
Click Save.
-
In your pipeline, ensure that you position the Section Annotator component below your Document Chunker component.
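The following Python sketch illustrates how several of the settings above might interact. It is not the component's actual implementation. The class AnnotatorSettings and the functions toc_window and annotate_chunk are hypothetical names introduced only for this example. It also makes two assumptions that this guide does not confirm: that the Both lookup location combines the first and last characters of the document, and that the section label is written before the chunk text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AnnotatorSettings:
    """Hypothetical stand-in for a few Section Annotator fields."""
    toc_lookup_location: str = "Beginning"  # Beginning, End, or Both
    toc_character_count: int = 15000        # default value stated in this guide
    section_text_separator: str = ""        # empty means no separator
    section_format: str = "Flat"            # Flat or Structured
    path_separator: str = " > "             # only used for Structured


def toc_window(text: str, settings: AnnotatorSettings) -> str:
    """Return the slice of the document assumed to be searched for the TOC."""
    n = settings.toc_character_count
    if n <= 0:
        raise ValueError("toc_character_count must be positive")
    head = text[:n]
    tail = text[len(text) - n:] if n < len(text) else text
    location = settings.toc_lookup_location
    if location == "Beginning":
        return head
    if location == "End":
        return tail
    if location == "Both":
        # Assumption: Both means start plus end; skip duplication on overlap.
        return text if len(text) <= 2 * n else head + "\n" + tail
    raise ValueError(f"Unknown lookup location: {location}")


def section_label(path: Sequence[str], settings: AnnotatorSettings) -> str:
    """Flat keeps only the matched section; Structured keeps the hierarchy."""
    if settings.section_format == "Structured":
        return settings.path_separator.join(path)
    return path[-1]


def annotate_chunk(body: str, path: Sequence[str],
                   settings: AnnotatorSettings) -> str:
    """Combine the section label and chunk text (label first is an assumption)."""
    return section_label(path, settings) + settings.section_text_separator + body


if __name__ == "__main__":
    path = ["Security", "Authentication", "Access Control"]
    flat = AnnotatorSettings(section_text_separator="\n")
    structured = AnnotatorSettings(section_text_separator="\n",
                                   section_format="Structured")
    print(annotate_chunk("Users must sign in.", path, flat))
    print(annotate_chunk("Users must sign in.", path, structured))
```

With the Flat setting, the sketch writes only Access Control ahead of the chunk text. With the Structured setting, it writes Security > Authentication > Access Control. Leaving the separator empty joins the label and the chunk text directly, which is why specifying a Section Text Separator is usually worthwhile.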