Azure Document Intelligence component

The Azure Document Intelligence component integrates with the Microsoft Azure Document Intelligence Service to extract plain text and insights from documents using AI technologies. It automatically performs optical character recognition (OCR) on graphical information within documents, extracts specific structured fields from document content, processes tabular data, and outputs it in formats that are easily understandable and consumable by AI models in generative AI processes and vector search. It can easily consume both the built-in models of the Document Intelligence service as well as custom-trained models, providing flexibility in handling various document types and structures.

Additionally, this component seamlessly integrates with the Document Chunker component to offer a complete user experience in question-answering scenarios, such as chatbot conversations. This integration enhances the overall document chunking process, facilitating more efficient and accurate retrieval of information. For more information, see the Integrate with document chunking capabilities section.

Key Features

The component offers several features that overcome limitations of the Document Intelligence service:

  • Expanded file format support: The component extends Document Intelligence functionality beyond PDFs to Office file formats that the service does not natively support, such as PowerPoint presentations, Word documents, Excel spreadsheets, and OneNote files.

  • Selective field extraction with confidence scoring: The component offers control over which fields are extracted and allows you to set a confidence score threshold to ensure that only reliable data with a confidence level above a specific value is returned and consumed.

  • Enhanced plain text extraction with page information: When extracting plain text, the component indicates which page each part of the text comes from. This makes it easier to navigate to the correct location in a document and helps users determine where an answer to a question was sourced from.

  • Optimized table data processing: The component helps optimize costs and performance when running queries against table data by reformatting the table information extracted by the Document Intelligence service into a structure that consumes up to 10 times fewer tokens and is more easily understandable by large language models (LLMs).
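The idea behind the table optimization can be illustrated with a minimal sketch. It assumes a table shaped like the cell list that Document Intelligence returns (rowCount, columnCount, and cells with rowIndex, columnIndex, and content) and flattens it into pipe-delimited rows. The function name table_to_compact_text and the pipe-delimited layout are illustrative assumptions; the component's actual output format is not described in this article, and character count is used here only as a rough proxy for token count.

```python
import json


def table_to_compact_text(table: dict) -> str:
    """Flatten a cell-list table into pipe-delimited rows (hypothetical helper)."""
    # Start with an empty grid so that missing cells become empty strings.
    rows = [[""] * table["columnCount"] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        rows[cell["rowIndex"]][cell["columnIndex"]] = cell["content"]
    return "\n".join(" | ".join(row) for row in rows)


# Sample data invented for illustration only.
sample_table = {
    "rowCount": 2,
    "columnCount": 2,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
        {"rowIndex": 0, "columnIndex": 1, "content": "Price"},
        {"rowIndex": 1, "columnIndex": 0, "content": "Widget"},
        {"rowIndex": 1, "columnIndex": 1, "content": "4.50"},
    ],
}

compact = table_to_compact_text(sample_table)
verbose = json.dumps(sample_table)

print(compact)
print(len(verbose), "characters as JSON vs", len(compact), "characters compact")
```

The compact form drops the repeated key names that dominate the JSON representation, which is where most of the savings in a sketch like this come from.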

Prerequisites

You must complete the following prerequisites before setting up the Azure Document Intelligence component:

Limitations

Note the following limitations when using the Azure Document Intelligence component:

  • Only Azure Document Intelligence 3.1 is supported.

  • While numerous document types are supported, documents that contain specialized content, such as handwritten text or complex layouts, may not be recognized as accurately.

  • Custom model training requires a sufficient amount of labeled data. If the training set is too small or not diverse, the model's performance may suffer.

  • There are rate limits and quotas on API calls, which may impact high-volume applications. For more information on rate limits and quotas, see the Microsoft documentation.

  • While the service supports multiple languages, certain languages or dialects may have limited accuracy compared to others. For a list of supported languages, see the Microsoft documentation.

  • The quality of input images can significantly affect recognition accuracy. Low-resolution or distorted images may yield poor results.

  • The extraction output is primarily JSON, which may require additional processing for certain applications.

  • Office files (.doc/.docx, .xls/.xlsx, .ppt/.pptx, and .one) are not natively supported by Document Intelligence, so they are converted to .pdf internally. The conversion may not be exact, which can lead to inaccurate output.

  • Document Intelligence returns tables in JSON and HTML format, to which additional information, such as page number and context, is added. The context data can be inaccurate depending on the content before or after the table, for example, when two tables appear in a row or when there is no content after the table.

Configure the component

To configure your component, do the following:

  1. In the AutoClassifier administration portal, add a new component to a new or existing pipeline.

  2. When adding your component, select Azure Document Intelligence from the New Component list and provide a name for your component.

  3. In the Configuration section, expand General Settings and specify the following fields:

    1. In the Document Intelligence Endpoint field, enter the endpoint of your Azure Document Intelligence resource.

      1. To find your endpoint, in the Azure Portal, click on your Document Intelligence resource.

      2. In the left panel, click Keys and Endpoint and copy the value in the Endpoint field.

    2. In the Api Key field, enter the API key for your Azure Document Intelligence resource.

      1. To find your API key, in the Azure Portal, click on your Document Intelligence resource.

      2. In the left panel, click Keys and Endpoint and copy the value in the KEY field.

    3. In the Accepted Extensions field, enter a comma-separated list of the file extensions that you want the component to extract data from. By default, the following file extensions are accepted: .jpg, .jpeg, .jpe, .jif, .jfi, .jfif, .pdf, .png, .tif, .tiff, .docx, .pptx, .one, and .xlsx.

    4. In the Document Intelligence model field, select the model you want to use from the drop-down list. For this list to be populated, you must provide valid values for the Document Intelligence Endpoint and Api Key fields. You can select prebuilt Document Intelligence models or any custom models that you have created in Azure Document Intelligence Studio. To create a custom model, refer to the Azure Document Intelligence Studio documentation.

    5. Enable the Convert office files for processing field to allow the component to convert Microsoft Office files to .pdf for processing, as Azure Document Intelligence doesn't currently support Microsoft Office documents. This field is enabled by default.

  4. Expand the Analyze Options section and complete the following fields:

    1. Enable the Extract Text field to allow the component to extract text from the file. This field is enabled by default.

    2. In the Max Text Data to Extract (MB) field, specify the maximum size of text data that can be extracted, in megabytes. By default, this field is set to 2 MB.

    3. Enable the Extract Key Value Pairs field to allow the component to extract key-value pairs from the file.

    4. Enable the Extract Fields field to allow the component to extract query fields from the document. This field is enabled by default.

    5. Enable the Append Table as Child Items field to allow the component to append tables as child items. When this field is enabled, you must have a document chunker component integrated in your pipeline. For more information, see Integrate with document chunking capabilities.

    6. Enable the Extract Tables in HTML format field to allow the component to extract tables in HTML format from the document. This field is enabled by default.

    7. In the Whitelist Keys field, specify a comma-separated list of keys for key-value pairs and fields. The component will automatically filter and extract only those specified keys.

    8. In the Confidence Threshold field, specify a confidence score threshold. The component will only return metadata with a confidence score that meets or exceeds the specified threshold.

    9. In the Page Range field, specify a range of page numbers that you want to be analyzed.

    10. In the Max File Size to Process (MB) field, specify the maximum size of the file that you want to process. The default value for this field is 100 MB.

    11. In the Character before and after table field, specify the number of characters before and after each table that you want to extract as table context. The default value for this field is 100 characters.

    12. In the Optional Detection field, you can enable the following optional Azure features:

      1. Barcode: This field enables the Barcode extraction feature. This feature extracts all identified barcodes from your documents and returns them as separate metadata.

      2. Language: This field enables the Language extraction feature. This feature extracts the detected primary language for each text line from your documents and returns them as separate metadata.

    13. In the Premium Detection (charged additionally) field, you can enable the following premium Azure features:

      1. High Resolution: This field enables the High Resolution extraction feature. This feature can recognize small text from large-size documents, such as engineering drawings, with varying fonts, sizes, and orientations.

      2. Style Font: This field enables the Font property extraction feature. This feature extracts all font properties of text, such as font family, font style, font weight, and color, and returns them as separate metadata.

      3. Formulas: This field enables the Formula extraction feature. This feature extracts all identified formulas, such as mathematical equations, from your documents and returns them as separate metadata.

    14. Enable the Send raw response as metadata field to return the raw response from the Azure Document Intelligence resource as JSON metadata.
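The combined effect of the Whitelist Keys and Confidence Threshold fields can be sketched as a simple filter. This is an illustration only, not the component's implementation: the function filter_fields is hypothetical, the dictionary keys mirror the FieldName, FieldValue, and Confidence names used in the output, and two behaviors are assumptions: an empty whitelist is treated as "no key filter", and key matching is case-sensitive.

```python
def filter_fields(fields, whitelist_keys="", confidence_threshold=0.0):
    """Keep fields whose key is whitelisted and whose confidence meets the threshold."""
    # Parse the comma-separated whitelist, ignoring blanks and surrounding spaces.
    allowed = {key.strip() for key in whitelist_keys.split(",") if key.strip()}
    kept = []
    for field in fields:
        # Assumption: an empty whitelist means every key is allowed.
        if allowed and field["FieldName"] not in allowed:
            continue
        # "Meets or exceeds": a score equal to the threshold is kept.
        if field["Confidence"] < confidence_threshold:
            continue
        kept.append(field)
    return kept


# Sample data invented for illustration only.
fields = [
    {"FieldName": "InvoiceId", "FieldValue": "INV-100", "Confidence": 0.97},
    {"FieldName": "VendorName", "FieldValue": "Contoso", "Confidence": 0.42},
    {"FieldName": "Total", "FieldValue": "110.00", "Confidence": 0.80},
]

result = filter_fields(fields, "InvoiceId, VendorName", 0.8)
print([field["FieldName"] for field in result])
```

In this sample, VendorName is whitelisted but falls below the threshold, and Total meets the threshold but is not whitelisted, so only InvoiceId is returned.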

Output details

On the Pipeline Testing page, you can test your component configuration. When you do so, the following output properties will be visible:

Each output property is listed below with its description, followed by its metadata where applicable.

ExtractedText: This output property displays the complete plain text that was extracted from the file.

ExtractedTextWithPageNumber: This output property displays the complete plain text, as well as the page number metadata, that was extracted from the file. It contains the following metadata:

  • Page Number: This is the page number that corresponds to the plain text.

  • Plain Text: This is the entire extracted plain text.

AllExtractedFields: This output property displays all of the key-value pairs and fields that were extracted from the file. It contains the following metadata:

  • FieldName: This is the field name, or the key in the case of a key-value pair, that was extracted from the file.

  • FieldValue: This is the field value, or the value in the case of a key-value pair, that was extracted from the file.

  • Confidence: The confidence score is a number between 0 and 1. The higher the score, the more confident the model is that it is giving you a correct value in a specific field.

ExtractedTables: This output property displays all of the tables, as JSON or HTML, that were extracted from the file. It contains the following metadata:

  • Name: This is the table number (generally "Table 1", "Table 2", and so on) that was extracted from the file.

  • Context: These are the characters that were extracted before and after the table, based on the value given during configuration.

  • Data: This is the table data in JSON format.

ExtractedPageHeaders: This output property displays a list of page headers that were extracted from the file.

ExtractedPageFooters: This output property displays a list of page footers that were extracted from the file.
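As an illustration of how the page number metadata can help trace an answer back to its source, the sketch below searches page-level text for a phrase and returns the matching page numbers. The data shape is an assumption: this article does not specify how ExtractedTextWithPageNumber is serialized, so the list of dictionaries with PageNumber and PlainText keys, and the helper find_pages, are hypothetical.

```python
def find_pages(pages, phrase):
    """Return the page numbers whose text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [page["PageNumber"] for page in pages if needle in page["PlainText"].lower()]


# Hypothetical representation of ExtractedTextWithPageNumber; sample text invented.
pages = [
    {"PageNumber": 1, "PlainText": "Master services agreement between the parties."},
    {"PageNumber": 2, "PlainText": "Payment terms: invoices are due within 30 days."},
]

hits = find_pages(pages, "payment terms")
print(hits)
```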

Integrate with document chunking capabilities

You can add document chunking capabilities to your extracted Azure Document Intelligence content. The document chunker will automatically use the text extracted by the Document Intelligence component to chunk all of the document's text. If you are applying chunking to your documents that have been processed by Azure Document Intelligence, note the following:

  • The Azure Document Intelligence component must be added to the pipeline before the Document Chunker component.

  • The Document Intelligence metadata that the Document Chunker consumes to create document chunks is ExtractedText and ExtractedTextWithPageNumber.

  • Tables are generated as separate chunks by the Document Intelligence component. Table context, that is, the text before and after the table (configurable in length), is appended to those chunks to help with search and retrieval.

  • When using document chunking, you must enable the Append Table as Child Items field to ensure that document tables are displayed in the Document Chunker output.
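The table context described above can be pictured with a short sketch that takes a fixed number of characters on either side of a table's position in the extracted text. This is a simplified illustration under stated assumptions, not the component's algorithm: the helper table_context, the character offsets, and the [TABLE] placeholder are all hypothetical.

```python
def table_context(text, table_start, table_end, chars=100):
    """Return up to `chars` characters before and after a table's span in the text."""
    before = text[max(0, table_start - chars):table_start]
    after = text[table_end:table_end + chars]
    return before, after


# Sample text invented for illustration; "[TABLE]" marks where the table sits.
text = "Quarterly results follow. [TABLE] See notes below."
start = text.index("[TABLE]")
end = start + len("[TABLE]")

before, after = table_context(text, start, end, chars=10)
print(repr(before), repr(after))

# A table with nothing before or after it yields empty context.
print(table_context("[TABLE]", 0, 7))
```

The second call shows why context can be sparse or misleading when a table sits at the very start or end of a document, or directly next to another table.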