Generate vector embeddings for chunked content

With vector search, you can apply vector embeddings to your content to surface content that is thematically similar, even when it does not contain specific keywords. Vector search works best when vector embeddings are generated on smaller chunks of text; if a single embedding is generated for a large document, the relevancy and accuracy of the search results are reduced. AutoClassifier can both chunk your documents and generate vector embeddings for them, providing a robust solution for surfacing highly relevant search results.
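To illustrate why chunk-level embeddings surface more relevant results than a single document-level embedding, the sketch below compares a query's similarity to one whole-document embedding against its similarity to per-chunk embeddings. It uses the open-source sentence-transformers library purely as an example; AutoClassifier does not require this library, and the model name, sample text, and query are arbitrary illustrations.

```python
# Illustration only: query similarity against one whole-document embedding
# versus per-chunk embeddings. The library and model are stand-ins for
# whatever embedding provider you actually use.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

document = (
    "Quarterly revenue grew 12 percent, driven by the new subscription tier. "
    "Separately, the facilities team completed the office relocation, and the "
    "legal department finalized the updated data retention policy."
)
chunks = [
    "Quarterly revenue grew 12 percent, driven by the new subscription tier.",
    "The facilities team completed the office relocation.",
    "The legal department finalized the updated data retention policy.",
]
query = "What is the policy for retaining data?"

query_vec = model.encode(query, convert_to_tensor=True)
doc_vec = model.encode(document, convert_to_tensor=True)
chunk_vecs = model.encode(chunks, convert_to_tensor=True)

# One embedding for the whole document dilutes the retention-policy signal,
# while the matching chunk typically scores higher on its own.
print("whole document:", util.cos_sim(query_vec, doc_vec).item())
print("per chunk:     ", util.cos_sim(query_vec, chunk_vecs).tolist())
```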

Applying chunking capabilities to your data

AutoClassifier 7.0.0.0 introduced the ability to chunk your indexed files. Chunking breaks your content down into smaller, more manageable segments so that search results are more accurate and relevant to the user's query. For example, when a user searches for a specific term, AutoClassifier can use the separators you specify to chunk the content and surface the chunk that best matches the query. Note that applying chunking components to your indexed content changes the user search experience: search results return chunks of documents rather than whole documents.
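As a rough illustration of separator-based chunking with a character limit and overlap (the kinds of parameters the document chunker exposes), the following sketch is a generic Python implementation, not AutoClassifier's internal code; the function name, defaults, and behavior are assumptions for illustration only.

```python
def chunk_text(text: str, separator: str = "\n\n",
               max_chars: int = 1000, overlap_pct: int = 10) -> list[str]:
    """Split text on a separator, then pack the pieces into chunks of at most
    max_chars characters, carrying a short overlap from the previous chunk so
    that context spanning a chunk boundary is not lost."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    overlap_chars = int(max_chars * overlap_pct / 100)
    for piece in pieces:
        candidate = f"{current}{separator}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Seed the next chunk with the tail of the previous one as overlap.
        tail = current[-overlap_chars:] if current and overlap_chars else ""
        current = f"{tail}{separator}{piece}" if tail else piece
    if current:
        chunks.append(current)
    return chunks
```

Note that this sketch does not further split a single piece that already exceeds max_chars; a production chunker would also handle that case.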

To apply document chunking capabilities to your content sources, you must add the document chunker and chunk processor components to your pipeline.

Limitations

Chunking capabilities are currently only compatible with an Azure Cognitive Search target. For more information on setting up this target, see How to configure your Azure Cognitive Search target in the Connectivity Hub documentation.

AutoClassifier component flow

The following table shows the overall flow that you must configure to add vector embeddings to chunked content:

| Step | Additional information |
| --- | --- |
| 1. In your main pipeline that you use to receive documents, configure the document chunker component. | You can configure the chunking component to specify factors such as the number of characters per chunk, the overlap percentage between chunks, the maximum size of text extracted from documents, and whether to extract metadata from the file. |
| 2. In your main pipeline, configure the chunk processor component to call another pipeline that processes each chunk. | The chunk processor component lets you specify the pipeline used to process your chunks. Each chunk is then treated as if it were its own document, so you can apply an entire pipeline's functionality to further process and enhance your chunks. |
| 3. In the processing pipeline, configure a custom script component that applies vector embeddings to the chunks. | Use a scripting component to apply vector embeddings to the chunk body property that was specified when you configured the document chunker (see the sketch after this table). |
| 4. Test your pipelines to verify that you are receiving chunks and vector embeddings in your metadata. | Use the Pipeline Testing page to test your components and pipelines. After running a test, verify that you are receiving document chunks and vector embeddings in your output metadata properties. |
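As a very rough sketch of what the custom script step (step 3) does, the snippet below reads the chunk body, requests an embedding, and writes the vector back as a metadata property that the index mapping can pick up. AutoClassifier's scripting component and its API are not shown in this article, so the item dictionary, the property names ChunkBody and ChunkVector, and the use of the OpenAI embeddings client are all hypothetical placeholders; substitute the chunk body property you configured on the document chunker and whichever embedding service you use.

```python
# Hypothetical sketch of the embedding script's logic, not AutoClassifier code.
from openai import OpenAI  # any embedding provider could be used instead

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def add_chunk_embedding(item: dict) -> dict:
    """Read the chunk body, generate an embedding, and store the vector in a
    metadata property so it can be mapped to the search index."""
    chunk_body = item["ChunkBody"]  # hypothetical chunk body property name
    response = client.embeddings.create(
        model="text-embedding-3-small",  # example model name
        input=chunk_body,
    )
    item["ChunkVector"] = response.data[0].embedding  # hypothetical vector property
    return item
```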