Generate vector embeddings for chunked content

With vector search, you can apply vector embeddings to your content to categorize content that is thematically similar, even when it does not contain specific keywords. Vector search works best when embeddings are generated on smaller chunks of text; if a single embedding is generated for a large document, the relevance and accuracy of the search results suffer. AutoClassifier can both chunk your documents and generate vector embeddings for them, providing a robust solution for surfacing highly relevant search results.
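To see why chunk-level embeddings outperform a single whole-document embedding, consider this toy sketch. The `embed` function below is a simple bag-of-words stand-in for a real embedding model, and the document text is invented for illustration; it is not the AutoClassifier API.

```python
# Toy illustration: compare query similarity against one whole-document
# embedding vs. per-chunk embeddings. embed() is a bag-of-words placeholder
# for a real trained embedding model.
from collections import Counter
import math

def embed(text):
    """Toy embedding: a word-count vector (real systems use a trained model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

document = (
    "Quarterly revenue grew across all regions. "
    "Separately, the security team patched a critical vulnerability in the login service."
)
chunks = document.split(". ")

query = embed("critical security vulnerability")
whole_doc_score = cosine(query, embed(document))
best_chunk_score = max(cosine(query, embed(c)) for c in chunks)

# The matching chunk scores higher than the diluted whole-document embedding.
print(whole_doc_score < best_chunk_score)  # True
```

The unrelated revenue sentence dilutes the whole-document vector, so the query matches the security chunk more strongly than the document as a whole.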

Applying chunking capabilities to your data

AutoClassifier 7.0.0.0 introduced the ability to chunk your indexed files. Chunking breaks your content into smaller, more manageable segments so that search results are more accurate and relevant to the user's query. For example, when a user searches for a specific term, AutoClassifier can use specified separators to chunk the content and surface the chunk that best matches the query. Note that applying chunking components to your indexed content changes the user search experience: search results yield chunks of documents rather than whole documents.
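The following sketch approximates what separator-based chunking with a size limit and overlap might look like. The parameter names (`separator`, `max_chars`, `overlap_chars`) are illustrative, not actual AutoClassifier configuration settings.

```python
# Minimal sketch of separator-based chunking with a character limit and
# overlapping context between chunks. Parameter names are illustrative.

def chunk_text(text, separator=". ", max_chars=120, overlap_chars=30):
    """Split on a separator, then pack pieces into chunks of at most
    max_chars, carrying overlap_chars of trailing context into the next
    chunk. (A single piece longer than max_chars is kept whole.)"""
    pieces = text.split(separator)
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + separator + piece) if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Start the next chunk with overlapping context from the previous one.
            current = (current[-overlap_chars:] + separator + piece) if current else piece
    if current:
        chunks.append(current)
    return chunks

doc = ("AutoClassifier indexes content from many sources. "
       "Chunking splits each document into smaller segments. "
       "Each segment can then be embedded and searched independently.")
for c in chunk_text(doc, max_chars=80, overlap_chars=20):
    print(repr(c))
```

The overlap ensures that a phrase falling on a chunk boundary still appears intact in at least one chunk, which helps preserve relevance at the edges.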

To apply document chunking capabilities to your content sources, you must add the document chunker and chunk processor components to your pipeline.

Limitations

  • Chunking capabilities are currently compatible only with Azure Cognitive Search, AWS OpenSearch, and Elasticsearch targets. For more information on setting up these targets, see Configure your target.

  • If you are using the Document Chunker component in AutoClassifier, any properties that are defined by a script in Connectivity Hub will not appear in the index. You must define scripted properties in an AutoClassifier script, and specify the property in the Properties To Include field in the Document Chunker configuration settings.

AutoClassifier component flow

The following steps show the overall flow that you must configure to add vector embeddings to chunked content:

1. In your main pipeline that receives documents, configure the Document Chunker component.
   You can configure the chunking component to specify factors such as the number of characters per chunk, the overlap percentage between chunks, the maximum size of text extracted from documents, and whether to extract metadata from the file.

2. In your main pipeline, configure the Chunk Processor component to call another pipeline that processes each chunk.
   The Chunk Processor component lets you specify a pipeline to process your chunks. Each chunk is treated as if it were its own document, so an entire pipeline's functionality can be applied to further process and enhance the chunks.

3. In the processing pipeline, configure a custom script component that applies vector embeddings to the chunks.
   Use a scripting component to apply vector embeddings to the chunk body property that was specified when you configured the Document Chunker.

4. Test your pipelines to verify that you are receiving chunks and vector embeddings in your metadata.
   Use the Pipeline Testing page to test your components and pipelines. After running a test, verify that document chunks and vector embeddings appear in your output metadata properties.
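The scripting step above can be sketched as follows. This is a hedged illustration only: the property names (`chunk_body`, `chunk_vector`) and the `embed_model` function are hypothetical placeholders, not AutoClassifier or embedding-service APIs.

```python
# Sketch of a per-chunk scripting step: read the chunk body property and
# write an embedding back to the item's metadata. Property names and the
# embed_model placeholder are illustrative, not real AutoClassifier APIs.
import hashlib

def embed_model(text, dims=8):
    """Deterministic placeholder embedding; a real pipeline would call a
    hosted embedding model here instead."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dims]]

def process_chunk(item):
    """Simulate the chunk-processing pipeline: enrich the metadata
    dictionary with a vector for the chunk body."""
    body = item["chunk_body"]
    item["chunk_vector"] = embed_model(body)
    return item

chunk = {"id": "doc-42-chunk-3", "chunk_body": "patched a critical vulnerability"}
enriched = process_chunk(chunk)
print(len(enriched["chunk_vector"]))  # 8
```

When testing on the Pipeline Testing page, you would similarly confirm that each output item carries both the chunk body and its vector property.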