How to Generate Vector Embeddings with Scripting

This example shows how to generate vector embeddings in the AutoClassifier scripting component from the metadata of a processed document (for example, “body”).
  • This code is only a sample, intended as a guideline for generating vector embeddings with large language models (LLMs).

  • The code can be changed or reworked to work with any LLM and to consume any type of information as input for the embeddings generation model.

In conjunction with Connectivity Hub, this sample script can help push vector embeddings into search indexes that support vector fields.

  • The sample uses the “text-embedding-ada-002” model from OpenAI.

  • The sample uses the “body” metadata property.

    • For this to work, a Tika pipeline stage must be part of the processing pipeline, because it extracts the body of the documents.

Setup Steps

  1. Navigate to the management page of AutoClassifier Engine.

  2. Create or modify an online pipeline so that it contains the following two components:

    • Tika Extractor - To extract the body of the documents.

    • Script - To generate the embeddings.

Tip: You can use any embedding provider (large language model) that you want. The code below is just an example of extracting embeddings using OpenAI and its “text-embedding-ada-002” model.

  3. To configure the pipeline that generates the embeddings, provide values for the required settings:

    • namespaces - Add the namespaces in the list below to your scripting pipeline namespaces configuration:

Note: You also need to add the assemblies (DLLs) to which the namespaces below belong.

System.Text
System.Linq
System.Net.Http
Newtonsoft.Json
Newtonsoft.Json.Linq
    • apiKey - Add your API key.

    • apiUrl - Add the API URL.

      • Default value for OpenAI: https://api.openai.com/v1/embeddings

    • model - Add the name of the model used to generate the embeddings.

      • OpenAI recommends using the “text-embedding-ada-002” model.

    • properties - Provide the list of properties for which embeddings are generated, together with the name of the output property for each.
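
As a rough illustration, the settings above might be filled in as shown in the sketch below. The API key is a placeholder, and the “title” and “titleEmbeddings” property names are hypothetical, added only to show that more than one input-output pair can be listed. The URL and model name are the OpenAI defaults mentioned above.

```csharp
using System.Collections.Generic;

// Placeholder value: replace with your own API key.
string apiKey = "<your-api-key>";

// Default OpenAI embeddings endpoint and the model named in this guide.
string apiUrl = "https://api.openai.com/v1/embeddings";
string model = "text-embedding-ada-002";

// Each entry maps an input property to the output property that receives its vector.
// "title" and "titleEmbeddings" are hypothetical names used for illustration only.
Dictionary<string, string> properties = new Dictionary<string, string>
{
    { "body", "bodyEmbeddings" },
    { "title", "titleEmbeddings" }
};
```

With a mapping like this, the script would issue one embeddings request per entry for every processed item, so each extra property adds one API call per document.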

Sample Code

// configure the three variables below with customer-specific values
string apiKey = "";
string apiUrl = "";
string model = "";

//variable representing the properties to be processed for embeddings
//and their associated vector embedding output property; can be multiple values 
Dictionary<string, string> properties = new Dictionary<string, string> {{"body", "bodyEmbeddings"}};
 
foreach (KeyValuePair<string, string> property in properties)
{
    // iterating each input-output property set to generate the embeddings
    double[] embeddings;
    
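    //collecting the property value from the current item that is processed
    var text = item.Get<string>(property.Key);

    //skipping the API call when there is no text to embed and storing an empty vector instead
    if (string.IsNullOrWhiteSpace(text))
    {
        Log.Warn("No text found for property: " + property.Key);
        item.Set<List<double>>(property.Value, new List<double>());
        continue;
    }
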
    //preparing the request body for the OpenAI embeddings generation request 
    var request = new Dictionary<string, object>
        {
            { "model", model },
            { "input", text }
        };
 
    //setting up the HTTP call using the API key, model, endpoint and input text as part of the request body
    string jsonString;
    using (var httpClient = new HttpClient())
    {
        httpClient.DefaultRequestHeaders.Add("Authorization", "Bearer " + apiKey);
        var requestJson = JsonConvert.SerializeObject(request);
        var requestContent = new StringContent(requestJson, Encoding.UTF8, "application/json");
        var httpResponseMessage = httpClient.PostAsync(apiUrl, requestContent).ConfigureAwait(false).GetAwaiter().GetResult();

        //logging failed calls; an error response will not contain embeddings
        if (!httpResponseMessage.IsSuccessStatusCode)
        {
            Log.Warn("Embeddings request failed with status code: " + (int)httpResponseMessage.StatusCode);
        }

        //reading the response from the OpenAI API as a JSON string
        jsonString = httpResponseMessage.Content.ReadAsStringAsync().ConfigureAwait(false).GetAwaiter().GetResult();
    }
     
    try
    {
        //collecting the actual embeddings and transforming them into an array of doubles
        JObject jsonObject = JObject.Parse(jsonString);
        JArray embeddingArray = jsonObject["data"][0]["embedding"] as JArray;
        embeddings = embeddingArray.Select(e => (double)e).ToArray();
    }
    catch (Exception ex)
    {
        Log.Warn("Error parsing Embedding data: " + ex.Message);
        embeddings = new double[0];
    }
    
    //setting the embeddings as a new output property 
    item.Set<List<double>>(property.Value, embeddings.ToList());
}
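
To try the request and response handling outside the pipeline, the two JSON steps of the sample can be isolated as in the sketch below. It assumes a standalone C# project that references Newtonsoft.Json. The class and method names (EmbeddingPayloads, BuildRequestJson, ParseFirstEmbedding) are hypothetical and not part of AutoClassifier. The response shape is assumed to be the one the sample reads, with the vector under data[0].embedding.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Hypothetical helper class for illustration; not part of the product API.
public static class EmbeddingPayloads
{
    // Builds the JSON request body with the same two fields the sample sends.
    public static string BuildRequestJson(string model, string text)
    {
        var request = new Dictionary<string, object>
        {
            { "model", model },
            { "input", text }
        };
        return JsonConvert.SerializeObject(request);
    }

    // Extracts the first embedding vector from a response body.
    // Returns an empty array when the body is null or empty, is not valid JSON,
    // or has no data[0].embedding array (for example, an error payload).
    public static double[] ParseFirstEmbedding(string jsonString)
    {
        if (string.IsNullOrEmpty(jsonString))
        {
            return new double[0];
        }

        try
        {
            JObject root = JObject.Parse(jsonString);

            // SelectToken returns null rather than throwing when the path is missing.
            var embedding = root.SelectToken("data[0].embedding") as JArray;
            return embedding == null
                ? new double[0]
                : embedding.Select(e => (double)e).ToArray();
        }
        catch (JsonException)
        {
            // Malformed JSON is treated like a missing embedding.
            return new double[0];
        }
    }
}
```

For example, EmbeddingPayloads.BuildRequestJson("text-embedding-ada-002", "hello") produces {"model":"text-embedding-ada-002","input":"hello"}, and ParseFirstEmbedding returns an empty array for an error payload instead of throwing. This mirrors the empty-array fallback in the sample script.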