How to Generate Vector Embeddings Using Scripting
This code is a sample intended as a guideline for generating vector embeddings using large language models (LLMs).
The code can be changed or reworked to work with any LLM and to consume any type of information as input for the embedding generation model.
In conjunction with Connectivity Hub, this sample script can help push vector embeddings into search indexes that support vector fields.
- In this sample, the OpenAI model "text-embedding-ada-002" is used.
- The metadata property "body" is used as the input text.
- Because "body" is populated by text extraction, a Tika pipeline stage is required as part of the processing pipeline.
Setup Steps
- Navigate to the management page of AutoClassifier Engine.
- Create/modify an online pipeline to contain the following two components:
  - Tika Extractor - to extract the body of the documents.
  - Script - to generate the embeddings.
The code below is just an example of generating embeddings using OpenAI and its "text-embedding-ada-002" model.
In order to configure the pipeline that generates the embeddings, provide values for the required settings:
- namespaces - add the namespaces in the list below to your scripting pipeline namespaces configuration (for more details see: How to Use Custom Logic (Script)):
  - System.Text
  - System.Linq
  - System.Net.Http
  - Newtonsoft.Json
  - Newtonsoft.Json.Linq
- apiKey - add your API key.
- apiUrl - add the API URL. The default value for OpenAI is https://api.openai.com/v1/embeddings
- model - add the name of the model used to generate the embeddings. OpenAI recommends using the model "text-embedding-ada-002".
- properties - provide the list of input properties for which embeddings are generated and the names of the corresponding output properties; multiple pairs can be configured, as shown in the example below.
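For example, the properties dictionary used in the sample code can hold several input/output pairs to generate embeddings for more than one field. The "title"/"titleEmbeddings" pair below is a hypothetical mapping shown only for illustration; use whatever property names exist in your pipeline:

Dictionary<string, string> properties = new Dictionary<string, string>
{
    { "body", "bodyEmbeddings" },   // input property -> output vector property (from the sample)
    { "title", "titleEmbeddings" }  // hypothetical second mapping, for illustration only
};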
Sample Code
// configure the three variables below with the customer-specific data
string apiKey = "";
string apiUrl = "";
string model = "";

// variable representing the properties to be processed for embeddings
// and their associated vector embedding output property; can contain multiple entries
Dictionary<string, string> properties = new Dictionary<string, string> { { "body", "bodyEmbeddings" } };

foreach (KeyValuePair<string, string> property in properties)
{
    // iterate over each input/output property pair to generate the embeddings
    double[] embeddings;

    // collect the property value from the item currently being processed
    var text = item.Get<string>(property.Key);

    // prepare the request body for the OpenAI embeddings generation request
    var request = new Dictionary<string, object>
    {
        { "model", model },
        { "input", text }
    };

    // set up the HTTP call using the API key and endpoint, with the model and input text as part of the request body
    var httpClient = new HttpClient();
    httpClient.DefaultRequestHeaders.Add("Authorization", "Bearer " + apiKey);
    var requestJson = JsonConvert.SerializeObject(request);
    var requestContent = new StringContent(requestJson, Encoding.UTF8, "application/json");
    var httpResponseMessage = httpClient.PostAsync(apiUrl, requestContent).ConfigureAwait(false).GetAwaiter().GetResult();

    // read the response from the OpenAI API as a JSON-formatted string
    var jsonString = httpResponseMessage.Content.ReadAsStringAsync().ConfigureAwait(false).GetAwaiter().GetResult();

    try
    {
        // collect the actual embeddings and transform them into an array of doubles
        JObject jsonObject = JObject.Parse(jsonString);
        JArray embeddingArray = jsonObject["data"][0]["embedding"] as JArray;
        embeddings = embeddingArray.Select(e => (double)e).ToArray();
    }
    catch (Exception ex)
    {
        Log.Warn("Error parsing Embedding data: " + ex.Message);
        embeddings = new double[0];
    }

    // set the embeddings as a new output property on the item
    item.Set<List<double>>(property.Value, embeddings.ToList());
}
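For reference, the parsing block above assumes the embeddings endpoint returns a JSON document roughly shaped like the example below, and reads the vector from data[0].embedding. The numeric values shown are placeholders and the payload is truncated for readability:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0023, -0.0093, ...]
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}

If the call fails or the response cannot be parsed, the catch block logs a warning and writes an empty vector to the output property.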