How to Capture Metadata
Metadata Values Capture Component
This pipeline stage captures values for the configured metadata collected during item processing. The aim is to gain visibility into metadata value frequency and to use this information as Machine Learning training data.
The metadata values are sorted by number of occurrences. Stemming is considered when comparing values. After exporting the captured data as CSV, it is recommended that you perform a manual cleanup. Later, you can use the cleaned-up metadata to sort new items by metadata values and end up with Machine Learning training data sets.
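As a rough illustration of the counting described above, the following Python sketch tallies metadata values across items, treats values with the same stem as equal, and orders the result by number of occurrences. It is not the component's implementation: `naive_stem`, `capture`, the item dictionaries, and the `Topic` and `url` property names are all hypothetical, and the plural-stripping function is only a stand-in for whatever stemmer the product actually uses.

```python
from collections import Counter, defaultdict


def naive_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips a trailing plural 's'."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalize(value: str) -> str:
    """Lowercase a value and stem each word so that variants compare equal."""
    return " ".join(naive_stem(w) for w in value.lower().split())


def capture(items, property_name, id_name):
    """Return (value, occurrences, item ids) rows, most frequent value first."""
    counts = Counter()
    sources = defaultdict(list)
    for item in items:
        for value in item.get(property_name, []):
            key = normalize(value)
            counts[key] += 1
            sources[key].append(item[id_name])
    return [(key, n, sources[key]) for key, n in counts.most_common()]


# Made-up items, for illustration only.
items = [
    {"url": "doc1", "Topic": ["Contracts", "Invoice"]},
    {"url": "doc2", "Topic": ["contract"]},
    {"url": "doc3", "Topic": ["Invoices", "Contract"]},
]
rows = capture(items, "Topic", "url")
for value, hits, found_in in rows:
    print(value, hits, found_in)
```

In this sketch, "Contracts", "contract", and "Contract" collapse into a single entry, which is the kind of grouping that stem-aware comparison is meant to achieve.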
Below, you can see a sample of captured information that contains the most frequent metadata values along with the documents they were found in.
How to Add the Metadata Values Component
- Navigate to the AutoClassifier Pipelines component page.
- Expand the New Component section and select Metadata Values Capture from the component list.
- Name your new component, and click the Add button in the lower right corner.
- Click Apply to save your changes.
- The Metadata Values Capture component is added to the list of your Existing Components.
How to Configure the Metadata Values Capture Component
To open the configuration menu, click the name of the component in the list of existing components.
- Captured Property Name:
- This is the name of the metadata value to be captured.
- Unique Identifier Name:
- This is the property (name) that stores the unique identifier of the item.
- For example, escbase_crawlurl if you want to capture metadata while crawling with Connectivity Hub.
- Language Property Name:
- This is the property (name) that holds the item language.
- No. of metadata captured per item:
- The number of metadata values to be captured for each item.
- Number of items to export to .csv:
- This is the number of items that are exported to a .csv file when you click the Download Captured Data button (all captured metadata will be exported if this field is left blank).
- Download Captured Data: The captured metadata is exported to a .csv file named in the format CapturedPropertyName_GUID.csv.
- Each row of the .csv contains the following:
- The captured metadata value
- The number of occurrences of this metadata
- The items that contain this metadata value
- The file is ordered from highest to lowest by number of hits
- Delete Property Captured Data:
- Clear all captured metadata values for a specific property (specified in Captured Property Name).
- Delete All Captured Data:
- Delete all captured metadata.
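For the manual cleanup of the exported file, a short script can help with a first pass. The sketch below reads rows of the kind described under Download Captured Data and keeps only values with at least a given number of hits. The exact column layout is an assumption (value, number of occurrences, then item identifiers joined with ";"), as are the function name `read_captured_values` and the sample data; check a real export before relying on it.

```python
import csv
import io


def read_captured_values(handle, min_hits=1):
    """Read (value, hits, items) rows and keep those with at least min_hits.

    Assumed layout per row: value, number of occurrences, item ids joined by ';'.
    """
    accepted = []
    for row in csv.reader(handle):
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        try:
            hits = int(row[1])
        except ValueError:
            continue  # skip a header row or any non-numeric count
        if hits >= min_hits:
            accepted.append((row[0], hits, row[2].split(";")))
    return accepted


# Made-up export content, for illustration only.
sample = io.StringIO(
    "contract,3,doc1;doc2;doc3\n"
    "invoice,2,doc1;doc3\n"
    "misc,1,doc2\n"
)
accepted = read_captured_values(sample, min_hits=2)
for value, hits, found_in in accepted:
    print(value, hits, len(found_in))
```

With a real export, you would pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the in-memory sample.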
Item Sorter
This pipeline stage saves items, or metadata of items, to disk, sorted according to their metadata and the CSV values file. That file is captured using the Metadata Values Capturer stage during item processing. The aim is to obtain groups of similar documents or metadata values (for example, document Summaries) that can be used as Machine Learning training data.
The advantage of this stage is that the Machine Learning models are trained on the current set of data, metadata values, tags, and so on. Essentially, this stage helps the machine learning model learn from the data that customers already have and apply what it learned to new incoming data.
Instead of capturing the entire document, you can choose to capture an alternate metadata value that may be more relevant, such as Summaries.
How to Add the Item Sorter Component
- Navigate to the AutoClassifier Pipelines component page.
- Click New Component and select Item Sorter from the components list.
- Name your new component, and click the Add button in the lower right corner.
- Click Apply to save your changes.
- The Item Sorter component will be added to your Existing Components list.
How to Configure the Item Sorter Component
This component saves files to disk, sorted according to their metadata and the .csv file captured using the Metadata Values Capturer stage.
- Captured Property Name:
- This is the name of the metadata value that was captured using the Metadata Values Capturer stage.
- Accepted Values CSV File Path:
- This is the path to the .csv file exported from the Metadata Values Capturer stage.
- Captured Files Storage Location:
- This is the location where the Item Sorter will create the new folders and files.
- File Name Property Name:
- This is the property (name) that holds the item name.
- Language Property Name:
- This is the property (name) that holds the item language.
- Alternate Metadata Used For Capture:
- Property (name) that holds other relevant metadata (for example the DocumentSummary property).
- Free disk space limit (GB):
- If the available disk space is lower than this limit, the Item Sorter will no longer write files to the disk.
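To make the interplay of these settings more concrete, here is a minimal Python sketch of how sorting with a free disk space limit could work. It is an illustration under assumptions, not the Item Sorter's actual code: the function `sort_item`, its parameters, the item dictionary, the `.txt` extension, and the `body` fallback property are all hypothetical. Only `shutil.disk_usage` and `pathlib` are standard library calls.

```python
import shutil
import tempfile
from pathlib import Path


def sort_item(item, accepted_values, storage_root, captured_property,
              file_name_property, alternate_property=None,
              free_space_limit_gb=1.0):
    """Write one file per accepted metadata value into a folder named after it."""
    root = Path(storage_root)
    root.mkdir(parents=True, exist_ok=True)

    # Stop writing when the free space drops below the configured limit.
    free_gb = shutil.disk_usage(root).free / 1024**3
    if free_gb < free_space_limit_gb:
        return []

    # Use the alternate metadata (for example a summary) if one is configured;
    # "body" is a hypothetical property holding the full document text.
    source = alternate_property if alternate_property else "body"
    text = item.get(source, "")

    written = []
    for value in item.get(captured_property, []):
        if value not in accepted_values:
            continue  # value was removed during the manual CSV cleanup
        folder = root / value
        folder.mkdir(exist_ok=True)
        target = folder / f"{item[file_name_property]}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Made-up item, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    item = {"name": "report-01", "Topic": ["contract", "draft"],
            "DocumentSummary": "Short summary."}
    paths = sort_item(item, {"contract"}, tmp, "Topic", "name",
                      alternate_property="DocumentSummary",
                      free_space_limit_gb=0)
    print([p.relative_to(tmp).as_posix() for p in paths])
```

The sketch checks free space once per item before writing anything, which keeps the check cheap; a real implementation might check more or less often.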
Example of Disk Output
The disk output of the Item Sorter consists of sub-folders named after the metadata values. In each sub-folder, you can find a summary of relevant information for the captured metadata value.