How to Detect Duplicates

Overview
How to Add the Duplicates Detection Component
How to Configure the Duplicates Detection Component

Overview

The Duplicates Detection stage uses the configured input metadata to calculate a document hash, then looks for the closest matching document as measured by the Jaccard similarity index.
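
As a rough illustration of the Jaccard similarity index, the sketch below compares two texts as sets of overlapping word sequences (shingles). This page does not describe how the stage builds its document hash, so the `shingles` helper, its `size` parameter, and the choice of word-level shingles are assumptions made purely for the example.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text.

    Hypothetical feature extraction; the real stage may hash documents differently.
    """
    words = text.lower().split()
    if len(words) < size:
        # Short texts yield a single shingle (or none if the text is empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined here as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")
similarity_percent = jaccard(doc_a, doc_b) * 100
```

A value of 100 means the two shingle sets are identical, and 0 means they share nothing.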

The document hash is compared against the existing clusters. If a match with a similarity greater than the specified threshold is found, the GUID of the cluster with the highest similarity is returned as metadata.
If no match is found, the document hash is added to the database as a new cluster and is used in future comparisons.
In this case, the output property is the GUID of the newly added cluster.
The Duplicates Detection stage only handles comparisons, not text extraction.
Additional stages might be needed to extract the information to be compared.
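
The match-or-create behavior described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the in-memory `clusters` dictionary stands in for the database, the feature sets stand in for document hashes, and `assign_cluster` and its parameters are hypothetical names.

```python
import uuid


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined here as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def assign_cluster(doc_features, clusters, threshold_percent):
    """Return the GUID of the best-matching cluster, or of a newly created one.

    clusters maps a GUID string to the feature set stored for that cluster
    (an assumed stand-in for the clusters kept in the database).
    """
    best_guid, best_score = None, 0.0
    for guid, features in clusters.items():
        score = jaccard(doc_features, features)
        if score > best_score:
            best_guid, best_score = guid, score

    # A match must be strictly greater than the configured threshold.
    if best_guid is not None and best_score * 100 > threshold_percent:
        return best_guid

    # No match: store the document as a new cluster for future comparisons.
    new_guid = str(uuid.uuid4())
    clusters[new_guid] = doc_features
    return new_guid


clusters = {}
first = assign_cluster(frozenset({"a", "b", "c", "d"}), clusters, 80)
repeat = assign_cluster(frozenset({"a", "b", "c", "d"}), clusters, 80)
other = assign_cluster(frozenset({"x", "y"}), clusters, 80)
```

Here `repeat` equals `first` because the second document matches the stored cluster exactly, while `other` receives a new GUID.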

How to Add the Duplicates Detection Component

Procedure:

Navigate to the AutoClassifier Pipelines component page.
Click New Component and select Duplicates Detection from the component list.
Name your new component, and click the Add button in the lower right corner.
Click Apply to save your changes.
The Duplicates Detection component will be added to your Existing Components list.

How to Configure the Duplicates Detection Component

Similarity Detection Property
- Name of the metadata value to be used for Duplicates Detection
- Default: body
Similarity Threshold (%)
- Similarity percentage that a document must exceed to be declared a duplicate of an existing cluster
Delete All Clusters from Database
- Deletes all cluster IDs from the database

Output Properties

Property	Type	Multivalue	Description
DuplicateDetectionClusterGuid	GUID	No	The GUID of the closest duplicate cluster, or of the newly added cluster if no match is found