Connector Scripting
- How to Manipulate a Crawled Item Before It Leaves the Web Service
- How to Write a Custom Stage Using the Scripting Stage Template
How to Manipulate a Crawled Item Before It Leaves the Web Service
To manipulate a crawled item before it leaves the web service, you must implement the IProcessingStage interface and configure the Connector to call your implementation.
- A class that implements IProcessingStage is referred to as a scripting stage.
- The only member that requires implementation is the ProcessItem method.
This method has two parameters:
void ProcessItem(ItemReturn itemReturn, ConnectionParams param);
ItemReturn represents the item to be modified. ConnectionParams contains connection-related information such as ContentID, CrawlID, Id (the item ID), FolderId, and so on.
public class ItemReturn : ReturnInterface
{
    public MetaDataValue[] metadata; // list of values to be added to the index
    public string[] grantUsers; // list of user IDs that can access the document
    public string[] denyUser; // list of user IDs that are explicitly denied access; overrides grant
    public string[] grantGroups; // list of group IDs that can access the document
    public string[] denyGroups; // list of group IDs that are explicitly denied access; overrides grant
    // for incremental crawling this must be set to the last update date of the item
    public DateTime lastUpdateDate = DateTime.Now.ToUniversalTime();
    public bool isPublic = false; // marked as public in the source system
    public bool notFound = false; // removes the item from the index
    public bool noChange = false; // tells the index not to update the item: no change since lastUpdateDate
    /// <summary>
    /// Is there a file, and how is the data being returned?
    /// BE CAREFUL! If this is not a temp file, change the type to permanent; otherwise the file is deleted during the crawl.
    /// </summary>
    public FileReturnType fileRetType = FileReturnType.NoFile;
    // if a temp file was created, this is the path to it
    public string filePath;
    public string fileExtension; // overrides the original extension
    public byte[] fileData; // if passing the file as bytes rather than a path, leave filePath blank
    public bool fileIsNativeZip = false; // set this to true when the zip-type file was not created by you
    // use these fields for fixed values; they are ignored unless you set DiscoveryInfo correctly
    public string title = ""; // set TitleEditable = false
    public string author = ""; // set AuthorEditable = false
    public string url = ""; // set URLEditable = false
    public string container = ""; // set ContainerURLEditable = false
    // default mailbox fields; these are only valid for mailbox types
    public string propemailfolder = "";
    public string propemailfoldertype = "";
    public string propemailmailbox = "";
    // standard props: not required but helpful if available
    public string propdescription = "";
    public string propsubject = "";
    public string propversionlabel = "";
    public string propstatus = "";
    public string propkeywords = "";
    // if available
    public string dataStoreTypeID = "";
    public string dataStoreTypeName = "";
    public byte[] binaryACL;
    public bool binaryACLReturned = false;
    public string[] propemailTo;
    public string propemailFrom = "";
    public string[] propemailCC;
    public DateTime propemailSentRedv = DateTime.MinValue;
}
public class ConnectionParams
{
    public ConnectionInfo ConnectionInformation { get; set; }
    public string Id { get; set; }
    public string SubId { get; set; }
    public string FolderId { get; set; }
    public DataStoreInfo DataStoreInformation { get; set; }
    public int MaxFileSize { get; set; }
    public DateTime LastUpdate { get; set; }
    public bool IsIncremental { get; set; }
    public int CrawlID { get; set; }
    public int ContentID { get; set; }
    public string Crawler { get; set; }
}
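As an illustration of the ProcessItem contract described above, the following is a minimal sketch of a stage that changes one keyword field and one security field. The ItemReturn, ConnectionParams, and IProcessingStage types here are cut-down stand-ins so that the sketch compiles on its own; in a real project they come from the SPWorks.Search.Service.Interfaces assembly, and the real interface may declare more members than shown. The TaggingStage class, the "content-" keyword, and the user ID "contractor01" are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Stand-in types (assumption): only the members used below are reproduced.
public class ItemReturn
{
    public string[] denyUser;
    public bool isPublic = false;
    public string propkeywords = "";
}

public class ConnectionParams
{
    public string Id { get; set; }
    public int ContentID { get; set; }
}

public interface IProcessingStage
{
    void ProcessItem(ItemReturn itemReturn, ConnectionParams param);
}

// Hypothetical stage: tags each item and denies one user on non-public items.
public class TaggingStage : IProcessingStage
{
    private const string BlockedUser = "contractor01"; // hypothetical user ID

    public void ProcessItem(ItemReturn itemReturn, ConnectionParams param)
    {
        // Record the content source in the keywords field if it is still empty.
        if (string.IsNullOrEmpty(itemReturn.propkeywords))
        {
            itemReturn.propkeywords = "content-" + param.ContentID;
        }

        // Items marked public are left alone; all others get the deny entry.
        if (!itemReturn.isPublic)
        {
            var denied = new List<string>(itemReturn.denyUser ?? new string[0]);
            if (!denied.Contains(BlockedUser))
            {
                denied.Add(BlockedUser);
            }
            itemReturn.denyUser = denied.ToArray();
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        var item = new ItemReturn();
        var param = new ConnectionParams { Id = "88", ContentID = 3 };
        new TaggingStage().ProcessItem(item, param);
        Console.WriteLine(item.propkeywords);
        Console.WriteLine(string.Join(",", item.denyUser));
    }
}
```

The stage mutates the ItemReturn it receives rather than returning a new object, which matches the void signature of ProcessItem.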
How to Extract Text from Different File Types
TikaExtractorStage is a built-in stage in Connectivity Hub that uses the Tika library to extract text from over 1,000 different file types.
- The implementation of this class is found in the SPWorks.Search.Service.Interfaces assembly.
To make a Connector use this stage, modify the Connector web.config file to reference this stage:
- Go to the configuration node and add a new section as shown below:
<configSections>
  <sectionGroup name="applicationSettings" type="System.Configuration.ApplicationSettingsGroup, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">
    <section name="SPWorks.Search.Service.Interfaces.Properties.Settings" type="System.Configuration.ClientSettingsSection, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
  </sectionGroup>
</configSections>
- Under the configuration node, add the applicationSettings node as shown below:
<applicationSettings>
  <SPWorks.Search.Service.Interfaces.Properties.Settings>
    <setting name="ProcessingEnabled" serializeAs="String">
      <value>True</value>
    </setting>
    <setting name="MaxCharacters" serializeAs="String">
      <value>200000</value>
    </setting>
    <setting name="MinSizeMb" serializeAs="String">
      <value>60</value>
    </setting>
    <setting name="ProcessingStages" serializeAs="Xml">
      <value>
        <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                       xmlns:xsd="http://www.w3.org/2001/XMLSchema">
          <string>SPWorks.Search.Service.Interfaces.Processing.TikaExtractorStage, SPWorks.Search.Service.Interfaces</string>
        </ArrayOfString>
      </value>
    </setting>
  </SPWorks.Search.Service.Interfaces.Properties.Settings>
</applicationSettings>
- Run the Content Test Bench to confirm the changes.
- In the Connector log, at the debug level, you see the following:
DEBUG (null) - [iCustomConnector2: ItemData] Calling processing framework for item with id 88
DEBUG (null) - [ProcessingFramework] Calling initialize method
DEBUG (null) - [ProcessingFramework] Checking if stage TikaExtractorStage should be processed or not
DEBUG (null) - [TikaExtractorStage] Processing item started for id: 88
How to Write a Custom Stage Using the Scripting Stage Template
To help you write a custom stage, Connectivity Hub includes another built-in stage template called ScriptingStage.
- This stage implements the required interface but lets you define the stage body: you write the body of the ProcessItem() method.
- The following example replaces the unstructured data for all items with “Hello world!”:
- Add this text to a file named ScriptExample.cs and note its path:
item.fileData = System.Text.Encoding.UTF8.GetBytes("Hello world!");
- To make a Connector use this stage, you must modify the Connector web.config file (<connector installation root>/Admin site directory) to reference this stage.
In the configSections node, add the new section shown below:
<configSections>
  <sectionGroup name="applicationSettings" type="System.Configuration.ApplicationSettingsGroup, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">
    <section name="SPWorks.Search.Service.Interfaces.Properties.Settings" type="System.Configuration.ClientSettingsSection, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
  </sectionGroup>
</configSections>
Add the applicationSettings node and specify all of the stage details: the full class name and the path to the file containing the ProcessItem() method implementation.
<applicationSettings>
  <SPWorks.Search.Service.Interfaces.Properties.Settings>
    <setting name="ProcessingEnabled" serializeAs="String">
      <value>True</value>
    </setting>
    <setting name="ScriptingFilePath" serializeAs="Xml">
      <value>
        <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                       xmlns:xsd="http://www.w3.org/2001/XMLSchema">
          <string>ScriptExample.cs</string>
        </ArrayOfString>
      </value>
    </setting>
    <setting name="ProcessingStages" serializeAs="Xml">
      <value>
        <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                       xmlns:xsd="http://www.w3.org/2001/XMLSchema">
          <string>SPWorks.Search.Service.Interfaces.Processing.ScriptingStage, SPWorks.Search.Service.Interfaces, Version=1.0.0.1, Culture=neutral, PublicKeyToken=a39cd09d2bd50b98</string>
        </ArrayOfString>
      </value>
    </setting>
  </SPWorks.Search.Service.Interfaces.Properties.Settings>
</applicationSettings>
You can add as many stages as you want:
<value>
  <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <string>SPWorks.Search.Service.Interfaces.Processing.ScriptingStage, SPWorks.Search.Service.Interfaces, Version=1.0.0.1, Culture=neutral, PublicKeyToken=a39cd09d2bd50b98</string>
    <string>Full class name for Stage2</string>
    <string>Full class name for Stage3</string>
  </ArrayOfString>
</value>
Note: If ProcessingEnabled (see this setting in the web.config file) is set to false, all of the stages are skipped.
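Because a typo in web.config can silently leave a stage out, it may help to check the file before running a crawl. The sketch below is a stand-alone check written for illustration, not the framework's own loading logic: it parses a configuration string, returns the stage names listed under ProcessingStages, and returns an empty list when ProcessingEnabled is not True. The StageConfigReader class and its method names are hypothetical, and the sample configuration is abbreviated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class StageConfigReader
{
    private const string SettingsNode = "SPWorks.Search.Service.Interfaces.Properties.Settings";

    // Returns the <value> element of the named setting, or null if it is absent.
    private static XElement FindValue(XDocument doc, string settingName)
    {
        return doc.Descendants(SettingsNode)
                  .Elements("setting")
                  .Where(s => (string)s.Attribute("name") == settingName)
                  .Select(s => s.Element("value"))
                  .FirstOrDefault();
    }

    // Lists the configured stage names in file order; empty when processing is disabled.
    public static List<string> ReadActiveStages(string configXml)
    {
        XDocument doc = XDocument.Parse(configXml);

        XElement enabled = FindValue(doc, "ProcessingEnabled");
        bool isEnabled;
        if (enabled == null || !bool.TryParse(enabled.Value.Trim(), out isEnabled) || !isEnabled)
        {
            return new List<string>();
        }

        XElement stages = FindValue(doc, "ProcessingStages");
        if (stages == null)
        {
            return new List<string>();
        }

        return stages.Descendants("string")
                     .Select(e => e.Value.Trim())
                     .Where(name => name.Length > 0)
                     .ToList();
    }

    public static void Main()
    {
        // Abbreviated sample configuration; single quotes keep the C# string readable.
        string config =
            "<configuration><applicationSettings>" +
            "<SPWorks.Search.Service.Interfaces.Properties.Settings>" +
            "<setting name='ProcessingEnabled' serializeAs='String'><value>True</value></setting>" +
            "<setting name='ProcessingStages' serializeAs='Xml'><value><ArrayOfString>" +
            "<string>SPWorks.Search.Service.Interfaces.Processing.TikaExtractorStage, SPWorks.Search.Service.Interfaces</string>" +
            "<string> BodyProcessingStage.StageProcessing, BodyProcessingStage </string>" +
            "</ArrayOfString></value></setting>" +
            "</SPWorks.Search.Service.Interfaces.Properties.Settings>" +
            "</applicationSettings></configuration>";

        foreach (string stage in ReadActiveStages(config))
        {
            Console.WriteLine(stage);
        }

        string disabled = config.Replace("<value>True</value>", "<value>False</value>");
        Console.WriteLine(ReadActiveStages(disabled).Count);
    }
}
```

Trimming each entry guards against stray spaces inside the string elements.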
How to Change Metadata, Security, and Unstructured Data
You can implement stages to change metadata, security, and/or unstructured data.
- Depending on what your stage does, specify at least one attribute that defines which area your stage affects.
- For example, TikaExtractorStage uses this tag:
[AppliedProcessing(ProcessingTypeFlags.BlobProcessing)]
- If you select Skip FileData stages from the Test Bench, the Tika stage is skipped.
You can define the ScriptingStage.template to change any of these areas (metadata, security, or blob).
For this reason, this class applies all tags:
namespace SPWorks.Search.Service.Interfaces.Processing
{
[AppliedProcessing(ProcessingTypeFlags.BlobProcessing | ProcessingTypeFlags.MetadataProcessing | ProcessingTypeFlags.SecurityProcessing)]
public class ScriptingStage : IProcessingStage
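The way such tags combine can be sketched with a flags enum and a reflection lookup. Everything in the block below is a stand-in written for illustration: the numeric values of ProcessingTypeFlags, the shape of AppliedProcessingAttribute, and the StageFilter helper are assumptions, and how the real framework combines these flags with Test Bench options is not described here.

```csharp
using System;

// Stand-in enum (assumption): the real numeric values are not documented here.
[Flags]
public enum ProcessingTypeFlags
{
    None = 0,
    BlobProcessing = 1,
    MetadataProcessing = 2,
    SecurityProcessing = 4
}

// Stand-in attribute (assumption) that stores the declared areas.
[AttributeUsage(AttributeTargets.Class)]
public sealed class AppliedProcessingAttribute : Attribute
{
    public ProcessingTypeFlags Flags { get; }

    public AppliedProcessingAttribute(ProcessingTypeFlags flags)
    {
        Flags = flags;
    }
}

[AppliedProcessing(ProcessingTypeFlags.BlobProcessing)]
public class BlobOnlyStage { }

[AppliedProcessing(ProcessingTypeFlags.BlobProcessing | ProcessingTypeFlags.MetadataProcessing | ProcessingTypeFlags.SecurityProcessing)]
public class AllAreasStage { }

public static class StageFilter
{
    // True when the stage type declares the given area in its attribute.
    public static bool Declares(Type stageType, ProcessingTypeFlags area)
    {
        var attr = (AppliedProcessingAttribute)Attribute.GetCustomAttribute(
            stageType, typeof(AppliedProcessingAttribute));
        return attr != null && (attr.Flags & area) != 0;
    }
}

public static class FlagsDemo
{
    public static void Main()
    {
        Console.WriteLine(StageFilter.Declares(typeof(BlobOnlyStage), ProcessingTypeFlags.BlobProcessing));
        Console.WriteLine(StageFilter.Declares(typeof(BlobOnlyStage), ProcessingTypeFlags.MetadataProcessing));
        Console.WriteLine(StageFilter.Declares(typeof(AllAreasStage), ProcessingTypeFlags.SecurityProcessing));
    }
}
```

Combining values with the bitwise OR operator lets a single attribute declare several areas, which is why a stage such as ScriptingStage can carry all three.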
- The StageProcessing class below implements the IProcessingStage interface.
- To get access to the IProcessingStage interface, your project must reference the SPWorks.Search.Service.Interfaces assembly.
- StageProcessing reads the item file data, which is assumed to be an XML file, and logs its nodes.
using BAInsight.Logging;
using SPWorks.Search.Service.Interfaces;
using SPWorks.Search.Service.Interfaces.Attributes;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml;
namespace BodyProcessingStage
{
    [AppliedProcessing(ProcessingTypeFlags.BlobProcessing)]
    public class StageProcessing : IProcessingStage
    {
        private static readonly ILogger log = LogManager.GetLogger(typeof(StageProcessing));

        public void ProcessItem(ItemReturn itemReturn, ConnectionParams param)
        {
            log.Info("Calling ProcessItem in StageProcessing for item with id:{0} from content {1}", param.Id, param.ContentID);
            byte[] body = itemReturn.fileData;
            if (body == null || body.Length == 0)
            {
                return; // nothing to parse for this item
            }
            // The file data is expected to be a UTF-8 encoded XML document.
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(Encoding.UTF8.GetString(body));
            // Log the known elements found under every <customer> node.
            foreach (XmlNode customer in doc.GetElementsByTagName("customer"))
            {
                LogElement(customer, "customerMiddleInitial");
                LogElement(customer, "FirstName");
                LogElement(customer, "LastName");
                LogElement(customer, "customerBirth");
            }
        }

        // Logs the text of the first descendant element with the given name, if present.
        private static void LogElement(XmlNode parent, string elementName)
        {
            XmlNode node = parent.SelectSingleNode(".//" + elementName);
            if (node != null)
            {
                log.Info(node.InnerText);
            }
        }
    }
}
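To try this kind of XML reading outside the connector, a stand-alone sketch such as the following can be run against a sample payload. The XML shape (a customers root with FirstName and LastName nested under name) is an assumption made for the example, and CustomerXmlReader is a hypothetical helper that collects values in a list instead of writing them to the connector log.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class CustomerXmlReader
{
    private static readonly string[] Fields =
    {
        "customerMiddleInitial", "FirstName", "LastName", "customerBirth"
    };

    // Returns "name=value" pairs for each known element found under a <customer> node.
    public static List<string> ReadCustomerValues(byte[] fileData)
    {
        var values = new List<string>();
        if (fileData == null || fileData.Length == 0)
        {
            return values;
        }

        // Loading from a stream lets XmlDocument detect the encoding and skip a byte order mark.
        var doc = new XmlDocument();
        using (var stream = new MemoryStream(fileData))
        {
            doc.Load(stream);
        }

        foreach (XmlNode customer in doc.GetElementsByTagName("customer"))
        {
            foreach (string field in Fields)
            {
                // ".//" matches the element at any depth below the customer node.
                XmlNode node = customer.SelectSingleNode(".//" + field);
                if (node != null)
                {
                    values.Add(field + "=" + node.InnerText);
                }
            }
        }
        return values;
    }

    public static void Main()
    {
        // Hypothetical sample payload; the middle initial is deliberately absent.
        string xml =
            "<customers><customer>" +
            "<name><FirstName>Ada</FirstName><LastName>Lovelace</LastName></name>" +
            "<customerBirth>1815-12-10</customerBirth>" +
            "</customer></customers>";

        foreach (string value in ReadCustomerValues(Encoding.UTF8.GetBytes(xml)))
        {
            Console.WriteLine(value);
        }
    }
}
```

Missing elements are skipped rather than treated as errors, so one incomplete customer record does not stop the rest from being read.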
Modify the web.config File
To make a Connector use this stage, you must modify the Connector web.config file to reference this stage.
- In the configuration node, add a new section as shown below:
<configSections>
  <sectionGroup name="applicationSettings" type="System.Configuration.ApplicationSettingsGroup, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">
    <section name="SPWorks.Search.Service.Interfaces.Properties.Settings" type="System.Configuration.ClientSettingsSection, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
  </sectionGroup>
</configSections>
- Under the configuration node, add the applicationSettings node as shown below:
<applicationSettings>
  <SPWorks.Search.Service.Interfaces.Properties.Settings>
    <setting name="ProcessingEnabled" serializeAs="String">
      <value>True</value>
    </setting>
    <setting name="ProcessingStages" serializeAs="Xml">
      <value>
        <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                       xmlns:xsd="http://www.w3.org/2001/XMLSchema">
          <string>BodyProcessingStage.StageProcessing, BodyProcessingStage, Version=1.0.0.0, Culture=neutral, PublicKeyToken=a39cd09d2bd50b98</string>
        </ArrayOfString>
      </value>
    </setting>
  </SPWorks.Search.Service.Interfaces.Properties.Settings>
</applicationSettings>