Connector Scripting

How to Manipulate a Crawled Item Before It Leaves the Web Service

To manipulate a crawled item before it leaves the web service, implement the IProcessingStage interface and configure the Connector to call your implementation.

  • A class that implements IProcessingStage is referred to as a scripting stage.

  • The only member that you must implement is the ProcessItem method.

This method has two parameters:

void ProcessItem(ItemReturn itemReturn, ConnectionParams param);

ItemReturn represents the object to be modified, and ConnectionParams contains connection-related information such as ContentID, CrawlID, Id, and FolderId.

ItemReturn and ConnectionParams Parameters
public class ItemReturn: ReturnInterface
{
  public MetaDataValue[] metadata; // list of values to be added to the index
  public string[] grantUsers; // list of user id's that can access the document
  public string[] denyUser; // list of user id's that are explicitly denied access. overrides grant
  public string[] grantGroups; // list of group id's that can access the document
  public string[] denyGroups; // list of group id's that are explicitly denied access. overrides grant
  // for incremental crawling this must be set to the last update date of the item.
  public DateTime lastUpdateDate = DateTime.Now.ToUniversalTime();
  public bool isPublic = false; // means marked as public in source system.
  public bool notFound = false; // removes from index
  public bool noChange = false; // tells index not to update the item, not change since the lastUpdateDate
  /// <summary>
  /// is there a file and how is the data being returned
  /// BE CAREFUL!!!!. If not a temp file then change type to permanent else it will be deleted during the crawl
  /// </summary>
  public FileReturnType fileRetType = FileReturnType.NoFile;
  // if temp file created this is the path to it
  public string filePath;
  public string fileExtension; //overrides the original extension
  public byte[] fileData; // if passing the file in bytes and not path. Then leave filePath blank.
  public bool fileIsNativeZip = false; //set this to true when zip type file is not created by you.
  //use these fields for fixed values. They are ignored unless you Set DiscoveryInfo correctly
  public string title = ""; // set TitleEditable = false;
  public string author = ""; // set AuthorEditable = false;
  public string url = ""; // set URLEditable = false
  public string container = ""; // set ContainerURLEditable = false
  // default mailbox fields . these are only valid for mailbox types
  public string propemailfolder = "";
  public string propemailfoldertype = "";
  public string propemailmailbox = "";
  //standard props - not required but helpful if available
  public string propdescription = "";
  public string propsubject = "";
  public string propversionlabel = "";
  public string propstatus = "";
  public string propkeywords = "";
  //if available
  public string dataStoreTypeID = "";
  public string dataStoreTypeName = "";
  public byte[] binaryACL;
  public bool binaryACLReturned = false;
  public string[] propemailTo;
  public string propemailFrom = "";
  public string[] propemailCC;
  public DateTime propemailSentRedv = DateTime.MinValue;
}
public class ConnectionParams
{
  public ConnectionInfo ConnectionInformation {
    get;
    set;
  }
  public string Id {
    get;
    set;
  }

  public string SubId {
    get;
    set;
  }
  public string FolderId {
    get;
    set;
  }
  public DataStoreInfo DataStoreInformation {
    get;
    set;
  }
  public int MaxFileSize {
    get;
    set;
  }
  public DateTime LastUpdate {
    get;
    set;
  }
  public bool IsIncremental {
    get;
    set;
  }
  public int CrawlID {
    get;
    set;
  }
  public int ContentID {
    get;
    set;
  }
  public string Crawler {
    get;
    set;
  }
}
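
As a rough illustration of how the two parameters work together, the following sketch shows a stage that tags each item with its folder ID and adds one account to the deny list. The ItemReturn, ConnectionParams, and IProcessingStage declarations in the sample are cut-down stand-ins so that it compiles on its own; in a real project they come from the SPWorks.Search.Service.Interfaces assembly. The FolderTaggingStage class, the "folder:" keyword prefix, the comma separator, and the contractor01 user ID are hypothetical.

```csharp
using System;
using System.Linq;

// Stand-ins for the real types, reduced to the members this sketch touches.
public class ItemReturn
{
    public string[] denyUser;
    public string propkeywords = "";
}

public class ConnectionParams
{
    public string Id { get; set; }
    public string FolderId { get; set; }
}

public interface IProcessingStage
{
    void ProcessItem(ItemReturn itemReturn, ConnectionParams param);
}

// Hypothetical stage: tags each item with its folder ID and denies one account.
public class FolderTaggingStage : IProcessingStage
{
    private const string BlockedUser = "contractor01"; // illustrative user id

    public void ProcessItem(ItemReturn itemReturn, ConnectionParams param)
    {
        // Append a folder tag to any keywords already on the item.
        // The comma separator is an assumption made for this sketch.
        string tag = "folder:" + param.FolderId;
        itemReturn.propkeywords = string.IsNullOrEmpty(itemReturn.propkeywords)
            ? tag
            : itemReturn.propkeywords + "," + tag;

        // Add the blocked user to the deny list without duplicating it.
        string[] denied = itemReturn.denyUser ?? new string[0];
        if (!denied.Contains(BlockedUser))
        {
            itemReturn.denyUser = denied.Concat(new[] { BlockedUser }).ToArray();
        }
    }
}
```

The stage only changes the ItemReturn object that it receives; it treats ConnectionParams as read-only context.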

How to Extract Text from Different File Types

The TikaExtractorStage is a built-in stage in Connectivity Hub that uses the Tika library to extract text from over 1,000 different file types.

  • This class implementation is found in the SPWorks.Search.Service.Interfaces assembly.

To make a Connector use this stage, modify the Connector web.config file to reference this stage:

  1. Go to the configuration node and add a new section as shown below:

    Config node: new section
    <configSections>
        <sectionGroup name="applicationSettings" type="System.Configuration.ApplicationSettingsGroup, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">
            <section name="SPWorks.Search.Service.Interfaces.Properties.Settings" type="System.Configuration.ClientSettingsSection, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
        </sectionGroup>
    </configSections>
  2. Under the configuration node, add the applicationSettings node as shown below:

    Config node: applicationSettings node
    <applicationSettings>
       <SPWorks.Search.Service.Interfaces.Properties.Settings>
          <setting name="ProcessingEnabled" serializeAs="String">
             <value>True</value>
          </setting>
          <setting name="MaxCharacters" serializeAs="String">
            <value>200000</value>
          </setting>
          <setting name="MinSizeMb" serializeAs="String">
            <value>60</value>
          </setting>
          <setting name="ProcessingStages" serializeAs="Xml">
             <value>
                <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
                   <string>SPWorks.Search.Service.Interfaces.Processing.TikaExtractorStage, SPWorks.Search.Service.Interfaces</string>
                </ArrayOfString>
             </value>
          </setting>
       </SPWorks.Search.Service.Interfaces.Properties.Settings>
    </applicationSettings>
  3. Run the Content Test Bench to confirm the changes.

  4. In the Connector log, at the debug level, you should see entries similar to the following:
DEBUG (null) - [iCustomConnector2: ItemData] Calling processing framework for item with id 88
DEBUG (null) - [ProcessingFramework] Calling initialize method
DEBUG (null) - [ProcessingFramework] Checking if stage TikaExtractorStage should be processed or not
DEBUG (null) - [TikaExtractorStage] Processing item started for id: 88

How to Write a Custom Stage Using the Scripting Stage Template

To help you write a custom stage, Connectivity Hub includes another built-in stage template called ScriptingStage.

  • This stage implements the required interface, but lets you define the stage body: you write the body of the method ProcessItem().

  • The following example replaces the unstructured data for all items with “Hello world!”:

  • Add this text into a file named ScriptExample.cs and remember its path:

item.fileData = System.Text.Encoding.UTF8.GetBytes("Hello world!");

  • To make a Connector use this stage, you must modify the connector web.config file (<connector installation root>/Admin site directory) to reference this stage.

In the configSections node, add the new section shown below:

configSection node: "Hello world!"
<configSections>
   <sectionGroup name="applicationSettings" type="System.Configuration.ApplicationSettingsGroup, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">
      <section name="SPWorks.Search.Service.Interfaces.Properties.Settings" type="System.Configuration.ClientSettingsSection, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
   </sectionGroup>
</configSections>

Add the applicationSettings node and specify all of the stage details: the full class name and the path to the file containing the ProcessItem() method implementation.

<applicationSettings>
   <SPWorks.Search.Service.Interfaces.Properties.Settings>
      <setting name="ProcessingEnabled" serializeAs="String">
         <value>True</value>
      </setting>
      <setting name="ScriptingFilePath" serializeAs="Xml">
         <value>
            <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema">
               <string>ScriptExample.cs</string>
            </ArrayOfString>
         </value>
      </setting>
      <setting name="ProcessingStages" serializeAs="Xml">
         <value>
            <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema">
               <string>SPWorks.Search.Service.Interfaces.Processing.ScriptingStage, SPWorks.Search.Service.Interfaces, Version=1.0.0.1, Culture=neutral, PublicKeyToken=a39cd09d2bd50b98</string>
            </ArrayOfString>
         </value>
      </setting>
   </SPWorks.Search.Service.Interfaces.Properties.Settings>
</applicationSettings>

You can add as many stages as you want, one <string> entry per stage:

<value>
   <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <string>SPWorks.Search.Service.Interfaces.Processing.ScriptingStage, SPWorks.Search.Service.Interfaces, Version=1.0.0.1, Culture=neutral, PublicKeyToken=a39cd09d2bd50b98</string>
      <string>Full class name for Stage2</string>
      <string>Full class name for Stage3</string>
   </ArrayOfString>
</value>

Note: If ProcessingEnabled (see this setting in the web.config file) is set to false, all of the stages are skipped.
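
The following sketch is one way to picture the behavior described above: stages run one after another, and nothing runs when processing is disabled. It is not the Connectivity Hub processing framework. The Item, IStage, StageRunner, UpperCaseStage, and SuffixStage names are invented for illustration, and running the stages in the order in which they are listed is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Invented stand-in for a crawled item.
public class Item
{
    public string Text = "";
}

// Invented stand-in for a processing stage.
public interface IStage
{
    void Process(Item item);
}

public class UpperCaseStage : IStage
{
    public void Process(Item item) { item.Text = item.Text.ToUpperInvariant(); }
}

public class SuffixStage : IStage
{
    public void Process(Item item) { item.Text += "!"; }
}

public static class StageRunner
{
    // Runs every stage in list order, or none at all when processing is disabled.
    public static void Run(IEnumerable<IStage> stages, bool processingEnabled, Item item)
    {
        if (!processingEnabled)
        {
            return; // mirrors ProcessingEnabled = False: all stages are skipped
        }
        foreach (IStage stage in stages)
        {
            stage.Process(item);
        }
    }
}
```

In this model, each stage sees the changes made by the stages before it, so the order of the entries matters.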

How to Change Metadata, Security, and Unstructured Data

You can implement stages to change metadata, security, and/or unstructured data.

  • Depending on what your stage does, specify at least one attribute flag that defines which area the stage affects.

  • For example, the TikaExtractor stage uses this tag: [AppliedProcessing(ProcessingTypeFlags.BlobProcessing)]

  • If you select Skip FileData stages from the Test Bench, the Tika stage is skipped.
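
To make the role of the attribute more concrete, the sketch below shows how a flags attribute of this kind could be read through reflection to decide whether a stage should run. Everything in it is an assumption made for illustration: the enum values, the AppliedProcessingAttribute stand-in, the StageFilter helper, and the rule that a stage is skipped only when all of its declared areas are skipped. The real types live in the SPWorks.Search.Service.Interfaces assembly and may behave differently.

```csharp
using System;
using System.Reflection;

// Stand-in enum; the numeric values are assumptions.
[Flags]
public enum ProcessingTypeFlags
{
    None = 0,
    BlobProcessing = 1,
    MetadataProcessing = 2,
    SecurityProcessing = 4
}

// Stand-in attribute that records which areas a stage affects.
[AttributeUsage(AttributeTargets.Class)]
public sealed class AppliedProcessingAttribute : Attribute
{
    public ProcessingTypeFlags Areas { get; }

    public AppliedProcessingAttribute(ProcessingTypeFlags areas)
    {
        Areas = areas;
    }
}

[AppliedProcessing(ProcessingTypeFlags.BlobProcessing)]
public class BlobOnlyStage { }

[AppliedProcessing(ProcessingTypeFlags.BlobProcessing | ProcessingTypeFlags.MetadataProcessing)]
public class BlobAndMetadataStage { }

public static class StageFilter
{
    // Returns true when the stage declares at least one area that is not skipped.
    public static bool ShouldRun(Type stageType, ProcessingTypeFlags skipped)
    {
        var attribute = stageType.GetCustomAttribute<AppliedProcessingAttribute>();
        if (attribute == null)
        {
            return true; // assumption: a stage without the attribute always runs
        }
        return (attribute.Areas & ~skipped) != ProcessingTypeFlags.None;
    }
}
```

Under these assumptions, skipping BlobProcessing filters out BlobOnlyStage but still runs BlobAndMetadataStage, because that stage also declares MetadataProcessing.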


You can define the ScriptingStage.template to change any of these (metadata, security, or blob).

For this reason, this class applies all tags:

Metadata, Security, Blob
namespace SPWorks.Search.Service.Interfaces.Processing
{
[AppliedProcessing(ProcessingTypeFlags.BlobProcessing | ProcessingTypeFlags.MetadataProcessing | ProcessingTypeFlags.SecurityProcessing)]
public class ScriptingStage : IProcessingStage

  • The StageProcessing class below implements the IProcessingStage interface.

  • To get access to the IProcessingStage interface, your project must reference the SPWorks.Search.Service.Interfaces assembly.

  • StageProcessing reads the item file data, which it assumes to be UTF-8 encoded XML, and logs selected nodes.

IProcessingStage Implemented
using BAInsight.Logging;
using SPWorks.Search.Service.Interfaces;
using SPWorks.Search.Service.Interfaces.Attributes;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml;
namespace BodyProcessingStage
{
  [AppliedProcessing(ProcessingTypeFlags.BlobProcessing)]
  public class StageProcessing: IProcessingStage
  {
    private static readonly ILogger log = LogManager.GetLogger(typeof (StageProcessing));
    public void ProcessItem(ItemReturn itemReturn, ConnectionParams param)
    {
      log.Info("Calling ProcessItem in StageProcessing for item with id:{0} from content {1}", param.Id, param.ContentID);
      var body = itemReturn.fileData;
      // Nothing to parse when the item carries no in-memory file data.
      if (body == null || body.Length == 0)
      {
        return;
      }
      XmlDocument doc = new XmlDocument();
      string xml = Encoding.UTF8.GetString(body);
      doc.LoadXml(xml);
      // Log selected child elements of every <customer> element.
      string[] childNames = { "customerMiddleInitial", "FirstName", "LastName", "customerBirth" };
      foreach (XmlNode customer in doc.GetElementsByTagName("customer"))
      {
        foreach (string childName in childNames)
        {
          // The XmlNode indexer returns null when no child element has this name.
          XmlElement child = customer[childName];
          if (child != null)
          {
            log.Info(child.InnerText);
          }
        }
      }
    }
  }
}
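
To try the XML handling outside of a Connector, the following self-contained sketch applies the same idea to a small sample document and returns the values instead of logging them. The CustomerXmlDemo class, the ReadCustomerFields method, and the sample XML (including the assumption that FirstName and LastName are direct children of customer) are hypothetical and are not part of Connectivity Hub.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

public static class CustomerXmlDemo
{
    // Hypothetical sample document; real item data may be structured differently.
    public const string SampleXml =
        "<customers><customer>" +
        "<FirstName>Jane</FirstName><LastName>Doe</LastName>" +
        "<customerMiddleInitial>Q</customerMiddleInitial>" +
        "<customerBirth>1990-01-01</customerBirth>" +
        "</customer></customers>";

    // Returns the text of selected child elements for every <customer> element.
    public static List<string> ReadCustomerFields(byte[] body)
    {
        var values = new List<string>();
        var doc = new XmlDocument();
        doc.LoadXml(Encoding.UTF8.GetString(body));

        string[] childNames = { "customerMiddleInitial", "FirstName", "LastName", "customerBirth" };
        foreach (XmlNode customer in doc.GetElementsByTagName("customer"))
        {
            foreach (string childName in childNames)
            {
                // The indexer returns null when the child element is missing.
                XmlElement child = customer[childName];
                if (child != null)
                {
                    values.Add(child.InnerText);
                }
            }
        }
        return values;
    }
}
```

Calling CustomerXmlDemo.ReadCustomerFields(Encoding.UTF8.GetBytes(CustomerXmlDemo.SampleXml)) returns Q, Jane, Doe, and 1990-01-01, in the order of the childNames array.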

Modify the web.config File

To make a Connector use this stage, modify the Connector web.config file to reference this stage.

  1. In the configuration node add a new section as shown below:

    New Section in Configuration Node
    <configSections>
       <sectionGroup name="applicationSettings" type="System.Configuration.ApplicationSettingsGroup, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089">
          <section name="SPWorks.Search.Service.Interfaces.Properties.Settings" type="System.Configuration.ClientSettingsSection, System, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
       </sectionGroup>
    </configSections>
  2. Under the configuration node, add the applicationSettings node as shown below:

    applicationSettings Node
    <applicationSettings>
       <SPWorks.Search.Service.Interfaces.Properties.Settings>
          <setting name="ProcessingEnabled" serializeAs="String">
             <value>True</value>
          </setting>
          <setting name="ProcessingStages" serializeAs="Xml">
             <value>
                <ArrayOfString xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xmlns:xsd="http://www.w3.org/2001/XMLSchema">
                   <string>BodyProcessingStage.StageProcessing, BodyProcessingStage, Version=1.0.0.0, Culture=neutral, PublicKeyToken=a39cd09d2bd50b98</string>
                </ArrayOfString>
             </value>
          </setting>
       </SPWorks.Search.Service.Interfaces.Properties.Settings>
    </applicationSettings>