Input Data settings

The Input Data settings specify how the input data must be interpreted. These settings are different for each data type. For a CSV file, for example, it is important to specify the delimiter that separates data fields. PDF files are already delimited naturally by pages, so the input data settings for PDF files are interpretation settings for text in the file.

CSV file Input Data settings

In a CSV file, data is read line by line, where each line can contain multiple fields. The input data settings specify to the DataMapper module how the fields are separated.

  • Field separator: Defines what character separates each field in the file. Even though CSV stands for comma-separated values, CSV can actually refer to files where fields are separated using any character, including commas, tabs, semicolons, and pipes.

  • Text delimiter: Defines what character surrounds text in the file, preventing the Field separator from being interpreted within those text delimiters. This ensures that, for example, the field “Smith; John” is not interpreted as two fields, even if the field delimiter is the semicolon.

  • Comment delimiter: Defines what character starts a comment line.

  • Encoding: Defines what encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE ).

  • Lines to skip: Defines a number of lines in the CSV that will be skipped and not used as records.

  • Set tabs as a field separator: Overwrites the Field separator option and sets the Tab character instead for tab-delimited files.

  • First row contains field names: Uses the first line of the CSV as headers, which automatically names all extracted fields.

  • Ignore unparseable lines: Ignores any line that does not correspond to the settings above.

  • Skip empty lines: Ignore any line that has no content. Note that spaces are considered content.

    • Sort on: Select a field on which to sort the data, in ascending (A-Z) or descending (Z-A) order. Note that sorting is always textual. Even if the selected column has numbers, it will be sorted as a text.

Excel file Input Data settings

There are no settings for field separation in an Excel file, only settings with regards to the file as a whole.

  • Lines to skip: Defines a number of lines in the Excel file that will be skipped and not used as records.

  • First row contains field names: Check this option to use the first line of the Excel file as headers. This option automatically names all extracted fields.

  • Sheet: Only one sheet can be selected as the data source.

  • Skip empty lines: Ignore any line that has no content. Note that spaces are considered content.

    • Sort on: Select a field on which to sort the data, in ascending (A-Z) or descending (Z-A) order. Note that sorting is always textual. Even if the selected column has numbers, it will be sorted as a text.

PDF file Input Data settings

These settings also apply to files that are converted to PDF by the DataMapper: PS, PCL and AFP files.

PDF Files have a natural, static delimiter in the form of pages, so the options here are interpretation settings for text in the PDF file.

The Input Data settings for PDF files determine how words, lines and paragraphs are detected in the PDF when creating data selections.

Each value represents a fraction of the average font size of text in a data selection, meaning "0.3" represents 30% of the height or width.

  • Word spacing: Determines the spacing between words. As PDF text spacing is somehow done through positioning instead of actual text spaces, text position is what is used to find new words. This option determines what percentage of the average width of a single character needs to be empty to consider a new word has started. The default value is 0.3, meaning a space is assumed if there is a blank area of 30% of the width of the average character in the font.

  • Line spacing: Determines the spacing between lines of text. The default value is 1, meaning the space between lines must be equal to at least the average character height.

  • Paragraph spacing: Determines the spacing between paragraphs. The default value is 1.5, meaning the space between paragraphs must be equal to at least 1.5 times the average character height to start a new paragraph.

  • Magic number: Determines the tolerance factor for all of the above values. The tolerance is meant to avoid rounding errors. If two values are more than 70% away from each other, they are considered distinct; otherwise they are the same. For example, if two characters have a space of exactly the width of the average character, any space of between 0.7 and 1.43 of this average width is considered one space. A space of 1.44 is considered to be 2 spaces.

  • PDF file color space: Determines if the PDF if displayed in Color or Monochrome in the Data Viewer. Monochrome display is faster in the Data Viewer. This has no influence on the actual data extraction or the data mapping performance.

  • Processing method: This option determines which search method is used by the Goto step's Next Occurrence of option. The most recent method searches for text within the specified constraints. It is more precise and more reliable than the Original method which searches for the target text in the entire page and then determines if the text appears within the specified constraints.
    Both methods may sometimes return slightly different values. Data mapping configurations made with previous version of the software are therefore set to use the original method by default. It is however recommended to modify them to use the newer method.

  • Auto-Rotation (PS files only): If in the original PostScript file the orientation of a page and its text don't match, this option sets the page's orientation to match the orientation of the text.

Database Input Data settings

Databases all return the same type of information. Therefore the Input Data options for a database refer to the database itself instead of to the data.

The following settings apply to any database or ODBC Data Sample.

  • Connection String: Displays the connection string used to access the Data Source.

  • Browse button : Opens the Edit Database configuration dialog, which can replace the existing database data source with a new one. This is the same as using the Replace feature in the Data Samples window.

  • Table: Displays the tables and stored procedures available in the database. The selected table is the one the data is extracted from. Clicking on any of the tables shows the first line of the data in that table.

  • Custom SQL button : Click to open the SQL Query Designer (seeSQL Query Designer) and type in a custom SQL query. If the database supports stored procedures, including inner joins, grouping and sorting, you can use custom SQL to make a selection from the database, using whatever language the database supports.
    The query may contain variables and properties, so that the selection will be dynamically adjusted each time the data mapping configuration is actually used in a Workflow process; seeUsing variables and properties in an SQL query.

  • Encoding: Defines what encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE ).

  • Sort on: Allows to select a field on which to sort the data, in ascending (A-Z) or descending (Z-A) order. Note that sorting is always textual. Even if the sorting column has numbers, it will be sorted as a text.
    With a Custom Query, this option is not available.

  • Skip empty lines: Ignore any row that has no content, e.g. only nulls or empty strings. Note that spaces are characters and thus considered content.

Text file Input Data settings

Because text files have many different shapes and sizes, there are many options for the input data in these files.

  • Encoding: Defines what encoding is used to read the Data Source ( US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE or UTF-16LE ).

  • Selection/Text is based on bytes: Check for text files that use double-bytes characters (resolves width issues in some text files).

  • Add/Remove characters: Defines the number of characters to add to, or remove from, the head of the data stream. The spin buttons can also increment or decrement the value. Positive values add blank characters while negative values remove characters.

  • Add/Remove lines: Defines the number of lines to add to, or remove from, the head of the data stream. The spin buttons can also increment or decrement the value. Positive values add blank lines while negative values remove lines.

  • Maximum line length: Any line that is longer than the given maximum line length will be split at the maximum line length, as often as necessary. This option is used to cut (and wrap) long lines into logical blocks of data.
    The maximum value for this option is 65,535 characters. The default value is 80 characters.

  • Page delimiter type: Defines the delimiter between each page of data. Multiples of such pages can be part of a record, as defined by the Boundaries.

    • On lines: Triggers a new page in the Data Sample after a number of lines.

      • Cut on number of lines: Triggers a new page after the given number of lines. With this number set to 1, and the Boundaries set to On delimiter, it is possible to create a record for each and every line in the file.

      • Cut on FF: Triggers a new page after a Form Feed character.

    • On text: Triggers a new page in the Data Sample when a specific string is found in a certain location.

      • Word to find: Compares the text value with the value in the data source.

      • Match case: Activates a case sensitive text comparison.

      • Location: Choose Selected area or Entire width to use the value of the current data selection as the text value.

      • Left/Right: Use the spin buttons to set the start and stop columns to the current data selection (Selected area) in the record.

      • Lines before/after: This option places the delimiter a certain number of lines before or after the current line. This is useful if the text that triggers the delimiter is not on the first line of each page.

  • Text from right to left: Sets the writing direction of the data source to right-to-left.

  • Expand tabs to spaces: Replaces tabs with the given number of spaces.

  • Ignore CR/LF/FF at end of file: Instructs the DataMapper to ignore any CR, LF, FF or CR/LF characters when they are the last characters in a file. This prevents the addition of an unintended trailing record when the data mapping configuration is set up to cut on every line.

XML File Input Data settings

For an XML file you can either choose to use the root node, or select an element type, to create a new delimiter every time that element is encountered.

Note that the data file is not split. The XML tree is scanned from top to bottom (node by node), and when an element matches the record definition condition, the parsing mechanism stops. All parent or sibling nodes found before that element are displayed as part of the record. Any nodes located after the record definition node are not displayed (unless the Show all elements option is enabled). Parsing then starts again, at the top, and looks for the next element that matches the record definition condition, ignoring elements for which a record has already been created.

  • Use root element: Selects the top-level element. No other boundaries can be set. If there is only one top-level element, there will only be one record.

    This setting is not recommended for XML data files is larger than 5 - 10MB. For files this large, it would be better to implement a Pre-processor script to pre-process the XML data file, or even convert it into CSV format.

  • Use specific element: Displays a list containing all the elements in the XML file. Selecting an element causes a new page of data to be created every time an instance of this element is encountered.

    Note that higher-level nodes above the selected element are shown, but those below the selected element are not displayed unless the Show all elements option is checked.

  • Use XPath: Enter an XPath to create a delimiter based on the node name of elements. For example:./*[starts-with(name(),'inv')]sets a delimiter after every element of which the name starts with 'inv'. Note thatstarts-with()is an XPath function. For an overview of XPath functions, see Mozilla: XPath Functions.

    The XPath may also contain JavaScript code. In order to use JavaScript:Note that since the XPath is a string, the return value of the JavaScript statement will be interpreted as a string.

    • The XPath must start with =

    • The entire JavaScript statement must be enclosed in curly brackets: {...}

    • Any other curly brackets that are not part of the JavaScript code must be escaped with a backslash.

    • Single line comments (//...) in the code are not supported.

    Note: Currently, XPaths that select elements based on an attribute, attribute value, node value, node counter or node index are not supported.

  • Show all elements: When the delimiter is set to a specific element or XPath, checking this option allows to extract information from higher-level nodes, including those that follow the element or path. This might slow down the processing, so if you don't need any information from the higher-level nodes that follow that specific element, it is recommended to leave this option unchecked.

    When this option is used in combination with a trigger element that is not repeated at the same node level (in other words, it doesn't have a sibling with the same name), the entire XML document will be shown for each record, except the trigger element, which will only be shown for the record that is currently selected in the Data Model pane.

    This could lead to a problem with some steps that use an XPath with absolute indexes, such as a location-based Extract step. Using a dynamic index in the XPath will fix the problem. For example, in the case of a location-based Extract step, switch to extracting the data via JavaScript (see Expression-based field); in the JavaScript expression, replace the index of the element's parent in the XPath with record.index.

Note: The information contained in all of the selected parent nodes will be copied for each instance of that node. For example, if a client node contains multiple invoice nodes, the information for the client node can be duplicated for each invoice.

The DataMapper only extracts elements for which at least one value or attribute value is defined in the file.

JSON File Input Data settings

For a JSON file you can either use the object or array at the root and get one output record, or select an object or array as parent element. Its direct child elements - objects and arrays, not key-value pairs - can be output as individual records.

  • Use root element: Selects the top-level array or object. There will only be one record.

  • Use specific element: Select an array [ ] or object { } in the JSON data as Parent element to define its child elements - objects and/or arrays - as source records. Any elements outside the parent element and key-value pairs inside the parent will be repeated in each source record.

    Note: Only arrays and objects can be seen as a record. It is not possible to split the JSON between key-value pairs.