Data source settings

After opening a data file you have to make a number of settings to make sure that the source data is interpreted and grouped the way you want. These settings are found on the Settings pane at the left.

Input data settings (Delimiters)

The Input Data settings (on the Settings pane at the left) specify how the input data must be interpreted. These settings are different for each data type. For a CSV file, for example, it is important to specify the delimiter that separates data fields. PDF files are already delimited naturally by page, so the only input data settings for PDF files are instructions on how to parse the text on each page.
For an overview of all options, see: Input Data.

Settings for a CSV File

In a CSV file, data is read line by line, where each line may contain multiple fields, separated by a delimiter. Even though CSV stands for comma-separated values, fields may be separated using any character, including commas, tabs, semicolons, and pipes.
The text delimiter is used to wrap around each field just in case the field values contain the field separator. This ensures that, for example, the field “Smith; John” is not interpreted as two fields, even if the field delimiter is the semicolon.
For an explanation of all the options, see: CSV file Input Data settings.

Settings for an Excel File

For an Excel file you have to specify which sheet to use. You can also set how many lines should be skipped, if the first row contains field names or not, and how the data should be sorted. See: Excel file Input Data settings.

Excel has its own way to display dates. You can specify if the Data Viewer should display dates just as Excel does, or not. That's important because extracting a Date value will only be successful if the expected date format matches the actual format of a date in the Data Viewer.
This Editor Data Format setting is found on the Settings pane; see: Editor Data Format.
The expected date format is set somewhere else; see: Default Data Format.

Settings for a PDF File

PDF files have a clear and unmovable delimiter: pages. So, the Input Data settings are not used to set delimiters. Instead, these options determine how words, lines and paragraphs are detected when you select content in the PDF to extract data from it.
For an explanation of all the options, see: PDF file Input Data settings.

Settings for a database

Databases all return the same type of information. Therefore the Input Data options for a database refer to the tables inside the database. Clicking on any of the tables shows the first line of the data in that table.
If the database supports stored procedures, including inner joins, grouping and sorting, you can use custom SQL to make a selection from the database, using whatever language the database supports. The query may contain variables and properties, so that the selection will be dynamically adjusted each time the data mapping configuration is actually used in a Workflow process; see Using variables and properties in an SQL query.
For an explanation of all the options, see: Database Input Data settings.

Settings for a text file

Because text files have many different shapes and sizes, there are a lot of input data settings for these files. You can add or remove characters in lines if it has a header you want to get rid of, or unwanted characters at the beginning of your file, for example; you can also set a line width if you are still working with old line printer data; etc.
It is important that pages be defined properly. This can be done either by using a set number of lines or using a string of text (for example, the character “P”), to detect on the page. Be aware that this is not a Boundary setting; it detects each new page, not each new record.
For an explanation of all the options, see: Text file Input Data settings.

Settings for an XML file

XML is a special file format because these file types can have a theoretically unlimited number of structure types. The input data has three options that basically determine at which node level a new record is created. You can:

  • Select an element type to create a new delimiter every time that element is encountered.

  • Enter an XPath to create a delimiter based on the node name of elements. The Show all elements option allows you to extract information from all elements, even when the delimiter is set to a specific lower-level element.

  • Use the root node. If there is only one top-level element, there will only be one record before the Boundaries are set.

See also: XML File Input Data settings.

Note: The DataMapper only extracts elements for which at least one value is defined in the file.

Settings for a JSON file

For JSON files there are two input data options that determine where a new source record starts.
You can either use the object or array at the root and get one output record, or select an object or array as parent element. Its direct child elements - objects and arrays, not key-value pairs - are then seen as individual source records. Any elements at the same level as the parent element or at a higher level are repeated in each source record.

See also: JSON File Input Data settings.

Record boundaries

Boundaries are the division between records: they define where one record ends and the next record begins. Using boundaries, you can organize the data the way you want.
You could use the exact same data source with different boundaries in order to extract different information. If, for instance, a PDF file contains multiple invoices, each invoice could be a record, or all invoices for one customer could go into a single record.
Keep in mind that when the data is merged with a template, each record generates output (print, email, web page) for a single recipient.

To set a boundary, a specific trigger must be defined.
The trigger can be a natural delimiter between blocks of data, such as a row in a CSV file or a page in a PDF file.
It can also be something in the data that is either static (for example, the text "Page 1 of" in a PDF file) or changing (a customer ID, a user name, etc).
To define a more complex trigger, you can write a script (see Setting boundaries using JavaScript). This option is not available with XML and JSON files.
A new record cannot start in the middle of a data field, so if the trigger is something in the data, the boundary will be set on the nearest preceding natural delimiter. If for instance in a PDF file the text "Page 1 of" is used as the trigger, the new record starts at the page break before that text.

For an explanation of all Boundaries options per file type, see Boundaries.

Data format settings

By default the data type of extracted data is a HTMLString, but each field in the Data Model can be set to contain another data type (see Data types). When that data type is Date, Number or Currency, the DataMapper will expect the data in the data source to be formatted in a certain way, depending on the settings.

The default format for dates, numbers and currencies can be set in three places: in the user preferences, in the data source settings, and per field in the Data Model.

By default, the user preferences are set to the system preferences. These user preferences become the default format values for any newly created data mapping configuration. To change these preferences, select Window > Preferences > DataMapper > DataMapper default format (see DataMapper preferences).

Data format settings defined for a data source apply to any new extraction made in the current data mapping configuration. These settings are made on the Settings pane; see Settings pane.

Settings for a field that contains extracted data are made via the properties of the Extract step that the field belongs to (see Setting the data type). Any format settings specified per field are always used, regardless of the user preferences or data source settings.

Note: Data format settings tell the DataMapper how to read and parse data from the data source. They don't determine how these data are formatted in the Data Model or in a template. In the Data Model, data are converted to the native data type. Dates, for example, are converted to a DateTime object. How they are displayed in the Data Model depends on the preferences (see Default Format).