Regular Expressions

Use regular expressions to match patterns in your input documents. For example, match addresses, URLs, phone numbers, employee numbers, and so on.

For general information about the regular expressions that are allowed with AutoClassifier, see Microsoft Regular Expression information.

Writing Regular Expression Rules

A rule that specifies a regular expression must have a REGEX prefix:

  • regex(rule)
  • REGEX(rule)
  • regex("rule")
  • REGEX("rule")

The following REGEX rule matches a URL such as https://www.youtube.com/:

REGEX("\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))")

You can also combine a REGEX rule with a Boolean operator, for example:

REGEX(rule) OR "term"

Specifying Regular Expression Special Characters

You can write rules using special characters within your regular expression syntax.

The following list describes each special character and gives an example:

  • *: Matches the preceding character or sub-expression, zero or more times:

regex(cats*)

Test results include documents with cat, cats, and catss.

  • +: Matches the preceding character or sub-expression, one or more times:

regex(cats+)

Test results include documents with cats and catss, but not cat.

  • ?: Matches the preceding character or sub-expression, zero or one time:

regex(cats?)

Test results include documents with cat and cats, but not catss.

The ? symbol applies only to the character or subexpression immediately before it. The rest of the pattern is still required: if the remaining characters (cat in this example) do not appear in the input document, the document is not matched.
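The difference between the three quantifiers above can be checked with a short sketch. It uses Python's re module as a stand-in for the AutoClassifier engine, which is an assumption: the two engines may differ in details. The helper name whole_word_matches is hypothetical, and each word is compared in full so that cats? is not reported as matching the first four letters of catss.

```python
import re

# Python's re module stands in for the AutoClassifier engine (an assumption).
WORDS = ["cat", "cats", "catss"]

def whole_word_matches(pattern, words=WORDS):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

star = whole_word_matches(r"cats*")      # s repeated zero or more times
plus = whole_word_matches(r"cats+")      # s repeated one or more times
optional = whole_word_matches(r"cats?")  # s present zero or one time

print(star)      # ['cat', 'cats', 'catss']
print(plus)      # ['cats', 'catss']
print(optional)  # ['cat', 'cats']
```

In all three cases the leading cat is required; only the final s is affected by the quantifier.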

  • ^: Matches the position at the start of the searched string. If the m (multi-line search) character is included with the flags, ^ also matches the position following \n or \r:

regex(^\d{3})

Test results include documents with three numeric digits at the start of the searched string.

  • ^: When used as the first character in a bracket expression, ^ negates the character set:

regex([^abc])

Test results include documents with any character except a, b, and c.

  • $: Matches the position at the end of the searched string. If the m (multi-line search) character is included with the flags, $ also matches the position before \n or \r:

regex(\d{3}$)

Test results match three numeric digits at the end of the searched string.
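The two roles of ^ and the effect of the multi-line flag can be illustrated with the following sketch. It assumes Python's re module as a stand-in for the AutoClassifier engine; the sample text is invented for illustration. Note that Python's re.MULTILINE treats only \n as a line break, so behavior around \r may differ from the description above.

```python
import re

# Hypothetical sample text with three lines and no trailing newline.
text = "123 first line\nsecond line 456\n789 third line 012"

# Without the multi-line flag, ^ and $ anchor to the whole string.
start_single = re.findall(r"^\d{3}", text)
end_single = re.findall(r"\d{3}$", text)

# With re.MULTILINE, ^ and $ also anchor at each line break.
start_multi = re.findall(r"^\d{3}", text, flags=re.MULTILINE)
end_multi = re.findall(r"\d{3}$", text, flags=re.MULTILINE)

# Inside a bracket expression, a leading ^ negates the set instead.
negated = re.findall(r"[^abc]", "abcde")

print(start_single, end_single)  # ['123'] ['012']
print(start_multi, end_multi)    # ['123', '789'] ['456', '012']
print(negated)                   # ['d', 'e']
```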

  • .: Matches any single character except the newline character \n. To match any character including the \n, specify a pattern such as [\s\S]:

regex(a.c)

Test results match abc, a1c, and a-c.

  • []: Mark the start and end of a bracket expression:

regex([1-4])

Test results match 1, 2, 3, or 4.

regex([^aAeEiIoOuU])

Test results match any non-vowel character.

  • {}: Mark the start and end of a quantifier expression:

regex(a{2,3})

Test results match aa and aaa.

  • (): Mark the start and end of a subexpression:

regex(A(\d))

Test results match A0 to A9.

  • |: Indicates a match on a choice between two or more terms:

regex(z|food)

Test results match z or food.

  • |: Indicates a match on a choice between two or more letters:

regex((z|f)ood)

Test results match zood or food.
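The wildcard, bracket, quantifier, subexpression, and alternation examples above can be tried out together in a short sketch. It assumes Python's re module as a stand-in for the AutoClassifier engine, and the helper name matches is hypothetical. Each candidate is compared in full, so the lists show exactly which strings satisfy the whole pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# . matches any single character except the newline character.
dot = matches(r"a.c", ["abc", "a1c", "a-c", "ac", "a\nc"])

# [] defines a set or range of allowed characters.
digit_range = matches(r"[1-4]", ["0", "1", "4", "5"])

# {} bounds how many times the preceding item may repeat.
repeat = matches(r"a{2,3}", ["a", "aa", "aaa", "aaaa"])

# () groups a subexpression; the grouped text can be read back.
captured = re.fullmatch(r"A(\d)", "A7").group(1)

# | applies to everything on each side unless parentheses limit its scope.
top_level = matches(r"z|food", ["z", "food", "zood"])
grouped = matches(r"(z|f)ood", ["z", "food", "zood"])

print(dot)          # ['abc', 'a1c', 'a-c']
print(digit_range)  # ['1', '4']
print(repeat)       # ['aa', 'aaa']
print(captured)     # 7
print(top_level)    # ['z', 'food']
print(grouped)      # ['food', 'zood']
```

The last two results show why the parentheses matter: z|food chooses between the whole terms z and food, while (z|f)ood chooses only the first letter.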

  • Combine characters in a rule:

regex(\d{3})

This rule matches a document with three digits in a row such as 123 or 594.

  • Combine several expressions in a single rule:

regex("^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$")

This rule matches an email address such as myemail@google.com.
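The email rule can be examined with the following sketch, which assumes Python's re module as a stand-in for the AutoClassifier engine. The character classes in the rule list only uppercase letters, so a lowercase address such as myemail@google.com matches only when matching is case-insensitive. Whether AutoClassifier ignores case by default is an assumption here; the helper name is_email and its ignore_case parameter are hypothetical.

```python
import re

# The pattern from the rule above, unchanged.
EMAIL = r"^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"

def is_email(text, ignore_case=True):
    """Check whether the whole text is an address accepted by the pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(EMAIL, text, flags) is not None

print(is_email("myemail@google.com"))                     # True
print(is_email("myemail@google.com", ignore_case=False))  # False
print(is_email("MYEMAIL@GOOGLE.COM", ignore_case=False))  # True

# ^ and $ require the address to span the entire searched string.
print(is_email("contact myemail@google.com today"))       # False

# {2,4} limits the final label to two to four letters.
print(is_email("user@example.museum"))                    # False
```

Because of the ^ and $ anchors, this rule matches only when the searched string consists of the address alone; removing the anchors would let it find an address inside longer text.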