Using Regular Expressions in Extractor Scripts

A regular expression, or Regex for short, is a special text string that describes a search pattern. The pattern provides a means to search for strings of text, such as particular characters, words, or patterns of characters. POSIX-compatible regular expressions are placed into Extractor scripts and are interpreted by the File Extractor.

In Regex, \s means any whitespace character, e.g. a space or tab. \S means the opposite, that is, any character which is not whitespace. The + sign means one or more repetitions. So \S+ \s+ \S+ means a group of one or more non-whitespace characters (e.g. a word), followed by some whitespace, followed by another word.

For example, such a Regex will match the string hello world and, when asked what has been matched, will return the whole string. However, it might be required to return only the first word instead of the whole string, while still matching both words to ensure the string is not malformed. In such cases, parentheses can be used: (\S+) \s+ \S+ asks Regex not only for the whole match but also for what is known as the first 'capture', so Regex will then return the first word. If (\S+) \s+ (\S+) is used, the second 'capture' can be asked for as well.
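The behaviour described above can be tried outside the File Extractor. The following is a minimal sketch in Python, not the File Extractor itself; it assumes Python's re module with the re.VERBOSE flag, which makes Python ignore the spaces written inside the pattern. The variable names are illustrative.

```python
import re

text = "hello world"

# No parentheses: only the whole match is available.
whole = re.match(r"\S+ \s+ \S+", text, re.VERBOSE)
print(whole.group(0))        # the whole matched string

# One pair of parentheses: the first word becomes capture 1,
# while the second word must still be present for a match.
first_only = re.match(r"(\S+) \s+ \S+", text, re.VERBOSE)
print(first_only.group(1))   # the first word only

# Two pairs of parentheses: both words are captured.
both = re.match(r"(\S+) \s+ (\S+)", text, re.VERBOSE)
print(both.group(1), both.group(2))
```

A string with only one word, such as "hello", would not match any of the three patterns, which is how the second word acts as a check against malformed input.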

Dealing with captures by number may be inconvenient, especially in a complex Regex with nested parentheses. However, Regex allows each capture to be given a name. Named captures are referred to as 'named capture groups' or 'named subpatterns'. For example, (\S+) represents an unnamed capture and (?<captureone> \S+) turns it into a 'named capture group' called 'captureone'. The example Regex (\S+) \s+ (\S+) can now be enhanced to include named capture groups, for example: (?<first>\S+) \s+ (?<second>\S+). This is exactly what is required for Extractor. However, in the File Extractor, instead of arbitrary names like first or second, the field names of the Record that the Extractor is working with are used.
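Named capture groups can be sketched in Python as follows. Note the assumption: Python's re module spells a named group as (?P<name>...) rather than the (?<name>...) form used in this article, so the pattern below is a translation, not the exact text that would go into an Extractor script.

```python
import re

# Same pattern as in the article, with Python's (?P<name>...) spelling.
pattern = re.compile(r"(?P<first>\S+) \s+ (?P<second>\S+)", re.VERBOSE)

match = pattern.match("hello world")

# Captures can be requested by name instead of by number.
print(match.group("first"))
print(match.group("second"))

# groupdict() returns every named capture as a name -> value mapping.
named = match.groupdict()
print(named)
```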

Windows Example:

This example will show a simple Regex that is used to parse a NETSTAT command output. The output is:

TCP user-1:1085 file-srv.abc.local:netbios-ssn TIME_WAIT

The Regex to parse it and capture data for the first 3 fields looks as follows:

\s* (?<protocol> TCP|UDP) \s+ (?<localnam> \S+) : (?<localpor> \S+)

\s*

The asterisk means 0 or more repetitions so this matches optional whitespace before the data starts

(?<protocol>TCP|UDP)

Captures either TCP or UDP into the PROTOCOL field.

\s+

Matches compulsory whitespace (one or more) between 2 pieces of data

(?<localnam>\S+)

Captures the next word into the LOCALNAM field.

:

Colon matches itself - a colon which precedes the port number.

(?<localpor>\S+)

Captures the port into the LOCALPOR field.
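The NETSTAT pattern can be checked with a short Python sketch. This is an illustration under assumptions, not File Extractor behaviour: named groups use Python's (?P<name>...) spelling, and the re.VERBOSE and re.IGNORECASE flags stand in for the whitespace-ignoring and case-insensitive settings described in the Notes. The variable names are illustrative.

```python
import re

netstat_line = "TCP user-1:1085 file-srv.abc.local:netbios-ssn TIME_WAIT"

pattern = re.compile(
    r"""
    \s*                      # optional leading whitespace
    (?P<protocol>TCP|UDP)    # protocol name
    \s+                      # compulsory whitespace
    (?P<localnam>\S+)        # local host name
    :                        # literal colon before the port
    (?P<localpor>\S+)        # local port
    """,
    re.VERBOSE | re.IGNORECASE,
)

match = pattern.match(netstat_line)
fields = match.groupdict()
print(fields)
```

Because \S+ also matches a colon, the engine first lets localnam run to the end of user-1:1085 and then backtracks until a colon is left over for the literal : in the pattern, which is why the name and the port end up in separate captures.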

HPE NonStop Example:

This example will show a simple Regex that is used to parse a PEEK POOL command output. The output is:

SYSPOOL 1648 1648 9493 1648 1648 0 0
EXTPOOL 1380 736 262143 1380 736 3 0
MAPPOOL 26001384 23904232 196589 25799610 23255134 29908 11092
FLEXPOOL 4717902 4455804 1048530 4481828 3870392 1153 247

The Regex to parse this output and capture data for the first 2 fields looks as follows:

\s* (?<IDFIELD> \S+ POOL ) \s+ (?<BIN64> \S+ )

\s*

The asterisk means 0 or more repetitions so this matches optional whitespace before the data starts

(?<IDFIELD>\S+ POOL)

Captures one or more non-whitespace characters (\S+) followed by “POOL” into the IDFIELD field.

\s+

Matches compulsory whitespace (one or more) between 2 pieces of data

(?<BIN64> \S+ )

Captures the next word (as a combination of non-whitespace characters) into the BIN64 field.
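The PEEK POOL pattern can be tried against the sample output in the same way. Again this is a sketch in Python rather than an Extractor script: it assumes Python's (?P<name>...) spelling for named groups and the re.VERBOSE flag in place of the File Extractor's whitespace-ignoring setting.

```python
import re

peek_output = """\
SYSPOOL 1648 1648 9493 1648 1648 0 0
EXTPOOL 1380 736 262143 1380 736 3 0
MAPPOOL 26001384 23904232 196589 25799610 23255134 29908 11092
FLEXPOOL 4717902 4455804 1048530 4481828 3870392 1153 247
"""

pattern = re.compile(
    r"""
    \s*                     # optional leading whitespace
    (?P<IDFIELD>\S+POOL)    # pool name ending in POOL
    \s+                     # compulsory whitespace
    (?P<BIN64>\S+)          # first numeric column
    """,
    re.VERBOSE,
)

rows = []
for line in peek_output.splitlines():
    match = pattern.match(line)
    if match:  # lines that do not fit the pattern are skipped
        rows.append((match.group("IDFIELD"), match.group("BIN64")))

print(rows)
```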

Notes

  • By default, Regex treats whitespace inside regular expressions literally, so a b means 'a' followed by a space followed by 'b'. However, File Extractor uses a special Regex setting which ignores whitespace inside expressions, to increase readability. Another setting makes Regex case-insensitive by default; this can be overridden for the whole expression or any part of it using special Regex syntax, e.g. (?i) and (?-i). Both settings can be changed using script instructions explained in the Extractor Script Instruction Entries.

  • An arbitrary mixture of fields and script variables can be used.

  • If a certain name does not belong to a field or variable then Extractor will store its data for optional use later. See Implementing Regular Expressions in Extractor Scripts for further details.
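The effect of the two settings in the first note can be sketched in Python. This is an analogy only: re.VERBOSE and re.IGNORECASE are Python's flags and are assumed here to behave like the File Extractor settings, and Python writes a scoped case-sensitivity override as (?-i:...) rather than a bare (?-i).

```python
import re

# Default behaviour: a space in the pattern is a literal space.
literal_hit = re.fullmatch(r"a b", "a b") is not None
literal_miss = re.fullmatch(r"a b", "ab") is not None

# Whitespace-ignoring mode: the same pattern now means 'a' followed by 'b'.
verbose_hit = re.fullmatch(r"a b", "ab", re.VERBOSE) is not None
verbose_miss = re.fullmatch(r"a b", "a b", re.VERBOSE) is not None

# Case-insensitive mode: tcp matches TCP.
flags = re.VERBOSE | re.IGNORECASE
insensitive_hit = re.fullmatch(r"tcp", "TCP", flags) is not None

# Case sensitivity switched back on for one part of the pattern only.
scoped_hit = re.fullmatch(r"(?-i:TCP) \s+ udp", "TCP UDP", flags) is not None
scoped_miss = re.fullmatch(r"(?-i:TCP) \s+ udp", "tcp UDP", flags) is not None

print(literal_hit, literal_miss, verbose_hit, verbose_miss)
print(insensitive_hit, scoped_hit, scoped_miss)
```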
