Implementing Regular Expressions in Extractor Scripts

Using Regular Expressions (Regex) in File Extractor requires the RE and SetRE instructions. The former defines a Regex and the latter makes use of it. Other instructions are optional.

The RE Script Instruction

The RE <name> <expression> instruction declares a Regex in the File Extractor script, as shown in the following example:

RE MyRegex1    \s* (?<protocol> TCP|UDP) \s+ (?<localnam> \S+) : (?<localpor> \S+)

The <name> may contain only alphanumeric characters and underscores; whitespace is not allowed. The <expression>, by contrast, can contain whitespace, which is ignored and can be used to make the expression more legible.
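The effect of ignored whitespace and named captures can be tried outside Extractor. The following is a minimal sketch in Python, not Extractor code: it assumes that Python's re.VERBOSE flag is a fair stand-in for Extractor's whitespace handling, and the input line is made up for illustration. Note that Python spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# re.VERBOSE makes Python ignore whitespace inside the pattern, comparable to
# the whitespace handling described above (an assumption of this sketch).
MY_REGEX1 = re.compile(
    r"\s* (?P<protocol> TCP|UDP) \s+ (?P<localnam> \S+) : (?P<localpor> \S+)",
    re.VERBOSE,
)

# Hypothetical input line, invented for this example.
line = "  TCP    192.168.0.10:49152    10.0.0.5:443    ESTABLISHED"

match = MY_REGEX1.match(line)
fields = match.groupdict() if match else {}
print(fields)
```

The spaces in the pattern play no part in matching; only the \s tokens consume whitespace from the input.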

An invalid Regex will trigger a script parsing error. The error message will contain a reference to the script line number and the <name>. In addition, the message will also specify why the Regex is invalid and provide an offset inside the <expression> which caused the Regex compilation to fail. As an example, for \S+\s+(\S+ the error message will state 'unmatched ( at offset 7'.
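As a rough illustration of this kind of diagnostic, the sketch below compiles an expression in Python and reports the reason and offset of a failure. The check_expression helper is hypothetical, and Python's regex engine words its messages and counts offsets in its own way, so the text it prints should not be expected to match the Extractor message quoted above.

```python
import re

def check_expression(name, expression):
    """Return None if the expression compiles, else a diagnostic string."""
    try:
        re.compile(expression, re.VERBOSE)
    except re.error as exc:
        # exc.msg holds the reason, exc.pos the offset into the expression.
        return f"{name}: {exc.msg} at offset {exc.pos}"
    return None

# The invalid expression from the text: the opening parenthesis is never closed.
diagnostic = check_expression("MyRegex1", r"\S+\s+(\S+")
print(diagnostic)
```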

There is a limit of 32 regular expressions per script. All expressions are kept in compiled form and are cached inside an Extractor-wide cache to ensure that if the same Regex is used in more than one script then there is no duplication.
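One way to picture the per-script limit and the shared cache is sketched below in Python. This is an assumption-laden model rather than Extractor's implementation: the Script class, the declare method and the choice of the expression text as the cache key are all hypothetical.

```python
import re

MAX_EXPRESSIONS_PER_SCRIPT = 32   # limit stated in the text
_shared_cache = {}                # expression text -> compiled pattern (assumed key)

class Script:
    """Hypothetical model of one script's RE declarations."""

    def __init__(self):
        self.expressions = {}     # name -> compiled pattern

    def declare(self, name, expression):
        is_new = name not in self.expressions
        if is_new and len(self.expressions) >= MAX_EXPRESSIONS_PER_SCRIPT:
            raise ValueError("more than 32 regular expressions in one script")
        compiled = _shared_cache.get(expression)
        if compiled is None:
            # Compile once; later scripts reuse the same compiled object.
            compiled = re.compile(expression, re.VERBOSE | re.IGNORECASE)
            _shared_cache[expression] = compiled
        self.expressions[name] = compiled
        return compiled

first = Script().declare("MyRegex1", r"\S+ \s+")
second = Script().declare("AnotherName", r"\S+ \s+")
print(first is second)
```

Two scripts that declare the same expression under different names end up sharing a single compiled pattern.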

The SetRE Script Instruction

The SetRE <name> instruction applies a previously defined Regex to the Extractor input buffer and sets an arbitrary mixture of fields and extractor variables (the names of both are case-insensitive) according to the data captured by the Regex. For example:

SetRE MyRegex1

The instruction sets the intrinsic regexrc variable to 0 if the Regex does not match, or to the count of matched characters if a match occurred. The following complete, working script makes use of this variable, which does not have to be declared.

Record NETSTAT
FileType Text
RE MyRegex1 (?<PROTOCOL>TCP) \s+ (?<LOCALNAM>\S+) : (?<LOCALPOR>\S+) \s+ (?<FOREIGNN>\S+) : (?<FOREIGNP>\S+)
Label NewLine
SetRE MyRegex1
If regexrc = 0 NewLine
DeliverRecord
Goto NewLine
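The control flow of the script above can be approximated in Python as follows. This is a sketch under stated assumptions: the set_re function is a hypothetical stand-in for SetRE, it is assumed to search anywhere in the line, and the input lines are invented for illustration.

```python
import re

MY_REGEX1 = re.compile(
    r"""(?P<PROTOCOL>TCP) \s+ (?P<LOCALNAM>\S+) : (?P<LOCALPOR>\S+)
        \s+ (?P<FOREIGNN>\S+) : (?P<FOREIGNP>\S+)""",
    re.VERBOSE | re.IGNORECASE,
)

def set_re(pattern, buffer, record):
    """Fill record from named captures; return a regexrc-like value."""
    match = pattern.search(buffer)
    if match is None:
        return 0                      # no match
    for name, value in match.groupdict().items():
        if value is not None:
            record[name.upper()] = value
    return len(match.group(0))        # count of matched characters

# Hypothetical netstat-style input.
lines = [
    "Proto  Local Address      Foreign Address    State",
    "TCP    192.168.0.10:49152 10.0.0.5:443       ESTABLISHED",
    "UDP    0.0.0.0:5353       *:*",
]

records = []
for line in lines:                    # each iteration corresponds to Label NewLine
    record = {}
    if set_re(MY_REGEX1, line, record) == 0:
        continue                      # If regexrc = 0 NewLine
    records.append(record)            # DeliverRecord

print(records)
```

Only the TCP line produces a record; the header and UDP lines leave the return value at 0 and are skipped.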

The ConsumeRE Script Instruction

The SetRE instruction is string-oriented: it reads the Extractor input buffer and ignores delimiters (although the EOL handler is still called). It consumes the whole input buffer and advances the current position to the end of the buffer regardless of how much data was captured by the regular expression. The whole buffer is consumed in the same way even if there was no match, so data of no interest is skipped. If SetRE finds the buffer empty, it reads the next line. This allows the script to move through the input data and read subsequent lines without any explicit coding effort.

However, this means that no more than one SetRE instruction can be applied to the same input data. A Record can contain many fields, and capturing data for all of them in a single Regex would result in very lengthy expressions that are difficult to maintain and understand. In practice, more than one SetRE instruction often needs to be applied to the same data.

For this reason, Extractor has the ConsumeRE [All | None | Match] instruction. The default is ALL, which causes the whole-buffer consuming behavior described above. NONE makes the SetRE instruction 'still', so that it does not advance Extractor's current position to the end of the buffer. The position remains unaffected regardless of whether there was a match and how much data was captured by the Regex. Finally, MATCH makes Extractor advance the current position by the number of characters matched.
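The three modes can be compared with a small Python model. This is only a sketch: apply_set_re is a hypothetical function, the buffer content is invented, and MATCH is modeled literally as "add the number of matched characters to the current position", using a match that begins at the current position.

```python
import re

def apply_set_re(pattern, buffer, position, consume_mode):
    """Return (regexrc, new_position) for one SetRE-like application."""
    match = pattern.search(buffer, position)
    regexrc = len(match.group(0)) if match else 0
    mode = consume_mode.upper()
    if mode == "ALL":
        new_position = len(buffer)          # whole buffer consumed, match or not
    elif mode == "NONE":
        new_position = position             # position left untouched
    elif mode == "MATCH":
        new_position = position + regexrc   # advance by matched characters
    else:
        raise ValueError(f"unknown ConsumeRE mode: {consume_mode}")
    return regexrc, new_position

pattern = re.compile(r"(?P<PROTOCOL>TCP)\s+")
buffer = "TCP    192.168.0.10:49152"        # hypothetical buffer content

results = {mode: apply_set_re(pattern, buffer, 0, mode)
           for mode in ("All", "None", "Match")}
print(results)
```

In every mode the regexrc-like value is the same; only the resulting position differs.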

The following example rewrites the previous sample script and additionally captures the last column of the NETSTAT output into the STATE field. Data from the fourth column is needed. The Regex for it should match any non-whitespace character repeated one or more times \S+ followed by some whitespace \s+. This combination has to be repeated three times {3} in order to skip three columns, which results in (\S+ \s+) {3}. After that, the subsequent group of non-whitespace characters has to be captured into the STATE field (?<STATE> \S+). The rewritten script is shown below.

Record NETSTAT
FileType Text
ConsumeRE None
RE MyRegex1 (?<PROTOCOL>TCP) \s+ (?<LOCALNAM>\S+) : (?<LOCALPOR>\S+) \s+ (?<FOREIGNN>\S+) : (?<FOREIGNP>\S+)
RE MyRegex2 (\S+ \s+) {3} (?<STATE> \S+)
Label NewLine
SetRE MyRegex2
SetRE MyRegex1
SkipLines 1
If regexrc = 0 NewLine
DeliverRecord
Goto NewLine

  • In a production script, depending on the input data, it may be necessary to check the regexrc variable after each SetRE invocation. In this example MyRegex1 is applied after MyRegex2, so the regexrc value set by MyRegex2 is overwritten.

  • An alternative (and better) way to write MyRegex2 is to capture a group of characters requiring that only optional whitespace \s* and nothing else can be found between this group and the EOL which is denoted in Regex by the dollar sign:

RE MyRegex2 (?<STATE> [A-Z_]+) \s* $

This version applies more rigid checking: instead of the non-whitespace class \S, it uses a character class of upper-case letters and the underscore.
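The difference between the two versions of MyRegex2 can be seen on a line whose fourth column is not a state. The Python sketch below uses invented input lines and compiles both patterns case-sensitively, which is an assumption of the sketch: with case-insensitive matching the class [A-Z_] would also accept lower-case letters.

```python
import re

# Column-skipping version: skip three columns, capture the fourth.
SKIP_COLUMNS = re.compile(r"(?: \S+ \s+ ){3} (?P<STATE> \S+ )", re.VERBOSE)
# End-anchored version: upper-case letters and underscores before end of line.
END_ANCHORED = re.compile(r"(?P<STATE> [A-Z_]+ ) \s* $", re.VERBOSE)

def capture_state(pattern, line):
    match = pattern.search(line)
    return match.group("STATE") if match else None

# Hypothetical netstat-style lines.
data_line = "TCP    192.168.0.10:49152    10.0.0.5:443    ESTABLISHED"
header_line = "Proto  Local Address      Foreign Address    State"

for line in (data_line, header_line):
    print(capture_state(SKIP_COLUMNS, line), capture_state(END_ANCHORED, line))
```

Both patterns agree on the data line, but on the header line the column-skipping pattern picks up an unrelated word while the end-anchored pattern does not match at all.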

The UseCapture Script Instruction

The SetRE instruction uses captures named after fields and variables to set them according to the data captured from the input buffer. A capture whose name matches neither a field nor a variable is handled differently: Extractor stores it, and it then becomes eligible for use in the UseCapture <name> instruction.

The UseCapture instruction sets the Extractor input buffer from the stored capture <name>. The buffer is emptied (the data it contains is discarded) and then the data previously captured by the SetRE instruction is copied into it.

This instruction can serve as a bridge between the RE/SetRE instructions and the rest of the Extractor scripting functionality. For instance, the SetRE instruction may be used to capture timestamp data, and then UseCapture followed by Set <field> using "YYYY-MM-DD hh:mm:ss" will set the timestamp <field> according to the specified time format.

If there is no capture called <name> or data for it has not been captured yet then the input buffer is still emptied. However, no data is copied into it. Attempts to use a capture named after a field or variable will trigger a script parsing error.
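The bridging role of UseCapture can be modeled in Python as shown below. Everything here is a hypothetical sketch: the Extractor class, the HOST field, the stamp capture and the input line are invented, and Python's strptime stands in for the Set <field> using step.

```python
import re
from datetime import datetime

FIELD_NAMES = {"HOST"}               # hypothetical record fields

class Extractor:
    """Toy model of the input buffer, record fields and stored captures."""

    def __init__(self, buffer):
        self.buffer = buffer
        self.record = {}
        self.captures = {}           # captures not named after a field

    def set_re(self, pattern):
        match = pattern.search(self.buffer)
        if match is None:
            return 0
        for name, value in match.groupdict().items():
            if name.upper() in FIELD_NAMES:
                self.record[name.upper()] = value   # sets the field
            else:
                self.captures[name] = value         # stored for UseCapture
        return len(match.group(0))

    def use_capture(self, name):
        if name.upper() in FIELD_NAMES:
            raise ValueError(f"{name} is named after a field")
        # The buffer is always emptied; data is copied only if it exists.
        self.buffer = self.captures.get(name) or ""

pattern = re.compile(
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<host>\S+)"
)
extractor = Extractor("2024-01-15 08:30:00 web01 started")
extractor.set_re(pattern)
extractor.use_capture("stamp")
timestamp = datetime.strptime(extractor.buffer, "%Y-%m-%d %H:%M:%S")
print(extractor.record, timestamp)
```

Asking for a capture that was never stored leaves the buffer empty, which mirrors the behavior described above.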

The RestoreBuffer Script Instruction

The UseCapture instruction overwrites the Extractor input buffer with a named capture. It is best used once processing of the input buffer has finished, so that the loss of its content is unimportant. If this is not the case, the RestoreBuffer instruction can be used to restore the buffer (both content and position) to the state that existed before the last UseCapture. If the RestoreBuffer instruction is used before any UseCapture invocation, the input buffer will be emptied and no data will be copied into it.
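The save-and-restore behavior can be sketched in Python as follows. The InputBuffer class is hypothetical, and resetting the position to 0 after UseCapture is an assumption of this sketch that the text does not state.

```python
class InputBuffer:
    """Toy model of the buffer state saved by UseCapture."""

    def __init__(self, content, position=0):
        self.content = content
        self.position = position
        self._saved = None                    # state before the last UseCapture

    def use_capture(self, captured):
        self._saved = (self.content, self.position)
        self.content = captured or ""
        self.position = 0                     # assumed: start of the new content

    def restore_buffer(self):
        if self._saved is None:
            # No UseCapture yet: the buffer is emptied and nothing is copied.
            self.content, self.position = "", 0
        else:
            self.content, self.position = self._saved

buffer = InputBuffer("2024-01-15 08:30:00 web01 started", position=20)
buffer.use_capture("2024-01-15 08:30:00")
buffer.restore_buffer()
print(buffer.content, buffer.position)
```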

Changing Regex Options

Extractor uses two Regex options: one causes whitespace inside expressions to be ignored and the other makes matching case-insensitive. Both options can be changed using the following script instructions:

IgnoreREWhitespace {On|Off}

IgnoreCaseInRE {On|Off}

If ON and OFF are omitted then ON is assumed. Each instruction can be used only once in a given script. Both apply to subsequent RE instructions and have no effect on preceding ones. Regardless of whether data matching is case-sensitive or case-insensitive, the names of fields and variables are always case-insensitive.
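The practical effect of the two options can be tried with Python's regex flags. This is a sketch that assumes re.VERBOSE and re.IGNORECASE are reasonable stand-ins for IgnoreREWhitespace and IgnoreCaseInRE; the compile_expression helper and its parameter names are hypothetical.

```python
import re

def compile_expression(expression, ignore_whitespace=True, ignore_case=True):
    """Compile with flags that mimic the two options (both on by default)."""
    flags = 0
    if ignore_whitespace:
        flags |= re.VERBOSE       # whitespace in the expression is ignored
    if ignore_case:
        flags |= re.IGNORECASE    # data is matched case-insensitively
    return re.compile(expression, flags)

both_on = compile_expression(r"(?P<PROTOCOL> TCP ) \s+")
case_sensitive = compile_expression(r"(?P<PROTOCOL>TCP)\s+", ignore_case=False)
literal_space = compile_expression(r"TCP port", ignore_whitespace=False)
space_ignored = compile_expression(r"TCP port")

print(bool(both_on.search("tcp   1")))
print(bool(case_sensitive.search("tcp   1")))
print(bool(literal_space.search("TCP port")))
print(bool(space_ignored.search("TCP port")))
```

With whitespace ignored, a literal space in the expression no longer matches a space in the data, so \s has to be used instead.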

Notes

  • In MATCH mode the current position is advanced by the count of matched (not captured) characters. For example, if the Regex matched 20 characters but captured only 5, the position is advanced by 20.

  • In the RE instruction, the <expression> can be surrounded by optional forward slashes:

RE <name> / <expression> /

These are ignored. If only one slash is found, it is considered to be part of the <expression>. The same applies to any forward slashes found inside the <expression>; these do not have to be escaped. The forward slash is just an example: any character may be used in its place, and a pair of brackets (curly, angle or square) is supported as well. A pair of parentheses, however, is not.

  • The potential danger with ConsumeRE None|Match is that if SkipLines is overlooked or misplaced the Extractor will stall. In NONE and MATCH modes the SetRE instruction does not read the next line when the input buffer is empty, so this has to be done explicitly using one of the SkipXXX instructions. As an example of a misplaced SkipLines consider the following incorrect script.

ConsumeRE none
...
Label NewLine
SetRE MyRegex1
If regexrc = 0 NewLine
SetRE MyRegex2
SkipLines 1
...

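This script will stall as soon as MyRegex1 does not match: the If instruction jumps back to NewLine before SkipLines is reached, so the same unchanged buffer is tested again and again.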
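The stall can be reproduced with a small Python model of the loop. This is a sketch only: run_loop is a hypothetical function, the input lines are invented, and a step limit is added so that the stalled variant terminates instead of looping forever.

```python
import re

MY_REGEX1 = re.compile(r"TCP")

def run_loop(lines, skip_before_check, max_steps=50):
    """Model the NewLine loop in ConsumeRE None mode."""
    index, delivered, steps = 0, [], 0
    while index < len(lines):
        steps += 1
        if steps > max_steps:
            return delivered, "stalled"
        line = lines[index]
        matched = MY_REGEX1.search(line) is not None   # SetRE, position unchanged
        if skip_before_check:
            index += 1               # SkipLines 1 placed before the If
            if not matched:
                continue             # If regexrc = 0 NewLine
            delivered.append(line)
        else:
            if not matched:
                continue             # jumps back without ever skipping the line
            delivered.append(line)
            index += 1               # SkipLines 1 placed after the If
    return delivered, "finished"

# Hypothetical input: the first line never matches MyRegex1.
lines = ["Active Connections", "TCP 192.168.0.10:49152 10.0.0.5:443 ESTABLISHED"]

print(run_loop(lines, skip_before_check=False)[1])
print(run_loop(lines, skip_before_check=True)[1])
```

Placing the skip before the check guarantees that every pass through the loop moves to a new line, whether or not the expression matched.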
