Regular expression syntax

The information in this article is based on the reference for the TRegExpr component, authored by Andrey V. Sorokin (anso@mail.ru).

Introduction

Regular expressions are a widely used way to describe patterns for searching text and for checking whether text conforms to a pattern. Special metasymbols let you specify, for example, that the substring you are looking for must be at the beginning of the input string, or that it must be repeated a certain number of times.

At first sight regular expressions look scary (and at second sight, even scarier ;) ).

However, you will very quickly appreciate all their power. They will save you many hours of unnecessary coding, and in some cases will work faster than hand-coded checks.

I urge you to «play» with the demo program TestRExp.dpr supplied with the distribution. It will help you understand how regular expressions work and debug your own expressions. In addition, TestRExp includes many example expressions.

Let's begin our meeting with regular expressions!

Simple comparison

Any symbol matches itself unless it is one of the special metasymbols described below.

A sequence of symbols matches the same sequence in the input string, so the pattern «bluh» matches the substring «bluh» in the input string. So far it's simple, isn't it?

If a metasymbol or an escape sequence must be interpreted as a plain symbol, prefix it with «\». For example, the metasymbol «^» normally matches the beginning of a string, but «\^» matches the symbol «^», «\\» matches «\», and so on.

Examples:

  • foobar - finds 'foobar'
  • \^FooBarPtr - finds '^FooBarPtr'
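
The two examples above can be tried out with a short sketch. It uses Python's standard re module as a stand-in, which is an assumption of this example: TRegExpr itself is a Delphi component, and its API is not shown here. The sample input strings are made up for illustration.

```python
import re

# A plain pattern matches the same sequence of symbols anywhere in the input.
plain = re.search(r"foobar", "xx foobar yy")

# A backslash turns the metasymbol ^ into an ordinary symbol.
escaped = re.search(r"\^FooBarPtr", "var p: ^FooBarPtr;")

# re.escape (a Python helper, not part of TRegExpr) escapes a whole literal.
literal = "a^b\\c"
auto = re.search(re.escape(literal), "see a^b\\c here")

print(plain.group(0), escaped.group(0), auto.group(0))
```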

Escape-sequences

Any symbol can be specified with an escape sequence, just as in C or Perl: «\n» means a new line, «\t» a tab, and so on. More generally, \xnn, where nn is a sequence of hexadecimal digits, means the symbol with ASCII code nn. To specify a two-byte (Unicode) symbol, use the format '\x{nnnn}', where 'nnnn' is one or more hexadecimal digits.

  • \xnn - symbol with hex code nn
  • \x{nnnn} - symbol with hex code nnnn (more than one byte can be set only in the Unicode mode)
  • \t - tab (HT/TAB), can also be \x09
  • \n - new line (NL), can also be \x0a
  • \r - carriage return (CR), can also be \x0d
  • \f - form feed (FF), can also be \x0c
  • \a - bell (BEL), can also be \x07
  • \e - escape (ESC), can also be \x1b

Examples:

  • foo\x20bar - finds 'foo bar' (note the space in the middle)
  • \tfoobar - finds 'foobar' preceded by a tab
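
A minimal sketch of the same escapes, again assuming Python's re module as a stand-in for TRegExpr. Only \xnn and \t are shown, because Python does not accept every escape from the table above (for example, it has no \e and no \x{nnnn} form).

```python
import re

# \x20 is the space symbol written by its hexadecimal code.
with_space = re.search(r"foo\x20bar", "say foo bar")

# \t in the pattern matches a tab in the input.
with_tab = re.search(r"\tfoobar", "col1\tfoobar")

# \x09 and \t denote the same symbol.
same_symbol = re.fullmatch(r"\x09", "\t")
```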

Lists of symbols

You can define a list by enclosing symbols in []. A list matches any single symbol listed in it.

If the first symbol of the list (immediately after «[») is «^», the list matches any symbol not listed in it.

Examples:

  • foob[aeiou]r - finds 'foobar', 'foober' etc. but not 'foobbr', 'foobcr' etc.
  • foob[^aeiou]r - finds 'foobbr', 'foobcr' etc. but not 'foobar', 'foober' etc.

Inside a list the symbol «-» specifies a range of symbols; for example, a-z represents all symbols between «a» and «z», inclusive.

To include the symbol «-» itself in a list, place it at the beginning or end of the list or prefix it with '\'. To include the symbol ']' in a list, place it at the beginning or prefix it with '\'.

Examples:

  • [-az] - 'a', 'z' and '-'
  • [az-] - 'a', 'z' and '-'
  • [a\-z] - 'a', 'z' and '-'
  • [a-z] - all 26 lowercase letters from 'a' to 'z'
  • [\n-\x0D] - #10, #11, #12, #13
  • [\d-t] - a digit, '-' or 't'
  • []-a] - any symbol in the range ']'..'a'
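
The list rules can be checked with a small sketch. Python's re module is assumed here as a stand-in for TRegExpr, and the word lists are invented for the example.

```python
import re

words = ["foobar", "foober", "foobbr", "foobcr"]

# [aeiou] accepts one of the listed symbols, [^aeiou] any symbol not listed.
vowel = [w for w in words if re.fullmatch(r"foob[aeiou]r", w)]
non_vowel = [w for w in words if re.fullmatch(r"foob[^aeiou]r", w)]

# A hyphen at the start of a list, or after a backslash, is a plain symbol.
hyphen_first = re.findall(r"[-az]", "a-m-z")
hyphen_escaped = re.findall(r"[a\-z]", "a-m-z")

# Between two symbols the hyphen denotes a range.
range_az = re.findall(r"[a-z]", "a-m-z")
```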

Metasymbols

Metasymbols are special symbols that are the essence of regular expressions. There are several types of metasymbols.

Metasymbols - string delimiters

  • ^ - beginning of the string
  • $ - end of the string
  • \A - beginning of the text
  • \Z - end of the text
  • . - any single symbol

Examples:

  • ^foobar - finds 'foobar' only if it is at the beginning of the string
  • foobar$ - finds 'foobar' only if it is at the end of the string
  • ^foobar$ - finds 'foobar' only if it is the only word in the string
  • foob.r - finds 'foobar', 'foobbr', 'foob1r' etc.

By default, the metasymbol «^» matches only at the beginning of the input text and the metasymbol «$» only at its end. Line separators inside the text are not matched by «^» and «$».

However, if you need to treat the text as multi-line, so that «^» matches after each line separator within the text and «$» before each one, turn on the modifier /m.

The metasymbols \A and \Z are similar to «^» and «$» but are not affected by the modifier /m, i.e. they always match only the beginning and the end of the entire input text.

By default the metasymbol «.» matches any symbol, but if you turn off the modifier /s, «.» no longer matches line separators.

TRegExpr interprets line separators as recommended by www.unicode.org:

«^» matches at the beginning of the input text and, if the modifier /m is on, also at the point immediately following \x0D\x0A, \x0A or \x0D (in the Unicode version of TRegExpr, also \x2028, \x2029, \x0B, \x0C or \x85). Note that it does not match inside the sequence \x0D\x0A.

«$» matches at the end of the input text and, if the modifier /m is on, also at the point immediately preceding \x0D\x0A, \x0A or \x0D (in the Unicode version of TRegExpr, also \x2028, \x2029, \x0B, \x0C or \x85). Note that it does not match inside the sequence \x0D\x0A.

«.» matches any symbol, but if the modifier /s is off, «.» does not match \x0D\x0A, \x0A or \x0D (in the Unicode version of TRegExpr, also \x2028, \x2029, \x0B, \x0C and \x85).

Note that «^.*$» (a pattern for an empty line) does not match the empty string inside the sequence \x0D\x0A, but does match the empty string inside the sequence \x0A\x0D.

You can reconfigure this behavior for multi-line text; see the description of the properties LineSeparators and LinePairedSeparator. For example, you can use only Unix line separators (\n), only DOS/Windows separators (\r\n), mixed separators (the default), or even define your own line separators.
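
The effect of the multi-line and single-line modes can be sketched as follows. This is an illustration in Python's re module, not TRegExpr: the flags re.MULTILINE and re.DOTALL are assumed here as rough counterparts of /m and /s, Python treats only \n as a line separator for «^» and «$», and in Python «.» skips \n unless DOTALL is set (the opposite default to the one described above).

```python
import re

text = "foobar\nbar foobar\nfoobar"

# Without the multi-line flag, ^ matches only at the start of the whole text.
single = [m.start() for m in re.finditer(r"^foobar", text)]

# With it, ^ also matches right after each line separator.
multi = [m.start() for m in re.finditer(r"^foobar", text, re.MULTILINE)]

# $ then matches before each line separator as well as at the end of the text.
line_ends = [m.start() for m in re.finditer(r"foobar$", text, re.MULTILINE)]

# \A is not affected by the multi-line flag.
text_start = [m.start() for m in re.finditer(r"\Afoobar", text, re.MULTILINE)]

# "." matches a line separator only when the single-line (DOTALL) flag is set.
dot_off = re.search(r"foo.bar", "foo\nbar")
dot_on = re.search(r"foo.bar", "foo\nbar", re.DOTALL)
```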

Metasymbols - standard lists of symbols

  • \w - alphanumeric symbol or «_»
  • \W - not \w
  • \d - digit
  • \D - not \d
  • \s - any «space» symbol (by default - [ \t\n\r\f])
  • \S - not \s

The standard lists \w, \d and \s may also be used inside symbol lists.

Examples:

  • foob\dr - finds 'foob1r', 'foob6r' etc. but not 'foobar', 'foobbr' etc.
  • foob[\w\s]r - finds 'foobar', 'foob r', 'foobbr', 'foob1r' etc. but not 'foob=r' etc.

TRegExpr uses the properties SpaceChars and WordChars to define the standard lists \w, \W, \s and \S, so you can easily redefine what these lists contain.
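
A short sketch of the standard lists, assuming Python's re module as a stand-in (its \w, \d and \s are fixed and cannot be redefined the way SpaceChars and WordChars allow). Since \w covers letters, digits and «_», the list [\w\s] accepts 'foob1r' as well.

```python
import re

candidates = ["foob1r", "foob6r", "foobar", "foobbr"]

# \d accepts a single digit.
digits = [c for c in candidates if re.fullmatch(r"foob\dr", c)]

mixed = ["foobar", "foob r", "foob1r", "foob=r"]

# [\w\s] accepts a letter, a digit, "_" or a space symbol.
word_or_space = [c for c in mixed if re.fullmatch(r"foob[\w\s]r", c)]
```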

Metasymbols - word boundaries

  • \b - matches at a word boundary
  • \B - matches anywhere except at a word boundary

A word boundary (\b) is a point between two symbols, one of which matches \w and the other \W (in either order); the positions before the beginning and after the end of the string are treated as \W.
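
As an illustration of word boundaries, here is a sketch in Python's re module (assumed as a stand-in for TRegExpr; the sample text is invented). It reports the offsets at which each pattern matches.

```python
import re

text = "foo foobar barfoo foo"

# \bfoo\b accepts 'foo' only as a separate word.
whole_word = [m.start() for m in re.finditer(r"\bfoo\b", text)]

# \Bfoo\b accepts 'foo' only at the end of a longer word.
word_tail = [m.start() for m in re.finditer(r"\Bfoo\b", text)]
```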

Metasymbols - repetition

Any item of a regular expression may be followed by another very important type of metasymbol, a repeater. It specifies the number of repetitions of the preceding symbol, metasymbol or subexpression.

  • * - zero or more times («greedy»), the same as {0,}
  • + - one or more times («greedy»), the same as {1,}
  • ? - zero or one («greedy»), the same as {0,1}
  • {n} - exactly n times («greedy»)
  • {n,} - not less than n times («greedy»)
  • {n,m} - not less than n but no more than m times («greedy»)
  • *? - zero or more times («non-greedy»), the same as {0,}?
  • +? - one or more times («non-greedy»), the same as {1,}?
  • ?? - zero or one («non-greedy»), the same as {0,1}?
  • {n}? - exactly n times («non-greedy»)
  • {n,}? - not less than n times («non-greedy»)
  • {n,m}? - not less than n but no more than m times («non-greedy»)

Thus {n,m} specifies a minimum of n and a maximum of m repetitions. The repeater {n} is equivalent to {n,n} and specifies exactly n repetitions. The repeater {n,} specifies n or more repetitions. In theory n and m are not limited, but large values are not recommended: because matching works recursively, such a repeater may in some situations require significant time and memory.

If a curly bracket occurs in the «wrong» place, where it cannot be interpreted as a repeater, it is treated as an ordinary symbol.

Examples:

  • foob.*r - finds 'foobar', 'foobalkjdflkj9r' and 'foobr'
  • foob.+r - finds 'foobar', 'foobalkjdflkj9r' but not 'foobr'
  • foob.?r - finds 'foobar', 'foobbr' and 'foobr' but not 'foobalkj9r'
  • fooba{2}r - finds 'foobaar'
  • fooba{2,}r - finds 'foobaar', 'foobaaar', 'foobaaaar' etc.
  • fooba{2,3}r - finds 'foobaar', or 'foobaaar' but not 'foobaaaar'
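
The repeater examples can be verified with a small sketch. It assumes Python's re module as a stand-in for TRegExpr and a hypothetical helper, finds, that requires the pattern to cover the whole word.

```python
import re

def finds(pattern, word):
    """Return True if the pattern matches the whole word (illustrative helper)."""
    return re.fullmatch(pattern, word) is not None

samples = ["foobar", "foobalkjdflkj9r", "foobr"]
star = [w for w in samples if finds(r"foob.*r", w)]   # zero or more symbols
plus = [w for w in samples if finds(r"foob.+r", w)]   # at least one symbol

optional = [w for w in ["foobar", "foobbr", "foobr", "foobalkj9r"]
            if finds(r"foob.?r", w)]                  # at most one symbol

counted = [w for w in ["foobar", "foobaar", "foobaaar", "foobaaaar"]
           if finds(r"fooba{2,3}r", w)]               # two or three 'a'
```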

A little explanation about «greediness». «Greedy» repeaters capture as much of the input text as possible, while «non-greedy» ones capture as little as possible. For example, matched against the run of 'b' symbols in the input string 'abbbbc', 'b+' and 'b*' both find 'bbbb', while 'b+?' finds 'b' and 'b*?' finds the empty string; 'b{2,3}?' finds 'bb', while 'b{2,3}' finds 'bbb'.

You can switch all repeaters in an expression to «non-greedy» mode by turning off the modifier /g.
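
The difference between greedy and non-greedy repeaters can be sketched like this. Python's re module is assumed as a stand-in; it has no counterpart of the modifier /g, so laziness is requested per repeater with «?». The hypothetical helper at_first_b starts matching at the first 'b', because a search from the start of 'abbbbc' would let 'b*' match the empty string before 'a'.

```python
import re

def at_first_b(pattern, text="abbbbc"):
    """Match the pattern starting at the first 'b' of the text (illustrative)."""
    return re.compile(pattern).match(text, text.index("b")).group(0)

greedy_plus = at_first_b(r"b+")        # takes the whole run of 'b'
greedy_star = at_first_b(r"b*")
lazy_plus = at_first_b(r"b+?")         # takes the minimum, one 'b'
lazy_star = at_first_b(r"b*?")         # takes the minimum, nothing
greedy_range = at_first_b(r"b{2,3}")   # takes the upper limit
lazy_range = at_first_b(r"b{2,3}?")    # takes the lower limit
```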

Metasymbols - variations

You can define a list of variations separated by the metasymbol «|»; for example, «fee|fie|foe» finds «fee», «fie» or «foe» (the same as «f(e|i|o)e»). The first variation includes everything from the preceding metasymbol «(» or «[», or from the beginning of the expression, up to the first «|»; the last one includes everything from the last «|» to the end of the expression or to the nearest metasymbol «)». To avoid confusion, a set of variations is usually enclosed in brackets, even where they could be omitted.

Variations are tried from left to right, and the search stops at the first one for which the entire remaining part of the expression matches (to learn more see The mechanism of operation). This means that variations do not necessarily behave «greedily». For example, if you apply the expression «foo|foot» to the input string «barefoot», only «foo» is found, because it is the first variation that lets the entire expression match.

Note that the metasymbol «|» is treated as an ordinary symbol inside symbol lists; for example, [fee|fie|foe] means the same as [feio|].

Examples:

  • foo(bar|foo) - finds 'foobar' or 'foofoo'.
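
A sketch of how the order of variations affects the result, assuming Python's re module as a stand-in for TRegExpr (it also tries variations from left to right). The reordered pattern «foot|foo» is added only for contrast.

```python
import re

# The first variation that lets the whole expression match wins.
first_wins = re.search(r"foo|foot", "barefoot").group(0)
reordered = re.search(r"foot|foo", "barefoot").group(0)

# Brackets limit the variations to a part of the expression.
grouped = [w for w in ["foobar", "foofoo", "foobaz"]
           if re.fullmatch(r"foo(bar|foo)", w)]

# Inside a symbol list, | is an ordinary symbol.
in_list = re.findall(r"[fee|fie|foe]", "f|x")
```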

Metasymbols - subexpressions

The metasymbols ( … ) can also be used to define subexpressions. After a search completes, you can refer to any subexpression through the properties MatchPos, MatchLen and Match, and substitute subexpressions into a template with the method Substitute.

Subexpressions are numbered from left to right, in the order of their opening brackets.

The first subexpression has the number '1' (the expression as a whole has the number '0'; in Substitute it can be accessed as '$0' or '$&').

Examples:

  • (foobar){8,10} - finds a string containing 8, 9 or 10 copies of 'foobar'
  • foob([0-9]|a+)r - finds 'foob0r', 'foob1r', 'foobar', 'foobaar', 'foobaaar' etc.
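
Subexpression numbering can be illustrated with the following sketch. It uses Python's re module, not TRegExpr: group(), start() and end() are assumed here only as loose analogues of Match, MatchPos and MatchLen, and re.sub with \1, \2 and \g<0> as a loose analogue of Substitute with $1, $2 and $0. Note that Python offsets are zero-based. The sample strings are invented.

```python
import re

m = re.search(r"foob([0-9]|a+)r", "xx foobaar yy")

whole = m.group(0)                 # subexpression 0: the entire match
first = m.group(1)                 # subexpression 1: the bracketed part
first_pos = m.start(1)             # zero-based offset of subexpression 1
first_len = m.end(1) - m.start(1)  # its length

# Subexpressions can be reused in a replacement template.
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1 [\g<0>]", "user@host")
```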

The metasymbols \1 to \9 are interpreted as backreferences. \<n> matches the text previously matched by subexpression #<n>.

Examples:

  • (.)\1+ - finds 'aaaa' and 'cc'
  • (.+)\1+ - also finds 'abab' and '123123'
  • (['"]?)(\d+)\1 - finds "13" (in double quotes), '4' (in single quotes), 77 (without quotes) etc.
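
The backreference examples can be tried with this sketch, which assumes Python's re module as a stand-in for TRegExpr. The quote pattern is written with a plain double quote inside the symbol list, and the sample inputs are invented.

```python
import re

# (.)\1+ finds a symbol followed by one or more copies of itself.
runs = [m.group(0) for m in re.finditer(r"(.)\1+", "aaaa b cc")]

# (.+)\1+ also accepts a repeated group of symbols.
repeated = [w for w in ["abab", "123123", "abc"]
            if re.fullmatch(r"(.+)\1+", w)]

# The closing quote must be the same as the opening one (or both absent).
quoted = re.compile(r"""(['"]?)(\d+)\1""")
numbers = [quoted.fullmatch(s).group(2) for s in ['"13"', "'4'", "77"]]
mismatch = quoted.fullmatch("\"13'")
```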

Modifiers

Modifiers change the operating modes of TRegExpr.

There are several ways to change modifiers.

Any modifier can be changed with the special construct (?…) inside the regular expression.

You can also assign a value to the corresponding property of a TRegExpr instance (for example, ModifierX to change the modifier /x, or ModifierStr to change several modifiers at once). Default values for new TRegExpr instances are defined by global constants; for example, RegExprModifierX defines the default value of ModifierX.

  • i - Case-insensitive mode (by default uses the default language selected by the OS; see InvertCase)
  • m - Treat the input text as multi-line: the metasymbols «^» and «$» match not only at the beginning and end of the whole text but also at the beginning and end of every line in it (see String delimiters)
  • s - Treat the input text as a single line: the metasymbol «.» matches any symbol. If this modifier is off, «.» does not match line separators.
  • g - Not a standard modifier. If it is turned off, all repeaters switch to «non-greedy» mode (by default this modifier is on). I.e. if you turn it off, every '+' works as '+?', every '*' as '*?', etc.
  • x - Allows you to format the pattern for easier reading (see the description below).
  • r - Not a standard modifier. If turned on, ranges of the form а-я also include 'ё', А-Я also include 'Ё', and а-Я includes all Russian letters.

The /x modifier makes TRegExpr ignore spaces, tabs and line separators in the expression, which lets you format it for readability. In addition, the symbol # starts a comment that runs to the end of the line, for example:

(
  (abc) # Comment 1
    |   # Spaces within expressions are also ignored
  (efg) # Comment 2
)

Naturally, this means that in extended (/x) mode a space, tab, line separator or # can be inserted into the expression only by prefixing it with '\' or by using \xnn (inside symbol lists all these symbols are treated as usual).
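
The extended mode can be sketched as follows. The example assumes Python's re module and its re.VERBOSE flag as a rough counterpart of /x; it is not TRegExpr code, and the sample inputs are invented.

```python
import re

# Whitespace and #-comments in the pattern are ignored in verbose mode.
pattern = re.compile(r"""
    (
      (abc)   # Comment 1
        |     # Spaces within the expression are ignored
      (efg)   # Comment 2
    )
""", re.VERBOSE)
hits = [m.group(0) for m in pattern.finditer("abc efg xyz")]

# A literal space or # must be escaped with a backslash ...
escaped = re.compile(r"foo\ bar\#", re.VERBOSE).fullmatch("foo bar#")

# ... or placed inside a symbol list, where it is taken as is.
in_list = re.compile(r"foo [ ] bar", re.VERBOSE).fullmatch("foo bar")
```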

Perl Extensions

(?imsxr-imsxr)

Changes the values of modifiers.

Examples:

  • (?i)Saint-Petersburg - finds 'Saint-petersburg' and 'Saint-Petersburg'
  • (?i)Saint-(?-i)Petersburg - finds 'Saint-Petersburg' but not 'Saint-petersburg'
  • (?i)(Saint-)?Petersburg - finds 'Saint-petersburg' and 'saint-petersburg'
  • ((?i)Saint-)?Petersburg - finds 'saint-Petersburg', but not 'saint-petersburg'
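
A comparable effect can be sketched in Python's re module, assumed here as a stand-in. Python does not allow (?i) or (?-i) in the middle of a pattern, so the sketch uses the scoped form (?i:…) to limit case-insensitivity to a part of the expression; this is a Python construct, not TRegExpr syntax.

```python
import re

# (?i) at the start applies to the whole expression.
everywhere = re.compile(r"(?i)Saint-Petersburg")

# (?i:...) applies only to the enclosed part.
prefix_only = re.compile(r"(?i:Saint-)Petersburg")
optional_prefix = re.compile(r"(?i:Saint-)?Petersburg")

whole_ci = everywhere.fullmatch("Saint-petersburg")
prefix_ci = prefix_only.fullmatch("saint-Petersburg")
tail_cs = prefix_only.fullmatch("Saint-petersburg")
no_prefix = optional_prefix.fullmatch("Petersburg")
opt_tail_cs = optional_prefix.fullmatch("saint-petersburg")
```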

(?#text)

The comment is simply ignored. Note that a comment of this kind cannot contain the symbol «)», because it is treated as the end of the comment.
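
A brief sketch of inline comments, assuming Python's re module as a stand-in (it supports the same (?#text) construct). The second pattern shows that the first «)» ends the comment, which in Python leaves an unbalanced bracket behind and makes the pattern invalid.

```python
import re

# The comment does not take part in matching.
commented = re.compile(r"foo(?#separator follows)-bar")
match = commented.fullmatch("foo-bar")

# A ")" inside the comment text closes the comment early.
try:
    re.compile(r"foo(?#a (nested) note)bar")
    nested_ok = True
except re.error:
    nested_ok = False
```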