Latest revision as of 19:10, 7 June 2018

UltraEdit supports Perl-style regular expressions for search using the Boost C++ Libraries (http://www.boost.org/doc/libs/1_50_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html).

Note: in the following documentation, atom may refer to a single character, a marked sub-expression, or a character class.

In Perl regular expressions, all characters match themselves except for the following special characters:

Character   Meaning
.           Matches any single character except a newline
^           Matches the start-of-line position (anchor)
$           Matches the end-of-line position (anchor)
*           Matches 0 or more of the preceding atom
+           Matches 1 or more of the preceding atom
?           Matches 0 or 1 of the preceding atom
[]          Matches any character in the set. For example, [a-d] matches a, b, c, or d, but not e.
()          Tags the enclosed atom for backreferencing
{n}         Matches the preceding atom exactly n times
|           "Or" operator. For example, dog|cat matches either "dog" or "cat"
\           Escape character
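
As a rough illustration of the special characters listed above, the sketch below tries each one on made-up sample strings. It uses Python's built-in re module as a stand-in engine, which is an assumption: UltraEdit uses Boost's Perl engine, and the two differ in places, although these basic constructs are common to both. In Python, ^ and $ only match at line positions when re.MULTILINE is set.

```python
import re

# Stand-in engine: Python's re module, not the Boost engine UltraEdit uses.
text = "one two\nthree four"

any_char = re.findall(r"b.t", "bat bit b\nt")          # . does not match the newline
line_start = re.findall(r"^\w+", text, re.MULTILINE)   # ^ anchors at each line start
line_end = re.findall(r"\w+$", text, re.MULTILINE)     # $ anchors at each line end
zero_or_more = re.findall(r"ab*", "a ab abbb")         # * : zero or more "b"
one_or_more = re.findall(r"ab+", "a ab abbb")          # + : one or more "b"
zero_or_one = re.findall(r"ab?", "a ab abbb")          # ? : zero or one "b"
exactly_three = re.findall(r"a{3}", "aa aaa")          # {n}: exactly three "a"
escaped_plus = re.findall(r"1\+1", "1+1 11")           # \ makes + a literal character

for name, result in [
    ("any_char", any_char), ("line_start", line_start), ("line_end", line_end),
    ("zero_or_more", zero_or_more), ("one_or_more", one_or_more),
    ("zero_or_one", zero_or_one), ("exactly_three", exactly_three),
    ("escaped_plus", escaped_plus),
]:
    print(name, result)
```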


Marked sub-expressions
A section beginning with a ( and ending with a ) is a marked sub-expression. Whatever matches this sub-expression is split into a separate field by the matching algorithms. Marked sub-expressions can also be repeated or referred to by a backreference.
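
The sketch below shows the three uses of a marked sub-expression described above: capturing into a separate field, repeating, and backreferencing. It is written against Python's re module as a stand-in (an assumption, since UltraEdit uses the Boost engine); the sample strings are invented, and \1 is the backreference notation Python uses inside a pattern.

```python
import re

# Stand-in engine: Python's re module, not Boost.
# 1. Each marked sub-expression is reported as its own field (group).
m = re.search(r"(\w+)@(\w+)", "mail user@example today")
whole, first, second = m.group(0), m.group(1), m.group(2)

# 2. A marked sub-expression can be repeated as a unit.
repeated = re.fullmatch(r"(ab){3}", "ababab") is not None

# 3. A backreference (\1) matches the same text the first group matched,
#    which makes it easy to spot doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it is is the the end")

print(whole, first, second, repeated, doubled)
```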

Alternation
The | operator matches either of its arguments; for example, abc|def matches either "abc" or "def". Parentheses can be used to group alternations; for example, ab(d|ef) matches either "abd" or "abef". Empty alternatives are not allowed (these are almost always a mistake), but if you really want an empty alternative, use (?:) as a placeholder. For example:

  • |abc is not a valid expression, but
  • (?:)|abc is valid and is equivalent; the expression
  • (?:abc)?? has exactly the same effect.
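
A minimal sketch of the alternation rules above, again using Python's re module as a stand-in for the Boost engine (an assumption). One difference to keep in mind: Python accepts a bare |abc, so the sketch only demonstrates that the (?:) placeholder and the (?:abc)?? form accept the same strings. The helper name fully_matches is hypothetical.

```python
import re

def fully_matches(pattern: str, s: str) -> bool:
    """Return True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# Plain alternation: either argument may match.
simple = [s for s in ("abc", "def", "abd") if fully_matches(r"abc|def", s)]

# Parentheses limit the alternation to part of the expression.
grouped = [s for s in ("abd", "abef", "abe", "ab") if fully_matches(r"ab(d|ef)", s)]

# Empty alternative via the (?:) placeholder, and the equivalent lazy form.
placeholder = [s for s in ("", "abc", "ab") if fully_matches(r"(?:)|abc", s)]
lazy_optional = [s for s in ("", "abc", "ab") if fully_matches(r"(?:abc)??", s)]

print(simple, grouped, placeholder, lazy_optional)
```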

Character sets
A character set is a bracketed expression starting with [ and ending with ] which defines a set of characters, and matches any single character that is a member of that set. This bracketed expression may contain any combination of the following:

  • Single characters: For example, [abc] matches any of the characters "a", "b", or "c".
  • Character ranges: For example, [a-c] matches any single character in the range "a" to "c". By default, for POSIX-Perl regular expressions, a character x is within the range y to z if it collates within that range; this results in locale-specific behavior.
  • Negation: If the bracketed expression begins with the ^ character, it matches the complement of the characters it contains. For example, [^a-c] matches any character that is not in the range "a" to "c".
  • Character classes: An expression of the form [[:name:]] matches the named character class "name". For example, [[:lower:]] matches any lower case character. You can see all available character class names below, or in Boost's online documentation (https://www.boost.org/doc/libs/1_50_0/libs/regex/doc/html/boost_regex/syntax/character_classes.html).
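
The first three kinds of set content can be tried out with a short sketch. It uses Python's re module as a stand-in for the Boost engine (an assumption); Python compares range endpoints by code point rather than by collation, so the locale-specific behavior mentioned above is not reproduced here. Named classes such as [[:lower:]] are not available in Python's re and are left out.

```python
import re

# Stand-in engine: Python's re module, not Boost.
sample = "abcdef"

single = re.findall(r"[abc]", sample)     # any one of the listed characters
ranged = re.findall(r"[a-c]", sample)     # any character from "a" to "c"
negated = re.findall(r"[^a-c]", sample)   # any character outside that range

print(single, ranged, negated)
```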

Supported character class names (following the format [[:name:]])

Name      POSIX-standard   Description
alnum     Yes              Any alpha-numeric character
alpha     Yes              Any alphabetic character
blank     Yes              Any whitespace character that is not a line separator
cntrl     Yes              Any control character
d         No               Any decimal digit
digit     Yes              Any decimal digit
graph     Yes              Any graphical character
l         No               Any lower case character
lower     Yes              Any lower case character
print     Yes              Any printable character
punct     Yes              Any punctuation character
s         No               Any whitespace character
space     Yes              Any whitespace character
unicode   No               Any extended character whose code point is above 255 in value
u         No               Any upper case character
upper     Yes              Any upper case character
w         No               Any word character (alphanumeric characters plus the underscore)
word      No               Any word character (alphanumeric characters plus the underscore)
xdigit    Yes              Any hexadecimal digit character
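
To experiment with the [[:name:]] notation in an engine that lacks it, one option is to expand a few class names into explicit ranges first. The sketch below does this for Python's re module. Everything in it is an assumption made for illustration: the helper expand_posix_classes and the ASCII_CLASSES table are hypothetical, cover only some of the names above, and handle ASCII only, whereas the Boost engine used by UltraEdit understands these classes natively.

```python
import re

# Hypothetical ASCII-only translations for a few of the class names above.
ASCII_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] inside a bracketed expression with an ASCII range.

    Raises KeyError for class names missing from ASCII_CLASSES.
    """
    return re.sub(r"\[:([a-z]+):\]", lambda m: ASCII_CLASSES[m.group(1)], pattern)

lower_run = expand_posix_classes("[[:lower:]]+")    # becomes "[a-z]+"
not_digit = expand_posix_classes("[^[:digit:]]")    # becomes "[^0-9]"
lower_words = re.findall(lower_run, "Hello World")  # runs of lower case letters

print(lower_run, not_digit, lower_words)
```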


Escapes
Any special character preceded by an escape matches itself. The following escape sequences are also supported (all synonyms for single characters):

Escape     Character
\a         \a
\e         0x1B
\f         \f
\n         \n
\r         \r
\t         \t
\v         \v
\b         \b (but only inside a character class declaration)
\cX        An ASCII escape sequence: the character whose code point is X % 32
\xdd       A hexadecimal escape sequence: matches the single character whose code point is 0xdd
\x{dddd}   A hexadecimal escape sequence: matches the single character whose code point is 0xdddd
\0ddd      An octal escape sequence: matches the single character whose code point is 0ddd
\N{name}   Matches the single character that has the symbolic name "name". For example, \N{newline} matches the single character \n.
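
The numeric escapes above come down to simple code point arithmetic, which the sketch below works through. It uses Python as a stand-in (an assumption): Python's re module supports \xdd but not \e, \cX, \x{dddd}, or \0ddd in the forms listed, so those are computed with ordinary functions instead of being used in a pattern. The helper name control_char is hypothetical.

```python
import re

def control_char(letter: str) -> str:
    # Character denoted by a control escape for this letter,
    # following the rule above: code point of the letter modulo 32.
    return chr(ord(letter) % 32)

tab_from_ctrl = control_char("I")    # ord("I") is 73, and 73 % 32 == 9 (tab)
cr_from_ctrl = control_char("M")     # ord("M") is 77, and 77 % 32 == 13 (carriage return)

hex_match = re.fullmatch(r"\x41", "A") is not None       # 0x41 is "A"
esc_match = re.fullmatch(r"\x1b", chr(0x1B)) is not None  # 0x1B, the character \e stands for
octal_char = chr(int("101", 8))                           # octal 101 is 65, i.e. "A"
literal_dots = re.findall(r"\.", "a.b.c")                 # escaped special character matches itself

print(repr(tab_from_ctrl), repr(cr_from_ctrl), hex_match, esc_match, octal_char, literal_dots)
```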

"Single character" character classes
An escaped lower case character \x, where x is the name of a character class, matches any character that is a member of that class. The corresponding upper case escape \X matches any character not in that class. The following are supported by default:

Escape sequence   Equivalent to
\d                [[:digit:]]
\l                [[:lower:]]
\s                [[:space:]]
\u                [[:upper:]]
\w                [[:word:]]
\D                [^[:digit:]]
\L                [^[:lower:]]
\S                [^[:space:]]
\U                [^[:upper:]]
\W                [^[:word:]]
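
The equivalences above can be checked mechanically for the escapes that Python's re module shares with the Boost engine. This is a sketch under assumptions: Python is only a stand-in, it has no \l, \u, \L, or \U class escapes, the comparison is limited to ASCII (re.ASCII), and explicit ranges replace the [[:name:]] forms. The helper name same_set is hypothetical.

```python
import re

def same_set(escape: str, bracket: str) -> bool:
    """True when both patterns accept exactly the same ASCII characters."""
    return all(
        (re.fullmatch(escape, ch, re.ASCII) is not None)
        == (re.fullmatch(bracket, ch) is not None)
        for ch in map(chr, range(128))
    )

digit_ok = same_set(r"\d", r"[0-9]")
word_ok = same_set(r"\w", r"[A-Za-z0-9_]")
non_digit_ok = same_set(r"\D", r"[^0-9]")

# Typical use: upper case escapes pick out everything the lower case ones skip.
numbers = re.findall(r"\d+", "a1 b22 c333")
non_space = re.findall(r"\S+", "a b\tc")

print(digit_ok, word_ok, non_digit_ok, numbers, non_space)
```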

Assertions
Besides ^ and $, Perl regular expressions support the following zero-width assertions:

\<   Matches the start of a word.
\>   Matches the end of a word.
\b   Matches a word boundary (the start or end of a word).
\B   Matches only when not at a word boundary.
\A   Matches the beginning of the file.
\Z   Matches the position of the last non-newline character in the file.
\z   Matches the end of the file.
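
The sketch below illustrates these assertions with Python's re module as a stand-in for the Boost engine (an assumption). Python has no \< or \>, so they are emulated here with \b plus a lookahead or lookbehind. Python's \Z matches only at the very end of the input, which corresponds to \z in the table above rather than to \Z.

```python
import re

# Stand-in engine: Python's re module, not Boost.
whole_word = re.findall(r"\bcat\b", "cat concat cats cat.")   # "cat" as a complete word
inside_word = re.findall(r"\Bcat", "cat concat")              # "cat" not at a word start

# Emulated word-start and word-end assertions (zero-width, so sub inserts markers).
word_starts = re.sub(r"\b(?=\w)", "<", "hi there")
word_ends = re.sub(r"\b(?<=\w)", ">", "hi there")

# \A matches only at the very beginning, even in multi-line mode, unlike ^.
first_only = re.findall(r"\A\w+", "one\ntwo", re.MULTILINE)
every_line = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# End of input (Python spells this \Z).
at_end = re.search(r"two\Z", "one\ntwo") is not None

print(whole_word, inside_word, word_starts, word_ends, first_only, every_line, at_end)
```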

For further information on Perl regular expressions, please see:
