UltraEdit supports Perl-style regular expressions for search, using the Boost C++ Libraries (http://www.boost.org/doc/libs/1_50_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html).
Note: in the following documentation, atom may refer to a single character, a marked sub-expression, or a character class.
In Perl regular expressions, all characters match themselves except for the following special characters:
Character | Meaning |
---|---|
`.` | Matches any single character except a newline |
`^` | Matches the start-of-line position (anchor) |
`$` | Matches the end-of-line position (anchor) |
`*` | Matches 0 or more of the preceding atom |
`+` | Matches 1 or more of the preceding atom |
`?` | Matches 0 or 1 of the preceding atom |
`[]` | Matches any single character in the set. For example, `[a-d]` matches a, b, c, or d, but not e. |
`()` | Tags the enclosed atom for backreferencing |
`{n}` | Matches the preceding atom exactly n times |
`|` | "or" operator. For example, `dog|cat` matches either "dog" or "cat" |
`\` | Escape character |
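
The special characters in the table above can be tried out in a short script. The sketch below uses Python's `re` module as a stand-in, on the assumption that it treats these basic metacharacters the same way; it is not the Boost engine that UltraEdit uses, and the sample strings are made up for illustration. Note that in Python `^` and `$` only anchor at every line when `re.MULTILINE` is set, so the samples are single-line.

```python
import re

# Each entry: (pattern, sample text, matches we expect re.findall to return).
# The sample texts are hypothetical and exist only for this illustration.
samples = [
    (r"c.t", "cat cot ct", ["cat", "cot"]),        # . matches any single character
    (r"^ab", "abab", ["ab"]),                      # ^ anchors at the start of the line
    (r"ab$", "abab", ["ab"]),                      # $ anchors at the end of the line
    (r"ab*", "a ab abbb", ["a", "ab", "abbb"]),    # * means 0 or more of the atom "b"
    (r"ab+", "a ab abbb", ["ab", "abbb"]),         # + means 1 or more
    (r"ab?", "a ab abbb", ["a", "ab", "ab"]),      # ? means 0 or 1
    (r"[a-d]", "abcde", ["a", "b", "c", "d"]),     # any character in the set
    (r"a{3}", "aa aaa", ["aaa"]),                  # exactly 3 repetitions
    (r"dog|cat", "dog cat cow", ["dog", "cat"]),   # alternation
    (r"1\+1", "1+1 11", ["1+1"]),                  # \ makes + match itself
]

results = [re.findall(pattern, text) for pattern, text, _ in samples]

for (pattern, text, _), found in zip(samples, results):
    print(f"{pattern!r:12} on {text!r:14} -> {found}")
```
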
Marked sub-expressions
A section beginning with a `(` and ending with a `)` is a marked sub-expression. Whatever matches this sub-expression is split into a separate field by the matching algorithms. Marked sub-expressions can also be repeated or referred to by a backreference.
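
As a rough illustration of the three uses just described (separate fields, repetition, and backreferences), here is a minimal sketch in Python's `re` module. Python is only a stand-in for the Boost engine, and the sample strings are hypothetical.

```python
import re

# Two marked sub-expressions: each one is reported as its own field.
m = re.match(r"(\w+)=(\w+)", "color=red")
key, value = m.group(1), m.group(2)

# A marked sub-expression can be repeated as a single atom.
repeated = re.fullmatch(r"(ab)+", "ababab") is not None

# \1 is a backreference: it matches the same text that group 1 matched,
# so this pattern finds words that are immediately repeated.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it it is")

print(key, value, repeated, doubled)
```
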
Alternation
The `|` operator matches either of its arguments; for example, `abc|def` matches either "abc" or "def". Parentheses can be used to group alternations; for example, `ab(d|ef)` matches either "abd" or "abef". Empty alternatives are not allowed (these are almost always a mistake), but if you really want an empty alternative, use `(?:)` as a placeholder. For example:
* `|abc` is not a valid expression, but
* `(?:)|abc` is, and is equivalent; the expression
* `(?:abc)??` has exactly the same effect.

Character sets
A character set is a bracketed expression starting with `[` and ending with `]` which defines a set of characters, and matches any single character that is a member of that set.
This bracketed expression may contain any combination of the following:
* Single characters: For example, `[abc]` matches any of the characters "a", "b", or "c".
* Character ranges: For example, `[a-c]` matches any single character in the range "a" to "c". By default, for POSIX-Perl regular expressions, a character "x" is within the range "y" to "z" if it collates within that range; this results in locale-specific behavior.
* Negation: If the bracket expression begins with the `^` character, it matches the complement of the characters it contains; for example, `[^a-c]` matches any character that is not in the range "a-c".
* Character classes: An expression of the form `[[:name:]]` matches the named character class "name"; for example, `[[:lower:]]` matches any lower case character. You can see all available character class names below, or in Boost's Perl library online documentation (https://www.boost.org/doc/libs/1_50_0/libs/regex/doc/html/boost_regex/syntax/character_classes.html).

Supported character class names (following the format `[[:name:]]`)
Name | POSIX-standard | Description |
---|---|---|
alnum | Yes | Any alpha-numeric character |
alpha | Yes | Any alphabetic character |
blank | Yes | Any whitespace character that is not a line separator |
cntrl | Yes | Any control character |
d | No | Any decimal digit |
digit | Yes | Any decimal digit |
graph | Yes | Any graphical character |
l | No | Any lower case character |
lower | Yes | Any lower case character |
print | Yes | Any printable character |
punct | Yes | Any punctuation character |
s | No | Any whitespace character |
space | Yes | Any whitespace character |
unicode | No | Any extended character whose code point is above 255 in value |
u | No | Any upper case character |
upper | Yes | Any upper case character |
w | No | Any word character (alphanumeric characters plus the underscore) |
word | No | Any word character (alphanumeric characters plus the underscore) |
xdigit | Yes | Any hexadecimal digit character |
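
The bracket-expression forms described above (single characters, ranges, and negation) can be sketched as follows. This uses Python's `re` module as a stand-in; Python does not support the `[[:name:]]` class syntax, so the last two patterns use explicit ASCII ranges as rough approximations of `[[:lower:]]` and `[[:xdigit:]]`. The sample text is hypothetical.

```python
import re

text = "abcdeXYZ_09"  # made-up sample text

single = re.findall(r"[abc]", text)      # any one of a, b, c
in_range = re.findall(r"[a-c]", text)    # any character from a to c
negated = re.findall(r"[^a-c]", text)    # any character NOT from a to c

# Assumption: ASCII-only approximations of the named classes, because
# Python's re has no [[:lower:]] or [[:xdigit:]].
lower = re.findall(r"[a-z]", text)
xdigit = re.findall(r"[0-9A-Fa-f]", text)

print(single, in_range, negated, lower, xdigit, sep="\n")
```
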
Escapes
Any special character preceded by an escape matches itself. The following escape sequences are also supported (all synonyms for single characters):
Escape | Character |
---|---|
`\a` | \a (bell, 0x07) |
`\e` | 0x1B (escape) |
`\f` | \f (form feed, 0x0C) |
`\n` | \n (line feed, 0x0A) |
`\r` | \r (carriage return, 0x0D) |
`\t` | \t (horizontal tab, 0x09) |
`\v` | \v (vertical tab, 0x0B) |
`\b` | \b (backspace, 0x08), but only inside a character class declaration |
`\cX` | An ASCII escape sequence: the character whose code point is the code point of X modulo 32 |
`\xdd` | A hexadecimal escape sequence: matches the single character whose code point is 0xdd |
`\x{dddd}` | A hexadecimal escape sequence: matches the single character whose code point is 0xdddd |
`\0ddd` | An octal escape sequence: matches the single character whose code point is 0ddd |
`\N{name}` | Matches the single character which has the symbolic name "name". For example, `\N{newline}` matches the single character \n. |
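
A few of the escapes above can be checked with a short script, including the modulo-32 arithmetic behind `\cX`. The sketch below uses Python's `re` module as a stand-in; Python does not accept `\e`, `\cX`, or `\x{dddd}` in patterns, so `\cX` is only computed (by the hypothetical helper `control_code`) and `\x1B` is used in place of `\e`.

```python
import re

def control_code(letter: str) -> int:
    """Code point that \\cX denotes, per the table: code point of X modulo 32."""
    return ord(letter) % 32

ctrl_a = control_code("A")       # 65 % 32
ctrl_j = control_code("J")       # 74 % 32, the line feed character
ctrl_j_lower = control_code("j") # 106 % 32, same result as upper case

# An escaped special character matches itself.
literal_dot = re.findall(r"a\.b", "a.b axb")

# \t matches a tab; \x41 matches the character with code point 0x41 ("A").
tab_hits = re.findall(r"\t", "a\tb\tc")
hex_hits = re.findall(r"\x41", "ABA")

# \x1B stands in for \e here: it matches the escape character.
esc_hit = re.search(r"\x1B\[", "\x1b[0m") is not None

# Inside a character class, \b means backspace rather than word boundary.
backspace_hits = re.findall(r"[\b]", "a\bb")

print(ctrl_a, ctrl_j, ctrl_j_lower, literal_dot, len(tab_hits), hex_hits, esc_hit)
```
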
"Single character" character classes
Any escaped character "x", if "x" is the name of a character class, matches any character that is a member of that class. Any escaped upper-case character "X", if the lower-case "x" is the name of a character class, matches any character not in that class. The following are supported by default:
Escape sequence | Equivalent to |
---|---|
`\d` | `[[:digit:]]` |
`\l` | `[[:lower:]]` |
`\s` | `[[:space:]]` |
`\u` | `[[:upper:]]` |
`\w` | `[[:word:]]` |
`\D` | `[^[:digit:]]` |
`\L` | `[^[:lower:]]` |
`\S` | `[^[:space:]]` |
`\U` | `[^[:upper:]]` |
`\W` | `[^[:word:]]` |
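
The lower-case and upper-case pairs in this table can be compared side by side. The sketch below again uses Python's `re` module as a stand-in: it supports `\d`, `\s`, `\w` and their negations, but not `\l`, `\u`, `\L`, or `\U`, so ASCII ranges are used as assumed approximations for those. The sample text is hypothetical.

```python
import re

text = "Order 66, aisle B_7"  # made-up sample text

digits = re.findall(r"\d+", text)       # runs of digits
non_digits = re.findall(r"\D+", text)   # runs of anything except digits
words = re.findall(r"\w+", text)        # runs of word characters (includes _)
non_word = re.findall(r"\W", text)      # single non-word characters
spaces = re.findall(r"\s", text)        # whitespace characters
non_space = re.findall(r"\S+", text)    # runs of non-whitespace

# Assumption: ASCII stand-ins for \l and \u, which Python's re lacks.
lower_runs = re.findall(r"[a-z]+", text)
upper = re.findall(r"[A-Z]", text)

print(digits, non_digits, words, non_word, spaces, non_space, lower_runs, upper, sep="\n")
```
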
Assertions
Besides `^` and `$`, Perl regular expressions support the following zero-width assertions:
Assertion | Meaning |
---|---|
`\<` | Matches the start of a word. |
`\>` | Matches the end of a word. |
`\b` | Matches a word boundary (the start or end of a word). |
`\B` | Matches only when not at a word boundary. |
`\A` | Matches the beginning of the file. |
`\Z` | Matches the position after the last non-newline character in the file (that is, before any trailing newlines). |
`\z` | Matches the end of the file. |
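
Because these assertions match positions rather than characters, it can help to look at where they succeed. The sketch below uses Python's `re` module as a stand-in, with some caveats that are assumptions about Python rather than statements about UltraEdit: Python has no `\<` or `\>`, so they are emulated here with `\b` plus a lookaround (the names `WORD_START` and `WORD_END` are hypothetical), and Python's `\Z` means the absolute end of the text, which corresponds to `\z` in the table above.

```python
import re

text = "cat concat cats\n"  # made-up sample text ending in a newline

# \b and \B: word boundary versus not a word boundary.
whole_word = re.findall(r"\bcat\b", text)   # "cat" as a complete word
inside_word = re.findall(r"\Bcat", text)    # "cat" not at the start of a word

# Emulation of \< and \>: a boundary followed by / preceded by a word character.
WORD_START = r"\b(?=\w)"
WORD_END = r"\b(?<=\w)"
starts = [m.start() for m in re.finditer(WORD_START, text)]
ends = [m.start() for m in re.finditer(WORD_END, text)]

# \A matches only at the very beginning, even in multi-line mode,
# whereas ^ matches at the start of every line.
only_first = re.findall(r"\Acat", "cat\ncat", re.MULTILINE)
every_line = re.findall(r"^cat", "cat\ncat", re.MULTILINE)

# In Python, $ (without MULTILINE) also matches just before a final newline,
# while \Z matches only at the absolute end of the text.
before_final_newline = re.search(r"cats$", text) is not None
absolute_end = re.search(r"cats\Z", text) is not None

print(whole_word, inside_word, starts, ends)
print(only_first, every_line, before_final_newline, absolute_end)
```
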