Difference between revisions of "Perl regular expressions"

Revision as of 20:10, 9 August 2017

UltraEdit supports Perl-style regular expressions for searching, using the Boost C++ Libraries.

Note: in the following documentation, atom may refer to a single character, a marked sub-expression, or a character class.

In Perl regular expressions, all characters match themselves except for the following special characters:

Character	Meaning
.	Matches any single character except new lines
^	Matches start of line (anchor)
$	Matches end of line (anchor)
*	Matches 0 or more of the preceding atom
+	Matches 1 or more of the preceding atom
?	Matches 0 or 1 of the preceding atom
[]	Matches any character in the set. For example [a-d] would match a, b, c, and d – but not e.
()	Tags the enclosed atom for backreferencing
{n}	Matches the previous atom n times.
|	"or" operator. For example dog|cat would match either "dog" or "cat".
\	Escape character
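
The special characters in the table above can be tried out with a short sketch. It assumes Python's built-in re module purely for illustration: re is a different engine from the Boost library that UltraEdit uses, although the metacharacters shown here behave the same way in both. The sample strings are invented.

```python
import re

# Assumption: Python's re module stands in for UltraEdit's Boost engine.
samples = ["color", "colour", "colouur", "clr"]

# u? : zero or one "u"; ^ and $ anchor the match to the whole line.
optional_u = [s for s in samples if re.search(r"^colou?r$", s)]

# u+ : one or more "u".
one_or_more_u = [s for s in samples if re.search(r"^colou+r$", s)]

# . : any single character except a newline; {2} : exactly two of them.
two_any = [s for s in samples if re.search(r"^c.{2}or$", s)]

# \. : the escape character turns "." into a literal full stop.
literal_dot = (re.search(r"a\.b", "a.b") is not None
               and re.search(r"a\.b", "axb") is None)

print(optional_u)     # ['color', 'colour']
print(one_or_more_u)  # ['colour', 'colouur']
print(two_any)        # ['color']
```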

Marked sub-expressions
A section beginning ( and ending ) is a marked sub-expression. Whatever matched the sub-expression is split out in a separate field by the matching algorithms. Marked sub-expressions can also be repeated or referred to by a backreference.
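
As a rough illustration of marked sub-expressions and backreferences, the sketch below uses Python's re module. This is an assumption for demonstration only, since UltraEdit itself relies on Boost; the \1 backreference syntax is common to both. The input strings are invented.

```python
import re

# Assumption: Python's re module stands in for UltraEdit's Boost engine.
# Two marked sub-expressions capture the key and the value separately.
match = re.search(r"(\w+)=(\w+)", "width=80")
key, value = match.group(1), match.group(2)

# \1 is a backreference: it matches whatever the first sub-expression
# matched, so this pattern finds a word that is immediately repeated.
doubled = re.search(r"(\w+) \1", "it is is a test")

# A marked sub-expression can itself be repeated.
repeated = re.fullmatch(r"(ab)+", "ababab")

print(key, value)        # width 80
print(doubled.group(0))  # is is
```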

Alternation
The | operator will match either of its arguments, so for example: abc|def will match either "abc" or "def". Parentheses can be used to group alternations, for example: ab(d|ef) will match either "abd" or "abef". Empty alternatives are not allowed (these are almost always a mistake), but if you really want an empty alternative, use (?:) as a placeholder, for example:

"|abc" is not a valid expression, but
"(?:)|abc" is and is equivalent, also the expression:
"(?:abc)??" has exactly the same effect.
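
The alternation rules above can be sketched as follows. Python's re module is assumed here only as a convenient way to run the patterns. Note one difference: unlike the engine described above, Python accepts a bare empty alternative such as "|abc", so only the (?:) placeholder form is shown.

```python
import re

# Assumption: Python's re module stands in for UltraEdit's Boost engine.
# abc|def matches either complete alternative.
either = [w for w in ["abc", "def", "abd"] if re.fullmatch(r"abc|def", w)]

# Parentheses limit the alternation to part of the expression.
grouped = [w for w in ["abd", "abef", "abdef", "ab"]
           if re.fullmatch(r"ab(d|ef)", w)]

# (?:) is an empty non-marking group used as the empty alternative,
# so the pattern matches either nothing or "abc".
with_empty = [w for w in ["", "abc", "ab"] if re.fullmatch(r"(?:)|abc", w)]

print(either)      # ['abc', 'def']
print(grouped)     # ['abd', 'abef']
print(with_empty)  # ['', 'abc']
```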

Character sets
A character set is a bracket expression that starts with [ and ends with ]. It defines a set of characters and matches any single character that is a member of that set. A bracket expression may contain any combination of the following:

Single characters: For example [abc], will match any of the characters 'a', 'b', or 'c'.
Character ranges: For example [a-c] will match any single character in the range 'a' to 'c'. By default, for POSIX-Perl regular expressions, a character x is within the range y to z, if it collates within that range; this results in locale specific behavior.
Negation: If the bracket-expression begins with the ^ character, then it matches the complement of the characters it contains, for example [^a-c] matches any character that is not in the range a-c.
Character classes: An expression of the form [[:name:]] matches the named character class "name", for example [[:lower:]] matches any lower case character.
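
A minimal sketch of single characters, ranges and negation follows, assuming Python's re module for illustration. Python's re does not support the [[:name:]] form, so the last line uses the ASCII range [a-z] as an approximate stand-in for [[:lower:]]; that substitution is an assumption, not the behavior of UltraEdit's engine.

```python
import re

# Assumption: Python's re module stands in for UltraEdit's Boost engine.
text = "bead-42"

single = re.findall(r"[abc]", text)     # any of 'a', 'b' or 'c'
in_range = re.findall(r"[a-e]", text)   # any character from 'a' to 'e'
negated = re.findall(r"[^a-e]", text)   # any character NOT from 'a' to 'e'

# Approximation of [[:lower:]] for ASCII text only.
lower = re.findall(r"[a-z]", "Bead")

print(single)    # ['b', 'a']
print(in_range)  # ['b', 'e', 'a', 'd']
print(negated)   # ['-', '4', '2']
print(lower)     # ['e', 'a', 'd']
```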

Supported character class names (following the format [[:name:]])

Name	POSIX-standard	Description
alnum	Yes	Any alpha-numeric character.
alpha	Yes	Any alphabetic character.
blank	Yes	Any whitespace character that is not a line separator.
cntrl	Yes	Any control character.
d	No	Any decimal digit.
digit	Yes	Any decimal digit.
graph	Yes	Any graphical character.
l	No	Any lower case character.
lower	Yes	Any lower case character.
print	Yes	Any printable character.
punct	Yes	Any punctuation character.
s	No	Any whitespace character.
space	Yes	Any whitespace character.
unicode	No	Any extended character whose code point is above 255 in value.
u	No	Any upper case character.
upper	Yes	Any upper case character.
w	No	Any word character (alphanumeric characters plus the underscore).
word	No	Any word character (alphanumeric characters plus the underscore).
xdigit	Yes	Any hexadecimal digit character.

Escapes
Any special character preceded by an escape matches itself. The following escape sequences are also supported (all synonyms for single characters):

Escape	Character
\a	\a
\e	0x1B
\f	\f
\n	\n
\r	\r
\t	\t
\v	\v
\b	\b (but only inside a character class declaration).
\cX	An ASCII escape sequence - the character whose code point is X % 32
\xdd	A hexadecimal escape sequence - matches the single character whose code point is 0xdd.
\x{dddd}	A hexadecimal escape sequence - matches the single character whose code point is 0xdddd.
\0ddd	An octal escape sequence - matches the single character whose code point is 0ddd.
\N{name}	Matches the single character which has the symbolic name name. For example \N{newline} matches the single character \n.
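
A few of the escapes above can be checked with the sketch below, which assumes Python's re module for illustration. Python's re does not implement \cX, \e or \x{dddd}, so the \cX entry is illustrated only by working through its arithmetic (code point = X % 32) rather than by running a pattern.

```python
import re

# Assumption: Python's re module stands in for UltraEdit's Boost engine.
# \t denotes the tab character.
fields = re.split(r"\t", "name\tsize\tdate")

# \xdd : the character whose code point is 0xdd; 0x41 is "A".
hex_match = re.findall(r"\x41", "ABBA")

# Escaping a special character makes it match itself: \$ is a literal "$".
literal = re.findall(r"\$\d+", "cost: $15 or $20")

# \cX arithmetic from the table: "G" has code point 71, and 71 % 32 = 7,
# so \cG would denote the BEL character (the same character as \a).
ctrl_g = chr(ord("G") % 32)

print(fields)     # ['name', 'size', 'date']
print(hex_match)  # ['A', 'A']
print(literal)    # ['$15', '$20']
```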

"Single character" character classes
Any escaped character x, where x is the name of a character class, matches any character that is a member of that class. The upper-case form X of the same escape matches any character that is not in that class. The following are supported by default:

Escape sequence	Equivalent to
\d	[[:digit:]]
\l	[[:lower:]]
\s	[[:space:]]
\u	[[:upper:]]
\w	[[:word:]]
\D	[^[:digit:]]
\L	[^[:lower:]]
\S	[^[:space:]]
\U	[^[:upper:]]
\W	[^[:word:]]
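
The shorthand classes and their complements can be sketched as follows, assuming Python's re module for illustration. Only \d, \s, \w and their upper-case complements are shown, because Python's re does not provide \l, \u, \L or \U as character classes. The sample text is invented.

```python
import re

# Assumption: Python's re module stands in for UltraEdit's Boost engine.
text = "Room 101, floor_2"

digits = re.findall(r"\d+", text)      # runs of decimal digits
non_digits = re.findall(r"\D+", text)  # runs of anything that is not a digit
words = re.findall(r"\w+", text)       # alphanumerics plus the underscore
no_space = re.findall(r"\S+", text)    # runs of non-whitespace characters

# The upper-case escape is the complement of the lower-case one,
# so \D finds the same runs as the negated set [^\d].
same_as_negated_set = non_digits == re.findall(r"[^\d]+", text)

print(digits)      # ['101', '2']
print(non_digits)  # ['Room ', ', floor_']
print(words)       # ['Room', '101', 'floor_2']
print(no_space)    # ['Room', '101,', 'floor_2']
```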

Word Boundaries
The following escape sequences match the boundaries of words:

\<	Matches the start of a word.
\>	Matches the end of a word.
\b	Matches a word boundary (the start or end of a word).
\B	Matches only when not at a word boundary.
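
The difference between \b and \B can be seen in the short sketch below, which assumes Python's re module for illustration. Python's re has no \< or \> escapes, so only \b and \B are exercised. The sample text is invented.

```python
import re

# Assumption: Python's re module stands in for UltraEdit's Boost engine.
text = "cat concat cats bobcat"

# \b on both sides: "cat" only as a complete word.
whole_word = re.findall(r"\bcat\b", text)

# \b before: words that start with "cat".
starts_with = re.findall(r"\bcat\w*", text)

# \b after: words that end with "cat".
ends_with = re.findall(r"\w*cat\b", text)

# \B on both sides: "cat" only when it sits strictly inside a word.
inside = re.search(r"\Bcat\B", "concatenate cat")

print(whole_word)      # ['cat']
print(starts_with)     # ['cat', 'cats']
print(ends_with)       # ['cat', 'concat', 'bobcat']
print(inside.start())  # 3
```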

For further information on Perl regular expressions, please see:
