re_syntax.n
'\"
'\" Copyright (c) 1998 Sun Microsystems, Inc.
'\" Copyright (c) 1999 Scriptics Corporation
'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
'\" RCS: @(#) $Id: re_syntax.n,v 1.3.32.1 2006/04/12 02:19:53 das Exp $
'\"
.so man.macros
.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
.BS
.SH NAME
re_syntax \- Syntax of Tcl regular expressions.
.BE
.SH DESCRIPTION
.PP
A \fIregular expression\fR describes strings of characters.
It's a pattern that matches certain strings and doesn't match others.
.SH "DIFFERENT FLAVORS OF REs"
Regular expressions (``RE''s), as defined by POSIX, come in two
flavors: \fIextended\fR REs (``EREs'') and \fIbasic\fR REs (``BREs'').
EREs are roughly those of the traditional \fIegrep\fR, while BREs are
roughly those of the traditional \fIed\fR.  This implementation adds
a third flavor, \fIadvanced\fR REs (``AREs''), basically EREs with
some significant extensions.
.PP
This manual page primarily describes AREs.  BREs mostly exist for
backward compatibility in some old programs; they will be discussed at
the end.  POSIX EREs are almost an exact subset of AREs.  Features of
AREs that are not present in EREs will be indicated.
.SH "REGULAR EXPRESSION SYNTAX"
.PP
Tcl regular expressions are implemented using the package written by
Henry Spencer, based on the 1003.2 spec and some (not quite all) of
the Perl5 extensions (thanks, Henry!).  Much of the description of
regular expressions below is copied verbatim from his manual entry.
.PP
An ARE is one or more \fIbranches\fR,
separated by `\fB|\fR',
matching anything that matches any of the branches.
.PP
A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
concatenated.
It matches a match for the first, followed by a match for the second, etc.;
an empty branch matches the empty string.
.PP
A quantified atom is an \fIatom\fR possibly followed
by a single \fIquantifier\fR.
Without a quantifier, it matches a match for the atom.
The quantifiers,
and what a so-quantified atom matches, are:
.RS 2
.TP 6
\fB*\fR
a sequence of 0 or more matches of the atom
.TP
\fB+\fR
a sequence of 1 or more matches of the atom
.TP
\fB?\fR
a sequence of 0 or 1 matches of the atom
.TP
\fB{\fIm\fB}\fR
a sequence of exactly \fIm\fR matches of the atom
.TP
\fB{\fIm\fB,}\fR
a sequence of \fIm\fR or more matches of the atom
.TP
\fB{\fIm\fB,\fIn\fB}\fR
a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
\fIm\fR may not exceed \fIn\fR
.TP
\fB*?  +?  ??  {\fIm\fB}?  {\fIm\fB,}?  {\fIm\fB,\fIn\fB}?\fR
\fInon-greedy\fR quantifiers,
which match the same possibilities,
but prefer the smallest number rather than the largest number
of matches (see MATCHING)
.RE
.PP
The forms using
\fB{\fR and \fB}\fR
are known as \fIbound\fRs.
The numbers
\fIm\fR and \fIn\fR are unsigned decimal integers
with permissible values from 0 to 255 inclusive.
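.PP
A minimal sketch of a bound and its non-greedy counterpart, assuming
Tcl 8.3 or later (for the \fB\-inline\fR switch of \fBregexp\fR);
the variable names are illustrative only:

```tcl
# Greedy bound: takes as many a's as the bound allows (here 3).
set greedy [regexp -inline {a{2,3}} "aaaa"]

# Non-greedy bound: same possibilities, but prefers the fewest (here 2).
set lazy [regexp -inline {a{2,3}?} "aaaa"]

puts "greedy=$greedy lazy=$lazy"
```

Both REs can match two or three characters at the same starting point;
only the preference differs.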
.PP
An atom is one of:
.RS 2
.TP 6
\fB(\fIre\fB)\fR
(where \fIre\fR is any regular expression)
matches a match for
\fIre\fR, with the match noted for possible reporting
.TP
\fB(?:\fIre\fB)\fR
as previous,
but does no reporting
(a ``non-capturing'' set of parentheses)
.TP
\fB()\fR
matches an empty string,
noted for possible reporting
.TP
\fB(?:)\fR
matches an empty string,
without reporting
.TP
\fB[\fIchars\fB]\fR
a \fIbracket expression\fR,
matching any one of the \fIchars\fR (see BRACKET EXPRESSIONS for more detail)
.TP
\fB.\fR
matches any single character
.TP
\fB\e\fIk\fR
(where \fIk\fR is a non-alphanumeric character)
matches that character taken as an ordinary character,
e.g. \e\e matches a backslash character
.TP
\fB\e\fIc\fR
where \fIc\fR is alphanumeric
(possibly followed by other characters),
an \fIescape\fR (AREs only),
see ESCAPES below
.TP
\fB{\fR
when followed by a character other than a digit,
matches the left-brace character `\fB{\fR';
when followed by a digit, it is the beginning of a
\fIbound\fR (see above)
.TP
\fIx\fR
where \fIx\fR is
a single character with no other significance, matches that character.
.RE
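.PP
As an illustration of capturing versus non-capturing parentheses and of
a backslash-quoted ordinary character, the following sketch assumes
Tcl 8.3 or later; the variable names are illustrative only:

```tcl
# (a) and (c) are noted for reporting; (?:b) groups without reporting.
set parts [regexp -inline {(a)(?:b)(c)} "abc"]
puts $parts   ;# whole match first, then the two reported substrings

# A backslash before a non-alphanumeric makes it ordinary:
# \. matches only a literal dot, not "any single character".
set dot   [regexp {^a\.c$} "a.c"]
set nodot [regexp {^a\.c$} "abc"]
puts "dot=$dot nodot=$nodot"
```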
.PP
A \fIconstraint\fR matches an empty string when specific conditions
are met.
A constraint may not be followed by a quantifier.
The simple constraints are as follows; some more constraints are
described later, under ESCAPES.
.RS 2
.TP 8
\fB^\fR
matches at the beginning of a line
.TP
\fB$\fR
matches at the end of a line
.TP
\fB(?=\fIre\fB)\fR
\fIpositive lookahead\fR (AREs only), matches at any point
where a substring matching \fIre\fR begins
.TP
\fB(?!\fIre\fB)\fR
\fInegative lookahead\fR (AREs only), matches at any point
where no substring matching \fIre\fR begins
.RE
.PP
The lookahead constraints may not contain back references (see later),
and all parentheses within them are considered non-capturing.
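.PP
A short sketch of the two lookahead constraints, assuming Tcl 8.3 or
later; the sample strings and variable names are illustrative only:

```tcl
# Positive lookahead: "foo" matches only where "bar" begins right after it.
# The lookahead itself consumes nothing, so only "foo" is reported.
set pos [regexp -inline {foo(?=bar)} "foobar"]

# Negative lookahead: "foo" must NOT be followed by "bar".
set neg_fail [regexp {foo(?!bar)} "foobar"]
set neg_ok   [regexp -inline {foo(?!bar)} "foobaz"]

puts "pos=$pos neg_fail=$neg_fail neg_ok=$neg_ok"
```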
.PP
An RE may not end with `\fB\e\fR'.
.SH "BRACKET EXPRESSIONS"
A \fIbracket expression\fR is a list of characters enclosed in `\fB[\|]\fR'.
It normally matches any single character from the list (but see below).
If the list begins with `\fB^\fR',
it matches any single character
(but see below) \fInot\fR from the rest of the list.
.PP
If two characters in the list are separated by `\fB\-\fR',
this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
e.g.
\fB[0\-9]\fR
in ASCII matches any decimal digit.
Two ranges may not share an
endpoint, so e.g.
\fBa\-c\-e\fR
is illegal.
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal
\fB]\fR
or
\fB\-\fR
in the list,
the simplest method is to
enclose it in
\fB[.\fR and \fB.]\fR
to make it a collating element (see below).
Alternatively,
make it the first character
(following a possible `\fB^\fR'),
or (AREs only) precede it with `\fB\e\fR'.
Alternatively, for `\fB\-\fR',
make it the last character,
or the second endpoint of a range.
To use a literal
\fB\-\fR
as the first endpoint of a range,
make it a collating element
or (AREs only) precede it with `\fB\e\fR'.
With the exception of these, some combinations using
\fB[\fR
(see next
paragraphs), and escapes,
all other special characters lose their
special significance within a bracket expression.
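.PP
The placement rules for a literal \fB]\fR or \fB\-\fR can be sketched
as follows, assuming Tcl 8.3 or later and the default ARE flavor;
the sample strings and variable names are illustrative only:

```tcl
# ']' placed first in the list is an ordinary member of the list.
set first [regexp -inline {[]a]+} "a]b"]

# '-' placed last in the list is an ordinary member of the list.
set last [regexp -inline {[a-c-]+} "b-a-d"]

# In AREs a backslash also makes ']' ordinary inside the brackets.
set quoted [regexp -inline {[\]x]+} "x]y"]

puts "first=$first last=$last quoted=$quoted"
```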
.PP
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in
\fB[.\fR and \fB.]\fR
stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression in a locale that has
multi-character collating elements
can thus match more than one character.
.VS 8.2
So (insidiously), a bracket expression that starts with \fB^\fR
can match multi-character collating elements even if none of them
appear in the bracket expression!
(\fINote:\fR Tcl currently has no multi-character collating elements.
This information is only for illustration.)
.PP
For example, assume the collating sequence includes a \fBch\fR
multi-character collating element.
Then the RE \fB[[.ch.]]*c\fR (zero or more \fBch\fP's followed by \fBc\fP)
matches the first five characters of `\fBchchcc\fR'.
Also, the RE \fB[^c]b\fR matches all of `\fBchb\fR'
(because \fB[^c]\fR matches the multi-character \fBch\fR).
.VE 8.2
.PP
Within a bracket expression, a collating element enclosed in
\fB[=\fR
and
\fB=]\fR
is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were `\fB[.\fR'\&
and `\fB.]\fR'.)
For example, if
\fBo\fR
and
\fB\o'o^'\fR
are the members of an equivalence class,
then `\fB[[=o=]]\fR', `\fB[[=\o'o^'=]]\fR',
and `\fB[o\o'o^']\fR'\&
are all synonymous.
An equivalence class may not be an endpoint
of a range.
.VS 8.2
(\fINote:\fR
Tcl currently implements only the Unicode locale.
It doesn't define any equivalence classes.
The examples above are just illustrations.)
.VE 8.2
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in
\fB[:\fR
and
\fB:]\fR
stands for the list of all characters
(not all collating elements!)
belonging to that
class.
Standard character classes are:
.PP
.RS
.ne 5
.ta 3c
.nf
\fBalpha\fR	A letter.
\fBupper\fR	An upper-case letter.
\fBlower\fR	A lower-case letter.
\fBdigit\fR	A decimal digit.
\fBxdigit\fR	A hexadecimal digit.
\fBalnum\fR	An alphanumeric (letter or digit).
\fBprint\fR	An alphanumeric (same as alnum).
\fBblank\fR	A space or tab character.
\fBspace\fR	A character producing white space in displayed text.
\fBpunct\fR	A punctuation character.
\fBgraph\fR	A character with a visible representation.
\fBcntrl\fR	A control character.
.fi
.RE
.PP
A locale may provide others.
.VS 8.2
(Note that the current Tcl implementation has only one locale:
the Unicode locale.)
.VE 8.2
A character class may not be used as an endpoint of a range.
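.PP
A brief sketch of character classes inside bracket expressions,
assuming Tcl 8.3 or later; the identifier pattern and variable names
are illustrative only:

```tcl
# [[:digit:]]+ picks out each run of decimal digits.
set nums [regexp -all -inline {[[:digit:]]+} "a12b345"]

# A class can be combined with other list members, here an underscore.
set ident_ok  [regexp {^[[:alpha:]_][[:alnum:]_]*$} "foo_1"]
set ident_bad [regexp {^[[:alpha:]_][[:alnum:]_]*$} "1foo"]

puts "nums=$nums ident_ok=$ident_ok ident_bad=$ident_bad"
```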
.PP
There are two special cases of bracket expressions:
the bracket expressions
\fB[[:<:]]\fR
and
\fB[[:>:]]\fR
are constraints, matching empty strings at
the beginning and end of a word respectively.
'\" note, discussion of escapes below references this definition of word
A word is defined as a sequence of
word characters
that is neither preceded nor followed by
word characters.
A word character is an
\fIalnum\fR
character
or an underscore
(\fB_\fR).
These special bracket expressions are deprecated;
users of AREs should use constraint escapes instead (see below).
.SH ESCAPES
Escapes (AREs only), which begin with a
\fB\e\fR
followed by an alphanumeric character,
come in several varieties:
character entry, class shorthands, constraint escapes, and back references.
A
\fB\e\fR
followed by an alphanumeric character but not constituting
a valid escape is illegal in AREs.
In EREs, there are no escapes:
outside a bracket expression,
a
\fB\e\fR
followed by an alphanumeric character merely stands for that
character as an ordinary character,
and inside a bracket expression,
\fB\e\fR
is an ordinary character.
(The latter is the one actual incompatibility between EREs and AREs.)
.PP
Character-entry escapes (AREs only) exist to make it easier to specify
non-printing and otherwise inconvenient characters in REs:
.RS 2
.TP 5
\fB\ea\fR
alert (bell) character, as in C
.TP
\fB\eb\fR
backspace, as in C
.TP
\fB\eB\fR
synonym for
\fB\e\fR
to help reduce backslash doubling in some
applications where there are multiple levels of backslash processing
.TP
\fB\ec\fIX\fR
(where \fIX\fR is any character) the character whose
low-order 5 bits are the same as those of
\fIX\fR,
and whose other bits are all zero
.TP
\fB\ee\fR
the character whose collating-sequence name
is `\fBESC\fR',
or failing that, the character with octal value 033
.TP
\fB\ef\fR
formfeed, as in C
.TP
\fB\en\fR
newline, as in C
.TP
\fB\er\fR
carriage return, as in C
.TP
\fB\et\fR
horizontal tab, as in C
.TP
\fB\eu\fIwxyz\fR
(where
\fIwxyz\fR
is exactly four hexadecimal digits)
the Unicode character
\fBU+\fIwxyz\fR
in the local byte ordering
.TP
\fB\eU\fIstuvwxyz\fR
(where
\fIstuvwxyz\fR
is exactly eight hexadecimal digits)
reserved for a somewhat-hypothetical Unicode extension to 32 bits
.TP
\fB\ev\fR
vertical tab, as in C
.TP
\fB\ex\fIhhh\fR
(where
\fIhhh\fR
is any sequence of hexadecimal digits)
the character whose hexadecimal value is
\fB0x\fIhhh\fR
(a single character no matter how many hexadecimal digits are used).
.TP
\fB\e0\fR
the character whose value is
\fB0\fR
.TP
\fB\e\fIxy\fR
(where
\fIxy\fR
is exactly two octal digits,
and is not a
\fIback reference\fR (see below))
the character whose octal value is
\fB0\fIxy\fR
.TP
\fB\e\fIxyz\fR
(where
\fIxyz\fR
is exactly three octal digits,
and is not a
back reference (see below))
the character whose octal value is
\fB0\fIxyz\fR
.RE
.PP
Hexadecimal digits are `\fB0\fR'-`\fB9\fR', `\fBa\fR'-`\fBf\fR',
and `\fBA\fR'-`\fBF\fR'.
Octal digits are `\fB0\fR'-`\fB7\fR'.
.PP
The character-entry escapes are always taken as ordinary characters.
For example,
\fB\e135\fR
is
\fB]\fR
in ASCII,
but
\fB\e135\fR
does not terminate a bracket expression.
Beware, however, that some applications (e.g., C compilers) interpret
such sequences themselves before the regular-expression package
gets to see them, which may require doubling (quadrupling, etc.) the `\fB\e\fR'.
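.PP
A sketch of character-entry escapes as seen from a Tcl script, assuming
Tcl 8.3 or later.
Enclosing the RE in braces is the assumption that matters here: it keeps
Tcl from performing its own backslash substitution, so the RE package
receives the escapes intact.
The variable names are illustrative only:

```tcl
# The RE package itself interprets \t; the subject string gets its tab
# from Tcl's double-quote substitution.
set tab [regexp {^a\tb$} "a\tb"]

# \u followed by exactly four hexadecimal digits names a Unicode character.
set uni [regexp {^\u0041$} "A"]

# \135 is ']' in ASCII, but it is taken as an ordinary character and so
# does not terminate the bracket expression.
set br [regexp -inline {[\135x]+} "x]y"]

puts "tab=$tab uni=$uni br=$br"
```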
.PP
Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used
character classes:
.RS 2
.TP 10
\fB\ed\fR
\fB[[:digit:]]\fR
.TP
\fB\es\fR
\fB[[:space:]]\fR
.TP
\fB\ew\fR
\fB[[:alnum:]_]\fR
(note underscore)
.TP
\fB\eD\fR
\fB[^[:digit:]]\fR
.TP
\fB\eS\fR
\fB[^[:space:]]\fR
.TP
\fB\eW\fR
\fB[^[:alnum:]_]\fR
(note underscore)
.RE
.PP
Within bracket expressions, `\fB\ed\fR', `\fB\es\fR',
and `\fB\ew\fR'\&
lose their outer brackets,
and `\fB\eD\fR', `\fB\eS\fR',
and `\fB\eW\fR'\&
are illegal.
.VS 8.2
(So, for example, \fB[a-c\ed]\fR is equivalent to \fB[a-c[:digit:]]\fR.
Also, \fB[a-c\eD]\fR, which is equivalent to \fB[a-c^[:digit:]]\fR, is illegal.)
.VE 8.2
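.PP
The class shorthands, alone and inside a bracket expression, can be
sketched as follows, assuming Tcl 8.3 or later; the sample strings and
variable names are illustrative only:

```tcl
# \w is [[:alnum:]_], so the underscore stays inside a word
# while the hyphen splits one.
set words [regexp -all -inline {\w+} "foo_1 bar-2"]

# Inside brackets \d loses its outer brackets: [a-c\d] is [a-c[:digit:]].
set mixed [regexp -all -inline {[a-c\d]+} "ab12xd3"]

puts "words=$words mixed=$mixed"
```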
.PP
A constraint escape (AREs only) is a constraint,
matching the empty string if specific conditions are met,
written as an escape:
.RS 2
.TP 6
\fB\eA\fR
matches only at the beginning of the string
(see MATCHING, below, for how this differs from `\fB^\fR')
.TP
\fB\em\fR
matches only at the beginning of a word
.TP
\fB\eM\fR
matches only at the end of a word
.TP
\fB\ey\fR
matches only at the beginning or end of a word
.TP
\fB\eY\fR
matches only at a point that is not the beginning or end of a word
.TP
\fB\eZ\fR
matches only at the end of the string
(see MATCHING, below, for how this differs from `\fB$\fR')
.TP
\fB\e\fIm\fR
(where
\fIm\fR
is a nonzero digit) a \fIback reference\fR, see below
.TP
\fB\e\fImnn\fR
(where
\fIm\fR
is a nonzero digit, and
\fInn\fR
is some more digits,
and the decimal value
\fImnn\fR
is not greater than the number of closing capturing parentheses seen so far)
a \fIback reference\fR, see below
.RE
.PP
A word is defined as in the specification of
\fB[[:<:]]\fR
and
\fB[[:>:]]\fR
above.
Constraint escapes are illegal within bracket expressions.
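.PP
A sketch of the word-boundary constraint escapes, assuming Tcl 8.3 or
later; the sample string and variable names are illustrative only:

```tcl
set text "cat concat cats cat"

# \m and \M pin both ends, so only the stand-alone word "cat" matches.
set whole [regexp -all -inline {\mcat\M} $text]

# \y before "cat" requires a word boundary there; here that means the
# start of a word, which excludes the "cat" inside "concat".
set starts [regexp -all {\ycat} $text]

puts "whole=$whole starts=$starts"
```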
.PP
A back reference (AREs only) matches the same string matched by the parenthesized
subexpression specified by the number,
so that (e.g.)
\fB([bc])\e1\fR
matches
\fBbb\fR
or
\fBcc\fR
but not `\fBbc\fR'.
The subexpression must entirely precede the back reference in the RE.
Subexpressions are numbered in the order of their leading parentheses.
Non-capturing parentheses do not define subexpressions.
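.PP
The back-reference example above, plus a doubled-word search, can be
sketched as follows, assuming Tcl 8.3 or later; the anchors, sample
strings, and variable names are illustrative additions:

```tcl
# \1 must repeat exactly what ([bc]) matched.
set bb [regexp {^([bc])\1$} "bb"]
set bc [regexp {^([bc])\1$} "bc"]

# Find a word immediately repeated after a single space.
set dup [regexp -inline {\m(\w+) \1\M} "this is is a test"]

puts "bb=$bb bc=$bc dup=$dup"
```

The \fB\em\fR in the second RE keeps the match from starting at the
``is'' inside ``this''.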
.PP
There is an inherent historical ambiguity between octal character-entry
escapes and back references, which is resolved by heuristics,
as hinted at above.
A leading zero always indicates an octal escape.
A single non-zero digit, not followed by another digit,
is always taken as a back reference.
A multi-digit sequence not starting with a zero is taken as a back
reference if it comes after a suitable subexpression
(i.e. the number is in the legal range for a back reference),
and otherwise is taken as octal.
.SH "METASYNTAX"
In addition to the main syntax described above, there are some special
forms and miscellaneous syntactic facilities available.
.PP
Normally the flavor of RE being used is specified by
application-dependent means.
However, this can be overridden by a \fIdirector\fR.
If an RE of any flavor begins with `\fB***:\fR',
the rest of the RE is an ARE.
If an RE of any flavor begins with `\fB***=\fR',
the rest of the RE is taken to be a literal string,
with all characters considered ordinary characters.
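.PP
A sketch of the two directors, assuming a Tcl 8.1 or later \fBregexp\fR
command; the sample strings and variable names are illustrative only:

```tcl
# ***= makes the rest a literal string, so '.' matches only a dot.
set lit_dot [regexp {***=a.c} "a.c"]
set lit_any [regexp {***=a.c} "abc"]

# ***: forces the rest to be read as an ARE, so \d is a class shorthand.
set are [regexp {***:^\d+$} "42"]

puts "lit_dot=$lit_dot lit_any=$lit_any are=$are"
```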
.PP
An ARE may begin with \fIembedded options\fR:
a sequence
\fB(?\fIxyz\fB)\fR
(where
\fIxyz\fR
is one or more alphabetic characters)
specifies options affecting the rest of the RE.
These supplement, and can override,
any options specified by the application.
The available option letters are:
.RS 2
.TP 3
\fBb\fR
rest of RE is a BRE
.TP 3
\fBc\fR
case-sensitive matching (usual default)
.TP 3
\fBe\fR
rest of RE is an ERE
.TP 3
\fBi\fR
case-insensitive matching (see MATCHING, below)
.TP 3
\fBm\fR
historical synonym for
\fBn\fR
.TP 3
\fBn\fR
newline-sensitive matching (see MATCHING, below)
.TP 3
\fBp\fR
partial newline-sensitive matching (see MATCHING, below)
.TP 3
\fBq\fR
rest of RE is a literal (``quoted'') string, all ordinary characters
.TP 3
\fBs\fR
non-newline-sensitive matching (usual default)
.TP 3
\fBt\fR
tight syntax (usual default; see below)
.TP 3
\fBw\fR
inverse partial newline-sensitive (``weird'') matching (see MATCHING, below)
.TP 3
\fBx\fR
expanded syntax (see below)
.RE
.PP
Embedded options take effect at the
\fB)\fR
terminating the sequence.
They are available only at the start of an ARE,
and may not be used later within it.
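.PP
A sketch of two embedded options, \fBi\fR and \fBq\fR, assuming a
Tcl 8.1 or later \fBregexp\fR command; the sample strings and variable
names are illustrative only:

```tcl
# (?i) turns on case-insensitive matching for the rest of the RE.
set ci [regexp {(?i)hello} "HELLO"]
set cs [regexp {hello} "HELLO"]

# (?q) makes the rest of the RE a literal string.
set q_dot [regexp {(?q)a.c} "a.c"]
set q_any [regexp {(?q)a.c} "abc"]

puts "ci=$ci cs=$cs q_dot=$q_dot q_any=$q_any"
```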
.PP
In addition to the usual (\fItight\fR) RE syntax, in which all characters are
significant, there is an \fIexpanded\fR syntax,
available in all flavors of RE
with the \fB\-expanded\fR switch, or in AREs with the embedded x option.
In the expanded syntax,
white-space characters are ignored
and all characters between a
\fB#\fR
and the following newline (or the end of the RE) are ignored,
permitting paragraphing and commenting a complex RE.
There are three exceptions to that basic rule:
.RS 2
.PP
a white-space character or `\fB#\fR' preceded by `\fB\e\fR' is retained
.PP
white space or `\fB#\fR' within a bracket expression is retained
.PP
white space and comments are illegal within multi-character symbols
like the ARE `\fB(?:\fR' or the BRE `\fB\e(\fR'
.RE
.PP
Expanded-syntax white-space characters are blank, tab, newline, and
.VS 8.2
any character that belongs to the \fIspace\fR character class.
.VE 8.2
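.PP
A sketch of the expanded syntax using the embedded \fBx\fR option,
assuming a Tcl 8.1 or later \fBregexp\fR command.
The pattern, the sample string, and the variable names (including the
throwaway variable \fB\->\fR that receives the whole match) are
illustrative only:

```tcl
# White space and #-comments inside the RE are ignored under (?x).
set re {(?x)
    (\d{3})    # first group: three digits
    -          # a literal hyphen
    (\d{4})    # second group: four digits
}

set found [regexp $re "call 555-1234 now" -> first second]
puts "found=$found first=$first second=$second"
```

The same RE could be written in tight syntax as
\fB(\ed{3})\-(\ed{4})\fR.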
.PP
Finally, in an ARE,
outside bracket expressions, the sequence `\fB(?#\fIttt\fB)\fR'
(where
\fIttt\fR
is any text not containing a `\fB)\fR')
is a comment,
completely ignored.
Again, this is not allowed between the characters of
multi-character symbols like `\fB(?:\fR'.
Such comments are more a historical artifact than a useful facility,
and their use is deprecated;
use the expanded syntax instead.
.PP
\fINone\fR of these metasyntax extensions is available if the application
(or an initial
\fB***=\fR
director)
has specified that the user's input be treated as a literal string
rather than as an RE.
.SH MATCHING
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
its choice is determined by its \fIpreference\fR:
either the longest substring, or the shortest.
.PP
Most atoms, and all constraints, have no preference.
A parenthesized RE has the same preference (possibly none) as the RE.
A quantified atom with quantifier
\fB{\fIm\fB}\fR
or
\fB{\fIm\fB}?\fR
has the same preference (possibly none) as the atom itself.
A quantified atom with other normal quantifiers (including
\fB{\fIm\fB,\fIn\fB}\fR
with
\fIm\fR
equal to
\fIn\fR)
prefers longest match.
A quantified atom with other non-greedy quantifiers (including
\fB{\fIm\fB,\fIn\fB}?\fR
with
\fIm\fR
equal to
\fIn\fR)
prefers shortest match.
A branch has the same preference as the first quantified atom in it
which has a preference.
An RE consisting of two or more branches connected by the
\fB|\fR
operator prefers longest match.
.PP
Subject to the constraints imposed by the rules for matching the whole RE,
subexpressions also match the longest or shortest possible substrings,
based on their preferences,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that outer subexpressions thus take priority over
their component subexpressions.
.PP
Note that the quantifiers
\fB{1,1}\fR
and
\fB{1,1}?\fR
can be used to force longest and shortest preference, respectively,
on a subexpression or a whole RE.
.PP
Match lengths are measured in characters, not collating elements.
An empty string is considered longer than no match at all.
For example,
\fBbb*\fR
matches the three middle characters of `\fBabbbc\fR',
\fB(week|wee)(night|knights)\fR
matches all ten characters of `\fBweeknights\fR',
when
\fB(.*).*\fR
is matched against
\fBabc\fR
the parenthesized subexpression
matches all three characters, and
when
\fB(a*)*\fR
is matched against
\fBbc\fR
both the whole RE and the parenthesized
subexpression match an empty string.
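.PP
The first three examples in the paragraph above can be checked with a
short script, assuming Tcl 8.3 or later; the variable names are
illustrative only:

```tcl
# Earliest start, then longest: the three b's in the middle.
set m1 [regexp -inline {bb*} "abbbc"]

# The whole RE prefers the longest match, so the branches chosen are
# "wee" and "knights" rather than "week" and "night".
set m2 [regexp -inline {(week|wee)(night|knights)} "weeknights"]

# The earlier subexpression takes priority and claims all three characters.
set m3 [regexp -inline {(.*).*} "abc"]

puts "m1=$m1"
puts "m2=$m2"
puts "m3=$m3"
```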
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
so that
\fBx\fR
becomes `\fB[xX]\fR'.
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that
\fB[x]\fR
becomes
\fB[xX]\fR
and
\fB[^x]\fR
becomes `\fB[^xX]\fR'.
.PP
If newline-sensitive matching is specified, \fB.\fR
and bracket expressions using
\fB^\fR
will never match the newline character
(so that matches will never cross newlines unless the RE
explicitly arranges it)
and
\fB^\fR
and
\fB$\fR
will match the empty string after and before a newline
respectively, in addition to matching at beginning and end of string
respectively.
ARE
\fB\eA\fR
and
\fB\eZ\fR
continue to match beginning or end of string \fIonly\fR.
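.PP
A sketch of newline-sensitive matching using the embedded \fBn\fR
option, assuming Tcl 8.3 or later; the sample text and variable names
are illustrative only:

```tcl
set text "one\ntwo\nthree"

# With (?n), ^ and $ also match after and before each newline.
set line [regexp -inline {(?n)^t\w+$} $text]

# Without it, ^ matches only at the start of the string.
set plain [regexp {^t\w+$} $text]

# \A still means "beginning of the string" even under (?n).
set start [regexp {(?n)\At} $text]

# With (?n), '.' never matches a newline, so .* stops at end of line.
set dot_n   [regexp -inline {(?n)o.*} $text]
set dot_all [regexp -inline {o.*} $text]

puts "line=$line plain=$plain start=$start dot_n=$dot_n"
```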
.PP
If partial newline-sensitive matching is specified,
this affects \fB.\fR
and bracket expressions
as with newline-sensitive matching, but not
\fB^\fR
and `\fB$\fR'.
.PP
If inverse partial newline-sensitive matching is specified,
this affects
\fB^\fR
and
\fB$\fR
as with
newline-sensitive matching,
but not \fB.\fR
and bracket expressions.
This isn't very useful but is provided for symmetry.
.SH "LIMITS AND COMPATIBILITY"
No particular limit is imposed on the length of REs.
Programs intended to be highly portable should not employ REs longer
than 256 bytes,
as a POSIX-compliant implementation can refuse to accept such REs.
.PP
The only feature of AREs that is actually incompatible with
POSIX EREs is that
\fB\e\fR
does not lose its special
significance inside bracket expressions.
All other ARE features use syntax which is illegal or has
undefined or unspecified effects in POSIX EREs;
the
\fB***\fR
syntax of directors likewise is outside the POSIX
syntax for both BREs and EREs.
.PP
Many of the ARE extensions are borrowed from Perl, but some have
been changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include `\fB\eb\fR', `\fB\eB\fR',
the lack of special treatment for a trailing newline,
the addition of complemented bracket expressions to the things
affected by newline-sensitive matching,
the restrictions on parentheses and back references in lookahead constraints,
and the longest/shortest-match (rather than first-match) matching semantics.
.PP
The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner,
but don't work as hard at guessing the user's real intentions.)
.PP
Henry Spencer's original 1986 \fIregexp\fR package,
still in widespread use (e.g., in pre-8.1 releases of Tcl),
implemented an early version of today's EREs.
There are four incompatibilities between \fIregexp\fR's near-EREs
(`RREs' for short) and AREs.
In roughly increasing order of significance:
.PP
.RS
In AREs,
\fB\e\fR
followed by an alphanumeric character is either an
escape or an error,
while in RREs, it was just another way of writing the
alphanumeric.
This should not be a problem because there was no reason to write
such a sequence in RREs.
.PP
\fB{\fR
followed by a digit in an ARE is the beginning of a bound,
while in RREs,
\fB{\fR
was always an ordinary character.
Such sequences should be rare,
and will often result in an error because following characters
will not look like a valid bound.
.PP
In AREs,
\fB\e\fR
remains a special character within `\fB[\|]\fR',
so a literal
\fB\e\fR
within
\fB[\|]\fR
must be written `\fB\e\e\fR'.
\fB\e\e\fR
also gives a literal
\fB\e\fR
within
\fB[\|]\fR
in RREs,
but only truly paranoid programmers routinely doubled the backslash.
.PP
AREs report the longest/shortest match for the RE,
rather than the first found in a specified search order.
This may affect some RREs which were written in the expectation that
the first match would be reported.
(The careful crafting of RREs to optimize the search order for fast
matching is obsolete (AREs examine all possible matches
in parallel, and their performance is largely insensitive to their
complexity) but cases where the search order was exploited to deliberately
find a match which was \fInot\fR the longest/shortest will need rewriting.)
.RE
.SH "BASIC REGULAR EXPRESSIONS"
BREs differ from EREs in several respects.  `\fB|\fR', `\fB+\fR',
and
\fB?\fR
are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are
\fB\e{\fR
and `\fB\e}\fR',
with
\fB{\fR
and
\fB}\fR
by themselves ordinary characters.
The parentheses for nested subexpressions are
\fB\e(\fR
and `\fB\e)\fR',
with
\fB(\fR
and
\fB)\fR
by themselves ordinary characters.
\fB^\fR
is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression,
\fB$\fR
is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression,
and
\fB*\fR
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading `\fB^\fR').
Finally,
single-digit back references are available,
and
\fB\e<\fR
and
\fB\e>\fR
are synonyms for
\fB[[:<:]]\fR
and
\fB[[:>:]]\fR
respectively;
no other escapes are available.
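.PP
A sketch of BRE behavior reached through the embedded \fBb\fR option
described under METASYNTAX, assuming a Tcl 8.1 or later \fBregexp\fR
command; the sample strings and variable names are illustrative only:

```tcl
# In a BRE '+' is an ordinary character, not a quantifier.
set plus_lit [regexp {(?b)a+b} "a+b"]
set plus_rep [regexp {(?b)a+b} "aab"]

# Grouping uses \( and \), and single-digit back references work.
set grp [regexp {(?b)^\(ab\)\1$} "abab"]

# Bounds are delimited by \{ and \}.
set bound [regexp {(?b)^a\{2\}$} "aa"]

puts "plus_lit=$plus_lit plus_rep=$plus_rep grp=$grp bound=$bound"
```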
.SH "SEE ALSO"
RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
.SH KEYWORDS
match, regular expression, string