- .TH PCRETEST 1
- .SH NAME
- pcretest - a program for testing Perl-compatible regular expressions.
- .SH SYNOPSIS
- .B pcretest "[-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination]"
- \fBpcretest\fR was written as a test program for the PCRE regular expression
- library itself, but it can also be used for experimenting with regular
- expressions. This man page describes the features of the test program; for
- details of the regular expressions themselves, see the \fBpcre\fR man page.
- .SH OPTIONS
- .TP 10
- \fB-d\fR
- Behave as if each regex had the \fB/D\fR modifier (see below); the internal
- form is output after compilation.
- .TP 10
- \fB-i\fR
- Behave as if each regex had the \fB/I\fR modifier; information about the
- compiled pattern is given after compilation.
- .TP 10
- \fB-m\fR
- Output the size of each compiled pattern after it has been compiled. This is
- equivalent to adding \fB/M\fR to each regular expression. For compatibility with
- earlier versions of pcretest, \fB-s\fR is a synonym for \fB-m\fR.
- .TP 10
- \fB-o\fR \fIosize\fR
- Set the number of elements in the output vector that is used when calling PCRE
- to be \fIosize\fR. The default value is 45, which is enough for 14 capturing
- subexpressions. The vector size can be changed for individual matching calls by
- including \O in the data line (see below).
- .TP 10
- \fB-p\fR
- Behave as if each regex had the \fB/P\fR modifier; the POSIX wrapper API is used
- to call PCRE. None of the other options has any effect when \fB-p\fR is set.
- .TP 10
- \fB-t\fR
- Run each compile, study, and match 20000 times with a timer, and output the
- resulting time per compile or match (in milliseconds). Do not set \fB-t\fR with
- \fB-m\fR, because you will then get the size output 20000 times and the timing
- will be distorted.
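The link between the default vector size of 45 and the 14 capturing subexpressions mentioned under \fB-o\fR can be checked with a little arithmetic. The Python sketch below assumes the documented pcre_exec() convention that the first two thirds of the output vector hold (start, end) offset pairs and the last third is workspace; the helper name `max_capturing_groups` is hypothetical and is not part of pcretest.

```python
def max_capturing_groups(osize):
    """Return how many capturing subexpressions fit in a vector of osize ints.

    Assumption: two thirds of the vector hold (start, end) pairs, so the
    number of pairs is osize // 3, and pair 0 is used for the whole match.
    """
    pairs = osize // 3
    return max(pairs - 1, 0)


if __name__ == "__main__":
    print(max_capturing_groups(45))
```

Under that assumption, a vector of 45 elements gives 15 pairs: one for the whole match plus 14 for capturing subexpressions.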
- .SH DESCRIPTION
- If \fBpcretest\fR is given two filename arguments, it reads from the first and
- writes to the second. If it is given only one filename argument, it reads from
- that file and writes to stdout. Otherwise, it reads from stdin and writes to
- stdout, and prompts for each line of input, using "re>" to prompt for regular
- expressions, and "data>" to prompt for data lines.
- The program handles any number of sets of input on a single input file. Each
- set starts with a regular expression, and continues with any number of data
- lines to be matched against the pattern. An empty line signals the end of the
- data lines, at which point a new regular expression is read. The regular
- expressions are given enclosed in any non-alphameric delimiters other than
- backslash, for example
- /(a|bc)x+yz/
- White space before the initial delimiter is ignored. A regular expression may
- be continued over several input lines, in which case the newline characters are
- included within it. It is possible to include the delimiter within the pattern
- by escaping it, for example
- /abc\/def/
- If you do so, the escape and the delimiter form part of the pattern, but since
- delimiters are always non-alphameric, this does not affect its interpretation.
- If the terminating delimiter is immediately followed by a backslash, for
- example,
- /abc/\
- then a backslash is added to the end of the pattern. This is done to provide a
- way of testing the error condition that arises if a pattern finishes with a
- backslash, because
- /abc\/
- is interpreted as the first line of a pattern that starts with "abc/", causing
- pcretest to read the next line as a continuation of the regular expression.
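The delimiter rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour for a single input line, not the code pcretest uses; the function name `split_pattern` and its return convention (None when the closing delimiter has not been seen yet) are assumptions made for the example.

```python
def split_pattern(line):
    """Split one input line into (pattern, modifiers).

    Returns None when no closing delimiter is found, which is the case
    where the next line would be read as a continuation of the pattern.
    """
    text = line.lstrip()  # white space before the delimiter is ignored
    if not text:
        raise ValueError("empty line")
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric and not a backslash")
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2  # the escape and the escaped character both stay in the pattern
            continue
        if ch == delim:
            pattern, rest = text[1:i], text[i + 1:]
            if rest.startswith("\\"):
                # a backslash right after the closing delimiter is appended
                pattern += "\\"
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None


if __name__ == "__main__":
    for sample in ["/(a|bc)x+yz/", r"/abc\/def/", "/abc/\\", "/abc\\/"]:
        print(repr(sample), "->", split_pattern(sample))
```

The last sample returns None because the escaped slash is treated as part of a pattern beginning "abc/", so the pattern is still open at the end of the line.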
- .SH PATTERN MODIFIERS
- The pattern may be followed by \fBi\fR, \fBm\fR, \fBs\fR, or \fBx\fR to set the
- PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options,
- respectively. For example:
- /caseless/i
- These modifier letters have the same effect as they do in Perl. There are
- others which set PCRE options that do not correspond to anything in Perl:
- \fB/A\fR, \fB/E\fR, and \fB/X\fR set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and
- PCRE_EXTRA respectively.
- Searching for all possible matches within each subject string can be requested
- by the \fB/g\fR or \fB/G\fR modifier. After finding a match, PCRE is called
- again to search the remainder of the subject string. The difference between
- \fB/g\fR and \fB/G\fR is that the former uses the \fIstartoffset\fR argument to
- \fBpcre_exec()\fR to start searching at a new point within the entire string
- (which is in effect what Perl does), whereas the latter passes over a shortened
- substring. This makes a difference to the matching process if the pattern
- begins with a lookbehind assertion (including \b or \B).
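The difference between the two approaches can be illustrated with Python's re module standing in for PCRE. This is an analogy, not pcretest's implementation: the `pos` argument of `search()` plays the role of the start offset, string slicing plays the role of the shortened substring, and both helper names are hypothetical. Empty matches are handled here by simply advancing one character, which is cruder than what pcretest does.

```python
import re


def find_all_with_offset(pattern, subject):
    """/g style: search the entire subject again from a later offset."""
    found, pos = [], 0
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def find_all_with_slicing(pattern, subject):
    """/G style: search a shortened copy that starts after the last match."""
    found, rest = [], subject
    while True:
        m = pattern.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        rest = rest[cut:]
    return found


if __name__ == "__main__":
    pat = re.compile(r"\Bi(\w\w)")
    print(find_all_with_offset(pat, "Mississippi"))
    print(find_all_with_slicing(pat, "Mississippi"))
```

With the offset approach the second "iss" is found, because \B can still see the "s" that precedes it. With slicing, that "i" becomes the first character of the shortened string, \B fails there, and the match is skipped.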
- If any call to \fBpcre_exec()\fR in a \fB/g\fR or \fB/G\fR sequence matches an
- empty string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
- flags set in order to search for another, non-empty, match at the same point.
- If this second match fails, the start offset is advanced by one, and the normal
- match is retried. This imitates the way Perl handles such cases when using the
- \fB/g\fR modifier or the \fBsplit()\fR function.
- There are a number of other modifiers for controlling the way \fBpcretest\fR
- operates.
- The \fB/+\fR modifier requests that as well as outputting the substring that
- matched the entire pattern, pcretest should in addition output the remainder of
- the subject string. This is useful for tests where the subject contains
- multiple copies of the same substring.
- The \fB/L\fR modifier must be followed directly by the name of a locale, for
- example,
- /pattern/Lfr
- For this reason, it must be the last modifier letter. The given locale is set,
- \fBpcre_maketables()\fR is called to build a set of character tables for the
- locale, and this is then passed to \fBpcre_compile()\fR when compiling the
- regular expression. Without an \fB/L\fR modifier, NULL is passed as the tables
- pointer; that is, \fB/L\fR applies only to the expression on which it appears.
- The \fB/I\fR modifier requests that \fBpcretest\fR output information about the
- compiled expression (whether it is anchored, has a fixed first character, and
- so on). It does this by calling \fBpcre_fullinfo()\fR after compiling an
- expression, and outputting the information it gets back. If the pattern is
- studied, the results of that are also output.
- The \fB/D\fR modifier is a PCRE debugging feature, which also assumes \fB/I\fR.
- It causes the internal form of compiled regular expressions to be output after
- compilation.
- The \fB/S\fR modifier causes \fBpcre_study()\fR to be called after the
- expression has been compiled, and the results used when the expression is
- matched.
- The \fB/M\fR modifier causes the size of the memory block used to hold the compiled
- pattern to be output.
- The \fB/P\fR modifier causes \fBpcretest\fR to call PCRE via the POSIX wrapper
- API rather than its native API. When this is done, all other modifiers except
- \fB/i\fR, \fB/m\fR, and \fB/+\fR are ignored. REG_ICASE is set if \fB/i\fR is
- present, and REG_NEWLINE is set if \fB/m\fR is present. The wrapper functions
- force PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.
- The \fB/8\fR modifier causes \fBpcretest\fR to call PCRE with the PCRE_UTF8
- option set. This turns on the (currently incomplete) support for UTF-8
- character handling in PCRE, provided that it was compiled with this support
- enabled. This modifier also causes any non-printing characters in output
- strings to be printed using the \x{hh...} notation if they are valid UTF-8
- sequences.
- .SH DATA LINES
- Before each data line is passed to \fBpcre_exec()\fR, leading and trailing
- whitespace is removed, and it is then scanned for \ escapes. The following are
- recognized:
-   \a         alarm (= BEL)
-   \b         backspace
-   \e         escape
-   \f         formfeed
-   \n         newline
-   \r         carriage return
-   \t         tab
-   \v         vertical tab
-   \nnn       octal character (up to 3 octal digits)
-   \xhh       hexadecimal character (up to 2 hex digits)
-   \x{hh...}  hexadecimal UTF-8 character
-   \A         pass the PCRE_ANCHORED option to \fBpcre_exec()\fR
-   \B         pass the PCRE_NOTBOL option to \fBpcre_exec()\fR
-   \Cdd       call pcre_copy_substring() for substring dd
-              after a successful match (any decimal number
-              less than 32)
-   \Gdd       call pcre_get_substring() for substring dd
-              after a successful match (any decimal number
-              less than 32)
-   \L         call pcre_get_substringlist() after a
-              successful match
-   \N         pass the PCRE_NOTEMPTY option to \fBpcre_exec()\fR
-   \Odd       set the size of the output vector passed to
-              \fBpcre_exec()\fR to dd (any number of decimal
-              digits)
-   \Z         pass the PCRE_NOTEOL option to \fBpcre_exec()\fR
- When \O is used, the value may be higher or lower than the size set by the \fB-o\fR
- option (or defaulted to 45); \O applies only to the call of \fBpcre_exec()\fR
- for the line in which it appears.
- A backslash followed by anything else just escapes the anything else. If the
- very last character is a backslash, it is ignored. This gives a way of passing
- an empty line as data, since a real empty line terminates the data input.
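The escape handling described in this section can be sketched as a small Python decoder. It is an illustration under stated assumptions rather than pcretest's own code: the function name `decode_data_line` and the shape of the returned dictionary are hypothetical, a missing number after \C, \G or \O is assumed to mean 0, octal values are assumed to be truncated to one byte, and the \x{hh...} form is left out for brevity.

```python
SIMPLE = {"a": 7, "b": 8, "e": 27, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11}
FLAGS = {"A": "PCRE_ANCHORED", "B": "PCRE_NOTBOL",
         "N": "PCRE_NOTEMPTY", "Z": "PCRE_NOTEOL"}
OCTAL = "01234567"
HEX = "0123456789abcdefABCDEF"
DECIMAL = "0123456789"


def take(text, i, allowed, limit=None):
    """Return (run, next_index) for the run of allowed characters at i."""
    j = i
    while j < len(text) and text[j] in allowed and (limit is None or j - i < limit):
        j += 1
    return text[i:j], j


def decode_data_line(line):
    """Decode one data line into subject bytes plus the requested options."""
    text = line.strip()  # leading and trailing whitespace is removed first
    data = bytearray()
    result = {"options": [], "extract": [], "osize": None}
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            data += ch.encode("utf-8")
            continue
        if i == len(text):
            break  # a backslash as the very last character is ignored
        esc = text[i]
        i += 1
        if esc in SIMPLE:
            data.append(SIMPLE[esc])
        elif esc in OCTAL:
            digits, i = take(text, i - 1, OCTAL, 3)
            data.append(int(digits, 8) & 0xFF)  # assumption: keep one byte
        elif esc == "x":
            digits, i = take(text, i, HEX, 2)
            data.append(int(digits or "0", 16))
        elif esc in FLAGS:
            result["options"].append(FLAGS[esc])
        elif esc in "CG":
            digits, i = take(text, i, DECIMAL)
            result["extract"].append((esc, int(digits or "0")))
        elif esc == "L":
            result["extract"].append(("L", None))
        elif esc == "O":
            digits, i = take(text, i, DECIMAL)
            result["osize"] = int(digits or "0")
        else:
            data += esc.encode("utf-8")  # anything else is taken literally
    result["data"] = bytes(data)
    return result


if __name__ == "__main__":
    print(decode_data_line(r"abc\n\x41\101\A\O6\C1"))
```

A line consisting of a single backslash decodes to empty subject data, which is how an empty subject can be passed without ending the data input.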
- If \fB/P\fR was present on the regex, causing the POSIX wrapper API to be used,
- only \fB\B\fR and \fB\Z\fR have any effect, causing REG_NOTBOL and REG_NOTEOL
- to be passed to \fBregexec()\fR respectively.
- The use of \x{hh...} to represent UTF-8 characters is not dependent on the use
- of the \fB/8\fR modifier on the pattern. It is recognized always. There may be
- any number of hexadecimal digits inside the braces. The result is from one to
- six bytes, encoded according to the UTF-8 rules.
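The "one to six bytes" figure comes from the original UTF-8 scheme, which covers values up to 0x7FFFFFFF. The Python sketch below shows that encoding; it is written by hand because Python's built-in encoder stops at four bytes. The function names `utf8_encode` and `encode_braced_hex` are hypothetical and only illustrate the rule described above.

```python
import re

# (exclusive upper limit, total bytes, lead-byte marker) for multi-byte forms
RANGES = [
    (0x800, 2, 0xC0),
    (0x10000, 3, 0xE0),
    (0x200000, 4, 0xF0),
    (0x4000000, 5, 0xF8),
    (0x80000000, 6, 0xFC),
]


def utf8_encode(value):
    """Encode an integer as one to six bytes using the original UTF-8 rules."""
    if value < 0:
        raise ValueError("negative value")
    if value < 0x80:
        return bytes([value])
    for limit, length, lead in RANGES:
        if value < limit:
            out = bytearray(length)
            # fill continuation bytes from the end, six bits at a time
            for i in range(length - 1, 0, -1):
                out[i] = 0x80 | (value & 0x3F)
                value >>= 6
            out[0] = lead | value
            return bytes(out)
    raise ValueError("value too large for six-byte UTF-8")


def encode_braced_hex(escape):
    """Turn a string such as \\x{20ac} into its UTF-8 bytes."""
    m = re.fullmatch(r"\\x\{([0-9A-Fa-f]+)\}", escape)
    if m is None:
        raise ValueError("not a \\x{hh...} escape")
    return utf8_encode(int(m.group(1), 16))


if __name__ == "__main__":
    print(encode_braced_hex(r"\x{20ac}").hex())
```

For values up to 0x10FFFF the result agrees with Python's own `str.encode("utf-8")`; the five-byte and six-byte forms exist only in the older definition.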
- .SH OUTPUT FROM PCRETEST
- When a match succeeds, pcretest outputs the list of captured substrings that
- \fBpcre_exec()\fR returns, starting with number 0 for the string that matched
- the whole pattern. Here is an example of an interactive pcretest run.
- $ pcretest
- PCRE version 2.06 08-Jun-1999
- re> /^abc(\d+)/
- data> abc123
- 0: abc123
- 1: 123
- data> xyz
- No match
- If the strings contain any non-printing characters, they are output as \0x
- escapes, or as \x{...} escapes if the \fB/8\fR modifier was present on the
- pattern. If the pattern has the \fB/+\fR modifier, then the output for
- substring 0 is followed by the rest of the subject string, identified by
- "0+" like this:
- re> /cat/+
- data> cataract
- 0: cat
- 0+ aract
- If the pattern has the \fB/g\fR or \fB/G\fR modifier, the results of successive
- matching attempts are output in sequence, like this:
- re> /\Bi(\w\w)/g
- data> Mississippi
- 0: iss
- 1: ss
- 0: iss
- 1: ss
- 0: ipp
- 1: pp
- "No match" is output only if the first match attempt fails.
- If any of the sequences \fB\C\fR, \fB\G\fR, or \fB\L\fR are present in a
- data line that is successfully matched, the substrings extracted by the
- convenience functions are output with C, G, or L after the string number
- instead of a colon. This is in addition to the normal full list. The string
- length (that is, the return from the extraction function) is given in
- parentheses after each string for \fB\C\fR and \fB\G\fR.
- Note that while patterns can be continued over several lines (a plain ">"
- prompt is used for continuations), data lines may not. However, newlines can be
- included in data by means of the \n escape.
- .SH AUTHOR
- Philip Hazel <ph10@cam.ac.uk>
- .br
- University Computing Service,
- .br
- New Museums Site,
- .br
- Cambridge CB2 3QG, England.
- .br
- Phone: +44 1223 334714
- Last updated: 15 August 2001
- .br
- Copyright (c) 1997-2001 University of Cambridge.