资源说明:Memory-based shallow parser for Python
-----
## Documentation
- [Introduction](#Introduction)
- [Installation instructions](#installation)
- [The parser](#parser)
- [The tokenizer](#tokenizer)
- [The lemmatizer](#lemmatizer)
- [The PP-attacher](#pp-attacher)
- [Parse trees](#tree)
- [Clients and servers](#server)
- [Configuration](#configuration)
- [Command-line interface](#command_line)
- [Extending MBSP](#customization)
- [Exporting to XML, NLTK, GraphViz](#export)
- [Licensing](#license)
-----
## Introduction
### Quick overview
MBSP parses a string of characters into words and sentences, and determines the grammatical structure of the sentence.
It is a Python module, so you'll need [Python](http://www.python.org/download/) to run it (already installed on Mac OS X).
The module uses a client-server architecture for performance.
It includes binaries ([TiMBL](http://ilk.uvt.nl/timbl/), [MBT](http://ilk.uvt.nl/mbt/) and [MBLEM](http://ilk.uvt.nl/mbma/)) precompiled for Mac OS X, so on Mac it works out-of-the-box.
Otherwise, if you're on a Unix system, the module has a setup.py file that should compile everything for you. Go to the terminal and type:
```
cd MBSP
python setup.py
```
If that doesn't work you'll need to follow the steps in the [installation instructions](/pages/MBSP#installation).
Put the MBSP folder in the same folder as your Python script and import the module. By default, the servers are configured to start automatically. Once they are up and running you can use the `parse()` command to analyze texts:
```python
import MBSP
print MBSP.parse('cats with hats')
>>> cats/NNS/I-NP/O/O/A1/cat with/IN/I-PP/B-PNP/O/P1/with hats/NNS/I-NP/I-PNP/O/P1/hat
```
Each word has been *tagged* with grammatical information. For example, MBSP determined that *cats* is a plural noun (NNS). It has a prepositional noun phrase (PNP) attached to it (A1 is the anchor of P1), so the *hats* go *with* the *cats*.
For a human this might seem pretty straightforward, but consider that without any analysis, for a machine the sentence is just a sequence of characters with no meaning.
The tag codes may seem cryptic at first, but consider that it is more concise to say NNS than PLURAL NOUN over and over.
The tag codes are common in natural language processing, it's a good idea to [get acquainted with them](/pages/MBSP-tags).
Something went wrong? Probably the servers didn't have enough time to start:
```python
MBSP.start(timeout=120)
print(MBSP.parse('cats with hats'))
```
The output of the `parse()` command is a tagged string that can be manipulated in many ways.
With the `split()` command it can be transformed into a tree of linked Python objects:
```python
s = MBSP.parse('black cats with striped hats')
s = MBSP.split(s)
for sentence in s:
for chunk in sentence.chunks:
print([word.lemma for word in chunk.words], chunk.attachments)
>>> [u'black', u'cat'] [Chunk('with striped hats/PNP')]
>>> [u'with'] []
>>> [u'striped', u'hat'] []
```
With the `xml()` command it can be transformed into an [XML](http://en.wikipedia.org/wiki/XML) string for processing outside of Python:
```python
s = parse('black cats with striped hats')
print xml(s)
```
### Purpose
MBSP stands for "Memory-Based Shallow Parser". [Shallow
parsing](/pages/mbsp-shallow-parsing) (i.e. automatic discovery of a
sentence constituents) is an important component of many text analysis
systems, in applications such as information extraction and summary
generation. The Memory-Based Learning (MBL) approach has the advantage
of avoiding the need for manual definition of patterns (for example,
using regular expression syntax) and of being reusable across different
corpora and sublanguages.
MBSP is a so-called *lazy learner*: it keeps all the initial training
data available (including exceptions which may sometimes be productive).
This technique has been shown to achieve higher accuracy than *eager*
(or *greedy*) methods for many language processing tasks. For the Wall
Street Journal corpus (WSJ), accuracy (Fβ=1) is 96.4% for part-of-speech
tagging, 93.8% for NP chunking, 94.7% for
VP chunking, 77.1% for
SBJ detection, 79.0% for
OBJ detection, and 82.7% for
PP-attachment. MBSP is based on the IB1-IG
and IGTREE algorithms bundled in our MBL software package, called TiMBL.
Reference: Daelemans,
W., Buchholz, S., & Veenstra, J. (1999).
*Memory-Based Shallow Parsing*. In: Proceedings of CoNLL, Bergen,
Norway.
The parser provides functionality for tokenization and sentence
splitting, part-of-speech tagging, chunking, relation finding,
prepositional phrase attachment and lemmatization.
- **Tokenization**: splits sentence periods and punctuation marks from
words.
- **Tagging**: assigns part-of-speech tags to words (e.g. *cat* → noun
→ NN, *eat* → verb →
VB).
- **Chunking**: assigns chunk tags to groups of words (e.g. *the black
cat* → noun phrase → NP).
- **Relation finder**: finds relations between chunks, sentence
subject, object and predicates.
- **PNP finder**: finds prepositional noun phrases (e.g. *under the
table*).
- **PP-attachment**: finds prepositional noun phrase anchors (e.g.
*eat pizza* → *with fork*).
- **Lemmatization**: finds word lemmata (e.g. *was* → *be*).
### Grammar basics
Sentences are made up of words. Words have a syntactic role (noun, verb,
adjective, ...) depending on their location in the sentence. For
example, *can* can be a verb or a noun, depending on the context (*the
can*, *I can*).
- **Sentence**: the basic unit of writing, expected to have a subject
and a predicate.
- **Word**: a string of characters that expresses a meaningful
concept.
- **Token**: a specific word with grammatical tags: *the
can*/NN, *I
can*/VB.
- **Chunk**: a group of words (phrase) that contains a single thought
(e.g. *a sumptuous banquet*).
- **Head**: the word that determines the syntactic type of the chunk:
*the black cat* →
NP.
- **Subject**: the person/thing *doing* or *being*, usually a noun
phrase (NP):
*the cat is black*.
- **Predicate**: the remainder of the sentence tells us what the
subject does: *the cat
sits on the mat*.
- **Clause**: subject + predicate.
- **Argument**: a chunk that is related to a verb in a clause, i.e.
subject and object.
- **Object**: the person/thing affected by the action: *the cat eats
fish*. Poor fish.
- **Preposition**: temporal, spatial or logical relationship: *the cat
sits on the mat*.
- **Copula**: a word used to link subject and predicate, typically the
verb *to be*.
- **Lemma**: canonical form of a word: *run*, *runs*, *running* are
part of a lexeme, *run* is the lemma.
- **POS**: part-of-speech, the syntactic role that a word or phrase
plays in a sentence, e.g. adjective =
JJ.
### Acknowledgements
This version of MBSP has been developed by the computational linguistics
group of CLiPS (Computational Linguistics & Psycholinguistics,
department of Linguistics, University of Antwerp, Belgium) on the basis
of earlier versions developed at the University of Antwerp and Tilburg
University.
Contributing authors: Walter Daelemans, Jakub Zavrel, Sabine Buchholz,
Jorn Veenstra, Antal van den Bosch, Ko van der Sloot, Bertjan Busser,
Erik F. Tjong Kim Sang, Jo Meyhi, Vincent Van Asch, Tom De Smedt.
For reference you can use:
De Smedt T., Van Asch V. & Daelemans, W. (2010).
*Memory-based Shallow Parser for Python*.
CLiPS Technical Report Series (CTRS), vol. 2.
ISSN 2033-3544.
\[[PDF](http://www.clips.ua.ac.be/sites/default/files/ctrs-002.pdf)\]
-----
## Installation
MBSP is a [Python](http://www.python.org/download/) module. On Mac OS X,
the Python programming language is already installed. On other systems
you need to download and install it yourself, if necessary. MBSP works
with Python 2.5, but support for processes is better in version 2.6. It
should also work with version 2.4.
MBSP is bundled with three required dependencies written in C/C++ (TiMBL
6.1.5, MBT 3.1.3 and MBLEM). Binaries have been precompiled for Mac OS X
10.5, but these may not work on your machine. In that case you need to
compile binaries manually from the source code.
### Compiling with setup.py
The module comes with a `setup.py` script
that compiles the C/C++ binaries automatically. If this works for you,
you're in luck – no manual compilation is necessary. Also, if you run
`setup.py` with the
`install` argument, it will first compile
the binaries and then install a copy of MBSP in Python's
`/site-packages` folder so that the
module is available in any Python script.
```
> cd MBSP
> python setup.py install
```
### Compiling from source
You'll need a [gcc](http://gcc.gnu.org/) compiler. On Windows you'll
need [cygwin](http://www.cygwin.com/).
In the cygwin installer (`setup.exe`), be
sure to select the "`devel`" packages for
installation.
#### Building MBLEM
- Go to the `MBSP/mblem` folder.
- Delete all files with a "`.o`"
extension + the current executable binary
`mblem_english_bmt`.
- From the command line, do `make` in
the `MBSP/mblem` folder:
```
> cd MBSP/mblem
> make
```
#### Building TiMBL
- Go to the `MBSP/timbl` folder.
- Uncompress the source code from the
`timbl-6.1.5.tar` archive.
- From the command line, do `configure`
and `make` in the
`MBSP/timbl/timbl-6.1.5` folder:
```
> cd MBSP/timbl/timbl-6.1.5
> ./configure --enable-shared=no --enable-static=no --prefix=[FOLDER]
> make install
```
- `[FOLDER]` is an absolute path to
the folder where Timbl will be built.
- The Timbl executable will be in
`[FOLDER]/bin`
→ copy it to `MBSP/timbl`.
- Now build MBT in the same `[FOLDER]`
location:
#### Building MBT
- Go to the `MBSP/mbt `folder.
- Uncompress the source code from the
`mbt-3.1.3.tar` archive.
- From the command line, do `configure`
and `make` in the
`MBSP/mbt/mbt-3.1.3` folder:
```
> cd MBSP/mbt/mbt-3.1.3
> ./configure --enable-shared=no --enable-static=no --prefix=[FOLDER]
> make install
```
- The Mbt executable will be in `[FOLDER]` → copy it to `MBSP/mbt`.
- Delete the build `[FOLDER]`, it is no longer needed.
### Module folder location
To be able to `import MBSP` in your
scripts, Python needs to know where the module is located.
There are three basic ways to accomplish this:
- Put the MBSP folder in the same folder as your script.
- Put the MBSP folder in the standard location for modules so it is
available to *all* scripts.
The standard location depends on your operating system, for
example:
`/Library/Python/2.5/site-packages/`
on Mac,
`/usr/lib/python2.5/site-packages/`
on Unix,
c:\\python25\\Lib\\site-packages\\
on Windows.
- Add the location of MBSP to the
`sys.path` list in your script,
before importing it:
```python
>>> MODULE = '/users/tom/desktop/MBSP'
>>> import sys; if MODULE not in sys.path: sys.path.append(MODULE)
>>> import MBSP
```
### Memory usage
MBSP starts four data servers that require quite a bit of memory
(`CHUNK`: 80MB,
`LEMMA`: 10MB,
`RELATION`: 160MB,
`PREPOSITION`: 210MB). Only the
`CHUNK` server (which gives you the
part-of-speech tags) is mandatory. The optional servers can be disabled
in `config.py` to reduce the memory
usage, for example:
```python
servers = ['chunk', 'lemma']
```
### Multithreading
MBSP can be configured to work with multithreading, which can increase
performance by 25% - 200%.
`MBSP.config.threading` needs to be set
to `True`. You also need to build the
newer TiMBL 6.3+, MBT 3.2+ and TimblServer 2+ from
[source](http://ilk.uvt.nl/). The installation instructions are mostly
the same:
- TimblServer can be compiled in the same way as MBT.
- Put the `TimblServer` executable in
`MBSP/timbl` instead of the
`Timbl` executable.
Older systems may complain that
`pkg-config` is outdated. In this case,
before building Timbl, compile pkg-config 0.25+
() from source with:
sudo ./configure; sudo make; sudo make
install
We have precompiled versions of TimblServer which may also work on your
system (in that case you don't need to install from source): download
for [Mac OS X 10.5](/media/MBSP-TimblServer1.0.0-OSX10.5.zip) | [Ubuntu
10.04 (64-bit)](/media/MBSP-TimblServer2.0.0-Ubuntu10.04.zip).
-----
## Parser
MBSP uses a client-server architecture. This way, the corpus data is
loaded only once (during server startup) and stays available while the
servers sleep in the background. Before tagging jobs can be sent to the
parser, the servers have to be started. By default, this will happen
automatically when you import MBSP in your script. Otherwise, the
`start()` command starts the four servers
(named `CHUNK`,
`LEMMA`,
`RELATION` and
`PREPOSITION`). The
`started()` command yields True if a
given server has been started. The
`stop()` command will stop the servers.
```python
MBSP.start(timeout=60)
```
```python
MBSP.started(name=ALL) # CHUNK | LEMMA | RELATION | PREPOSITION
```
```python
MBSP.stop()
```
The `parse()` command takes a string of
sentences and returns a tagged Unicode string.
Sentences in the output are separated by newline characters.
```python
MBSP.parse(string,
tokenize = True,
tags = True,
chunks = True,
relations = True,
anchors = True,
lemmata = True,
encoding = 'utf-8')
```
For example:
```python
print(MBSP.parse('I ate pizza with a fork.'))
>>> I/PRP/I-NP/O/NP-SBJ-1/O/i
ate/VBD/I-VP/O/VP-1/A1/eat
pizza/NN/I-NP/O/NP-OBJ-1/O/pizza
with/IN/I-PP/B-PNP/O/P1/with
a/DT/I-NP/I-PNP/O/P1/a
fork/NN/I-NP/I-PNP/O/P1/fork ././O/O/O/O/.
```
Each token (i.e. tagged word) in a sentence has a number of annotations:
`tags=True`
includes the word part-of-speech tag,
`chunks` the chunk tag +
PNP tag (prepositional noun phrase),
`relations` the chunk relation tag,
`anchors` the
PNP anchor tag. With
`tokenize` set to False, no tokenization
is carried out (so the input string is expected to be tokenized). The
`encoding` parameter defines the
character encoding of the input string, "utf-8" is fine in most cases.
### Parser tags
Let's examine the word *ate* and the tags assigned by the parser in the
example above:
| | | | | | | |
| :--: | :------------------------------: | :------------------------------: | :---------------------------: | :------------------------------: | :----------------------------: | :---: |
| word | part-of-speech | chunk | pnp | relation | anchor | lemma |
| ate | VBD | I-VP | O | VP-1 | A1 | eat |
The word's part-of-speech tag is VBD, which
means that it is a verb in the past tense. The word occurs in a
VP chunk, a verb phrase. It is not part of a
prepositional noun phrase. It's relation tag is
VP-1, which means it is linked to the words
tagged as NP-SBJ-1 (*I*, the sentence
subject) and NP-OBJ-1 (*pizza*, the sentence
object). Its anchor tag is A1, meaning it is
the anchor of the prepositional noun phrase
P1, *with a fork*. How did I eat pizza? →
with a fork. The base form or lemma of *ate* is *eat*.
Common part-of-speech tags include NN
(noun), JJ (adjective) and
VB (verb).
Common chunk tags include NP (noun phrase)
and VP (verb phrase).
Common relations include SBJ (subject) and
OBJ (object).
The [MBSP tags](/pages/MBSP-tags) page gives an overview of all the
possible tags generated by the parser.
A description and an example can also be acquired with the
`taginfo()` command:
```python
>>> description, example = MBSP.taginfo('NN')
>>> print(description, example)
('noun, singular or mass', 'tiger, chair, laughter')
```
### Parser shortcuts
Below is a set of concise commands that internally call
`parse()` with the required parameters.
- The `tag()` command returns a string
annotated with part-of-speech tags.
- The `chunk()` command returns a
string annotated with part-of-speech, chunk and
PNP tags.
- The `lemma()` command takes a single
word and returns its lemma.
- The `nouns()`,
`verbs()` or
`adjectives()`
` ` command returns a list of nouns,
verbs or adjectives retrieved from the given string.
```python
MBSP.tokenize(string)
```
```python
MBSP.tag(string, tokenize=True, lemmata=False)
```
```python
MBSP.chunk(string, tokenize=True, lemmata=False)
```
```python
MBSP.lemma(word)
```
```python
MBSP.nouns(string, lemmatize=False)
MBSP.verbs(string, lemmatize=False)
MBSP.adjectives(string, lemmatize=False)
```
Like `parse()`, all of these commands
have an optional `encoding` parameter
that is "utf-8" by default.
### Parser output
The output of the `parse()` command is a
string of sentences in which each token has been annotated with the
requested tags. The `pprint()` command
(extra *p* is for *pretty*) gives a good overview of the tags:
```python
s = MBSP.parse('I eat pizza.')
MBSP.pprint(s)
>>> WORD TAG CHUNK ROLE ID PNP ANCHOR LEMMA
I PRP NP SBJ 1 - - i
eat VBP VP - 1 - - eat
pizza NN NP OBJ 1 - - pizza
. . - - - - - .
```
The output string is in fact a
`TokenString` object that behaves as a
Python string, but with extra functionality. Most notably it has a
`TokenString.split()` method that yields
a `TokenList` object: a list of
sentences, where each sentence is a list of tokens, in which each token
is a list of tags. This is useful when you want to [extend the
parser](/pages/MBSP#customization) and need to make some modifications
to the output.
If you want to analyze the output (i.e. examine the relations between
words and groups of words), it is more convenient to construct a [parse
tree](/pages/MBSP#tree) from the output.
-----
## Tokenizer
MBSP includes a regular expressions-based tokenizer that divides
sequences of characters into words and sentences. Special care is given
to punctuation marks. We need to guess which punctuation marks are part
of a word (e.g. periods in abbreviations) and which mark word and
sentence breaks. Once tokenized, the [parser](/pages/MBSP#parser) can
then determine categories for words (tokens) in the string.
```python
MBSP.tokenizer.split(string, tags=False, citations=False, replace={}, ignore=[])
```
The `split()` command returns a list of
Unicode strings where punctuation marks have been split from words as
individual tokens. Each string in the list is a sentence. By default,
all SGML-tags (i.e. anything wrapped in \<...\>) are stripped from the
input string. With `citations=True`,
sentences in quoted citations are kept together. The
`replace` dictionary is used to map
unicode quotes and ellipsis to standard quotes and periods. The
`ignore` list defines ranges of words
that require special attention (for example, abbreviations and URL's –
see below).
For example:
```python
MBSP.tokenizer.split('The U.N. is considering banning "defamation of religion." The U.N. president said such a move would not limit free speech.')
[u'The U.N. is considering banning " defamation of religion . "',
u'The U.N. president said such a move would not limit free speech .']
```
### Tokenization process
In the simplest case, words are marked by spaces. A number of exceptions
are then handled:
- **Handle missing space**: punctuation inside a word can indicate a
missing space.
- **Handle punctuation**: punctuation marks at the head or tail of a
word can indicate a sentence break, the start of a citation, the
start of an explanation in parenthesis. In this case the punctuation
is split from the word. Punctuation that is part of the word (e.g.
*5p.m.*) needs to be kept intact.
- **Handle contractions**: apostrophes in contractions (*he's*) and
possessives (*father's*) mark word boundaries. The suffix is split
from the word.
- **Handle lists**: list item markers such as *1. 1) \* - a. a)* at
the start of a new line indicate the start of a new sentence, even
if the previous sentence did not end with a period.
- **Handle hyphenation**: hyphens at the end of a line mark words that
have been split across lines: *mar-* + *ket* becomes *market*,
*Great-* + *Britain* becomes *Great-Britain*.
- **Handle sentence breaks**: periods, exclamation marks, question
marks and ellipsis end a sentence if they are followed by a
capitalized letter. Parenthesis and quotes can be part of the
sentence even if the period precedes it.
### Tokenizer word ranges
The tokenizer defines a `Range` class for
matching sets of words: all abbreviations, all numeric strings, all
hyperlinks, etc. The `Range` class is a
list of known words enriched with regular expression patterns. In the
example below, an abbreviation range is created that matches *U.K.*,
*U.N.* and any single letter abbreviation (as in *T. De Smedt*).
```python
abbreviations = MBSP.tokenizer.Range(['U.K.', 'U.N.'])
abbreviations.patterns.append(re.compile('^[A-Za-z]\.$'))
print 'U.K.' in abbreviations, 'T.' in abbreviations
>>> True True
```
Custom ranges can be passed to the
`ignore` parameter of the
`split()` command.
By default, `split()` will use a list of
predefined ranges:
```python
MBSP.tokenizer.ignore = [abbreviations, numeric, URI, entities, biomedical]
```
Predefined ranges:
- **Abbreviations**: the simple rule is that every point is a sentence
break. This is 93.2% correct for Brown corpus
\[[1](http://bulba.sdsu.edu/~malouf/ling571/13-token-bw.pdf)\].
Fixing decimal points, single letter abbreviations, alternating
letters and capital letter followed by consonants (*Dept.*) improves
sentence break correctness to 97.7%. Additionally, the abbreviations
range defines a list of other well-known abbreviations.
- **Numeric**: matches anything starting with a digit followed by a
chain of digits and .,:/ separators. The range will also recognize
units of measurement (length, mass, volume, time, epoch,
temperature, storage capacity, data transfer rate, percentage):
US$100, 1.2MB, 31/12/2010, ...
- **URI**: matches URL's, links and e-mail addresses.
- **Entities**: matches HTML and Unicode entities.
- **Biomedical**: guesses biomedical-specific words, such as
*1',2'-trifenol*.
### Tokenizer modes
The tokenizer can be configured in different modes. By default it uses
Penn Treebank specifications for tokenization and uses special
corrections for biomedical use.
```python
MBSP.tokenizer.PENN_TREEBANK = True
MBSP.tokenizer.BIOMEDICAL = True
```
In Penn Treebank mode, only % is split from numbers and contractions are
not substituted: *won't* becomes *wo n't* instead of *will not*. In
biomedical mode, less magic is used when finding missing spaces so that
*3(R),3a(S),6a(R)-bis-tetrahydrofuranylurethane* is not split at the
comma or the parenthesis.
-----
## Lemmatizer
MBSP uses the MBLEM C module for word lemmatization. MBLEM (Memory-Based
Lematization) transforms a word form into a canonical lexical form:
*was* → *be*, *booking* → *book*.
```python
>>> print MBSP.lemmatize("The cats were sleeping.", tokenize=True)
the cat be sleep .
```
The goal of lemmatization is to provide better (generalizable) lexical
information that can be used by other components in the pipeline, and
(possibly) also for indexing the processed text in a search engine.
Lemmatization is not to be confused with *stemming*, in which frequent
suffixes such as *-ing* and *-ed* are simply removed: *having* is
stemmed to *hav*, while it is lemmatized to *have*.
### Lemmatization process
Words can have more than one lemma. The lemmatizer's task is to find the
most suitable:
- **Find lemmata**: determines for each word in a sentence which
lemmata it could be mapped to. The word *saw* is lemmatized to the
noun lemma *saw* and the verb lemma *see*. In case of producing more
than one lemma, the lemmatizer is unable to determine which lemma is
appropriate in the current sentence, as it does not use context.
- **Disambiguate**: consults the output of the part-of-speech tagger.
If the tagger has identified *saw* as a past-tense verb form
(VBD), MBLEM concludes that the
appropriate lemma is *see*, and selects this one as the correct
output. MBLEM therefore always generates a single lemma per word.
MBLEM is trained on the CELEX English lexical database, and will simply
retrieve the lemmatizations of words that occur in the database. Words
in the text that are not in CELEX (so-called "unknown" words, typically
constituting 5% of the words in a text) are lemmatized by analogy to
stored word lemmatizations.
Reference:
Van den Bosch, A., & Daelemans, W. (1999).
*Memory-based morphological analysis*, In: Proceedings of the 37th
annual meeting of the Association for Computational Linguistics on
Computational Linguistics, pp. 285-292.
-----
## PP-attacher
A shallow parsing approach sometimes has its shortcomings, an important
one being that prepositional phrases, which contain important semantic
information for interpreting events (e.g. *I sleep* vs. *I sleep
under a bridge*), are
left unattached. MBSP comes with a memory-based PP-attacher (MBPA)
trained on sections 2 through 21 of the Penn Treebank II Wall Street
Journal corpus (WSJ). It will link PNP
chunks to other “anchor” chunks.
With the PP-attacher enabled, performance is slower but you gain a lot
of useful information.
```python
s = MBSP.parse('I sleep under a bridge', anchors=True)
s = MBSP.split(s)
print(s[0].pnp)
print(s[0].pnp[0].anchor)
>>> [Chunk('under a bridge/PNP')]
>>> sleep
```
### PP-attachment process
A PNP-chunk can be attached to different
candidate chunks. However, there is a semantic difference between *I eat
pizza with olives* and
*I eat pizza
with olives*. Which one
is correct? And which one is correct here: *I
eat pizza
with a fork* or *I eat
pizza with a fork*?
- **Find prepositional noun phrases**: these are retrieved by a
regular expression-like algorithm. All
PP + NP
sequences are considered to be PNP’s.
Two exceptions: PP + **“** +
NP (e.g. *in “modest amounts”*) and
PP + VBG +
NP (*in making paper*) are also
considered PNP’s.
- **Classify**: the core of the PP-attacher is a memory-based
classifier. Candidate anchors are the
NP’s and
VP’s of the sentence that are not part
of the PNP itself. For example, *I eat a
pizza with olives* induces three classification tasks (*I-with,
eat-with*, *pizza-with*) in which the machine learner will have to
decide if the pair suggests a true anchor or not – taking into
account the distance to the candidate anchor and intermediary
punctuation, the number of intermediary
NP’s, and so on.
- **Heuristic decision making**: when the classifier identifies
multiple anchor candidates, an extra step is taken to pick one
unique anchor, using a baseline and entropy algorithm (details are
summarized in the reference paper).
In 50% of the cases a noun phrase (NP) is
chosen as anchor, in about 45% a verb phrase
(VP).
Reference:
Van Asch, V., & Daelemans, W. (2009).
*Prepositional Phrase Attachment in Shallow Parsing*. In: Proceedings of
the 7th International Conference on Recent Advances in Natural Language
Processing (RANLP), pp. 12-17.
-----
## Parse trees
A parse tree stores a parsed string as a network of linked Python
objects that can be traversed to analyze the sentences in the string.
The output of the [parser](/pages/MBSP#parser) can be passed to the
`split()` command, which produces a
`Text` object. Essentially, a
`Text` is a list of
`Sentence` objects. Each
`Sentence` consists of
`Word` objects.
`Word` objects are also grouped in
`Chunk` objects, which are related to
other `Chunk` objects in various ways.
```python
MBSP.split(parsed_string, token=[WORD, POS, CHUNK, PNP, RELATION, ANCHOR, LEMMA])
```
We'll run the sentence "*The cat sat on the mat.*" through the parse
tree:
```python
s = MBSP.parse('The cat sat on the mat.')
s = MBSP.split(s)
print(repr(s))
[Sentence(
"""The/DT/B-NP/O/NP-SBJ-1/O/the
cat/NN/I-NP/O/NP-SBJ-1/O/cat
sat/VBD/B-VP/O/VP-1/A1/sit
on/IN/B-PP/B-PNP/PP-CLR/P1/on
the/DT/B-NP/I-PNP/NP-CLR/P1/the
mat/NN/I-NP/I-PNP/NP-CLR/P1/mat
././O/O/O/O/.""")]
```
```python
print(s[0].chunks)
>>> [Chunk('The cat/NP-SBJ-1'),
Chunk('sat/VP-1'),
Chunk('on/PP-CLR'),
Chunk('the mat/NP-CLR')]
```
### Text
A `Text` is a list of
`Sentence` objects.
```python
text = MBSP.Text(parsed_string, token=[WORD, POS, CHUNK, PNP, REL, ANCHOR, LEMMA])
```
```python
text = MBSP.Text.from_xml(xml)
```
```python
text.string # 'The cat sat on the mat .'
text.sentences # [Sentence('The cat sat on the mat .')]
text.append(sentence)
text.copy()
```
```python
text.xml
```
Since `Text` behaves as a Python list it is
easy to traverse all the contained sentences:
```python
for sentence in text:
print(sentence)
```
### Sentence
A `Sentence` is a list of
`Word` objects, with attributes and
methods that organize words in `Chunk`
objects.
```python
sentence = MBSP.Sentence(string="", token=[WORD,POS,CHUNK,PNP,REL,ANCHOR,LEMMA])
```
```python
sentence = MBSP.Sentence.from_xml(xml)
```
```python
sentence.parent # Slices refer to the sentence they are part of.
sentence.id # Unique for each sentence.
sentence.start # 0
sentence.stop # 6
sentence.token # [WORD, POS, CHUNK, PNP, REL, ANCHOR, LEMMA]
```
```python
sentence.string # 'The cat sat on the mat .'
sentence.words # [Word('The/DT'), Word('cat/NN'), ... ]
sentence.lemmata # [u'the', u'cat', u'sit', u'on', u'the', u'mat', u'.']
sentence.tagged # [(u'The', u'DT'), (u'cat', u'NN'), ... ]
sentence.parts_of_speech # [u'DT', u'NN', u'VBD', u'IN', u'DT', u'NN', u'.']
```
```python
sentence.chunks # [Chunk('The cat/NP-SBJ-1'), Chunk('sat/VP-1'), ... ]
sentence.subjects # [Chunk('The cat/NP-SBJ-1')]
sentence.objects # []
sentence.verbs # [Chunk('sat/VP-1')]
sentence.relations # {'SBJ': {1: Chunk('the cat/NP-SBJ-1')},
# 'VP': {1: Chunk('sat/VP-1')},
# 'OBJ': {}}
```
```python
sentence.pnp # [Chunk('on the mat/PNP')]
sentence.anchors # [Chunk('sat/VP-1')]
```
```python
sentence.constituents(pnp=False)
```
```python
sentence.get(index, tag=LEMMA)
sentence.loop(tag1, tag2, ...)
sentence.indexof(value, tag=WORD)
```
```python
sentence.slice(start, stop)
sentence.copy()
```
```python
sentence.xml
sentence.nltk_tree()
```
- `Sentence.constituents()` returns an
in-order list of `Word` and
`Chunk` objects.
With `pnp=True`, also groups
into `PNPChunk` objects whenever
possible.
- Sentence.get() returns
the requested tag of the word at the given index.
The tag can be `WORD`,
`LEMMA`,
`POS`,
`CHUNK`,
`PNP`,
`RELATION`,
`ROLE`,
`ANCHOR` or a custom word tag.
- `Sentence.loop()` is an iterator over
a list of tuples containing the requested tags for each word.
- `Sentence.indexof()` returns the
indices of words for which the given tag equals the given value.
- For example, Sentence.indexof("NN\*",
tag=POS) returns a list of indices of the words whose
part-of-speech is NN,
NNS, NNP or
NNPS.
- `Sentence.slice() `returns a new,
partial sentence starting with the word at index
`start` and containing all the words
up to (before) index `stop`.
#### Creating a sentence
Normally, a sentence is constructed from the output of the
`parse()` command. Since this output is a
[TokenString](/pages/MBSP#extending) which stores the order in which
tags appear in a token, `Sentence` can
figure out how to construct a parse tree by itself. If you want to
construct sentences from a different source, you need to specify the
`token` parameter in the constructor, or
use `Sentence.append()` to add words
manually. If you have tokens in a slash-formatted string like MBSP (e.g.
"cats/NNS/I-NP/O/O/O/cat")
you can use `Sentence.parse_token()`,
which returns the arguments for
`Sentence.append()`.
```python
sentence.append(word,
lemma = None,
type = None,
chunk = None,
role = None,
relation = None,
pnp = None,
anchor = None,
iob = None,
custom = {})
```
```python
sentence.parse_token(token, tags=[WORD, POS, CHUNK, PNP, REL, ANCHOR, LEMMA])
```
For example:
```python
s = MBSP.Sentence('cats/NNS', token=[MBSP.WORD, MBSP.POS])
print(s.words)
>>> [Word('cats/NNS')]
```
```python
s = MBSP.Sentence()
s.append(word='cats', lemma='cat', type='NNS', chunk='NP')
print(s.words)
>>> [Word('cats/NNS')]
```
```python
s = MBSP.Sentence()
s.append(*s.parse_token('cats/NNS/I-NP/O/O/O/cat'))
print(s.words)
>>> [Word('cats/NNS')]
```
### Sentence words
A `Sentence` is made up of
`Word` objects, which are also grouped in
`Chunk` objects:
```python
word = MBSP.Word(sentence, string, lemma=None, type=None, index=0)
```
```python
word.sentence # Sentence object (e.g. 'The cat sat on the mat .')
word.index # 2
word.string # 'sat'
word.lemma # 'sit'
word.type # 'VBD'
word.chunk # Chunk('sat/VP-1').
word.pnp # None
```
```python
word.custom_tags # User-defined tags, e.g. {SENTIMENT: 'lazy'}
```
```python
word.tags() # ['sat', 'VBD', 'B-VP', 'O', 'VP-1', 'O', 'sit']
```
- `Word.type` is an alias for
`Word.part_of_speech`.
- `Word.pnp` is an alias for
`Word.prepositional_noun_phrase`.
- The `word.custom_tags` is a
dictionary of additional, user-defined tags that can occur when the
parser has been [extended](/pages/MBSP#customization). If the word
has (for example) an additional *sentiment* tag, it is also
available as `Word.sentiment`.
- The `Word.tags()` method returns a
list of token tags as they appear in the output of the parser.
The order of tags is determined by the
`Sentence.token` attribute.
### Sentence chunks
A `Chunk` is a list of
`Word` objects that belong together.
Chunks can be part of a `PNPChunk`, which
starts with a PP chunk followed by
NP chunks.
```python
chunk = MBSP.Chunk(sentence, words=[], type=None, role=None, relation=None)
```
```python
chunk.sentence # Sentence object (e.g. 'The cat sat on the mat .')
chunk.start # 0
chunk.stop # 2
```
```python
chunk.string # 'The cat'
chunk.words # [Word('The/DT'), Word('cat/NN')]
chunk.lemmata # ['the', 'cat']
chunk.tagged # [(u'The', u'DT'), (u'cat', u'NN')]
chunk.head # Word('cat/NN')
```
```python
chunk.type # 'NP'
chunk.role # 'SBJ'
chunk.relation # 1
chunk.relations # [(1, u'SBJ')]
chunk.related # [Chunk('sat/VP-1')]
chunk.subject # None
chunk.object # None
chunk.verb # Chunk('sat/VP-1')
chunk.modifiers # []
chunk.conjunctions # []
```
```python
chunk.pnp # None
chunk.anchor # None
chunk.attachments # [Chunk('on the mat/PNP')]
```
```python
chunk.append(word)
```
```python
chunk.previous(type=None) # None
chunk.next(type=None) # Chunk('sat/VP-1')
chunk.nearest(type="VP") # Chunk('sat/VP-1')
```
- `Chunk.head` yields the last (i.e.
principal) `Word` in the chunk.
- `Chunk.relations` contains *all* relations the chunk is involved in.
Some chunks (about 15%) have multiple relations, for example functioning as both SBJ and OBJ, or being the OBJ of multiple VP chunks.
- For VP chunks, `Chunk.modifiers is a list of nearby adjectives and adverbs with no relations.
For example: in '*the cat really wants out*', *really* and *out* are ADVP with no relations.
The parse tree will assume that they have something to do with the VP *wants*.
What does the cat want? → out.
How badly does the cat want out? → really.
- `Chunk.conjunctions is a list of chunks linked by "and" or "or" to this chunk.
For example: in '*going up and down*', the up chunk has conjunctions: \[(Chunk('down'), AND)\].
- `Chunk.pnp is an alias for the
parent `Chunk.prepositional_noun_phrase`.
- `Chunk.attachments contains related
prepositional noun phrases
- `Chunk.anchor references the chunk
that is the anchor of this PNP.
### Prepositional noun phrases
`PNPChunk is a subclass of
`Chunk, it has the same attributes and
methods.
It groups multiple chunks in a prepositional noun phrase
(PNP).
```python
pnp = MBSP.PNPChunk(sentence, words=[], type=None, role=None, relation=None)
```
```python
pnp.string # 'on the mat'
pnp.chunks # [Chunk('on/PP-CLR'), Chunk('the mat/NP-CLR')]
pnp.preposition # Chunk('on/PP-CLR')
pnp.anchor # Chunk('sat/VP-1')
```
```python
pnp.guess_anchor() # Returns the nearest VP chunk.
```
Words and chunks that are part of a PNP will
have their `Word.pnp and
`Chunk.pnp attribute set.
All the prepositional noun phrases in a sentence can be retrieved with
`Sentence.pnp.
-----
## Clients and servers
MBSP uses a client-server architecture so that the data only has to be loaded once.
From then on it is available as a server that can be contacted with the TCP protocol. MBSP has a `Client and a `Server class that communicate using the Python `socket module.
`Server is a wrapper around `subprocess.Popen and starts TiMBL or MBT as a background process.
MBSP starts four servers:
| | | | |
| -------------------------------------------------- | -------------- | --------------------------------------- | --------------------------------------------------------------------- |
| Name | Address | Process | Description |
| `MBSP.CHUNK | localhost:6061 | MBT | Server for finding part-of-speech and chunk tags. |
| `MBSP.LEMMA | localhost:6062 | TiMBL | Server used by [MBLEM](http://ilk.uvt.nl/mbma/) to find word lemmata. |
| `MBSP.RELATION | localhost:6063 | TiMBL | Server for finding verb-argument chunk relations. |
| `MBSP.PREPOSITION | localhost:6064 | TiMBL | Server for finding PNP anchors. |
**TiMBL**: memory-based learning is an elegantly simple and robust
machine-learning method applicable to a wide range of tasks in Natural
Language Processing.
**MBT**: a memory-based tagger-generator and tagger in one. MBT can, for instance, be used to generate part-of-speech taggers or chunkers for natural language processing.
### Client
The `Client base class can be used to contact a TiMBL or MBT server running at a given host address, at a given TCP communication port.
`Client is then subclassed with `Timbl and Mbt classes.
```python
client = MBSP.Client(
host = LOCALHOST,
port = 6060,
name = None,
log = False,
request = lambda v:v.strip()+'\n',
response = lambda v:v)
```
```python
client.name
client.host
client.port
client.connected # True after Client.connect() is called.
```
```python
client.connect()
client.send(request, timeout=None)
client.disconnect()
```
- `Client.connect() raises a
`ServerConnectionError if the server
can't be contacted.
- `ServerConnectionError.code can
contain additional information, such as
`CONNECTION_RESET_BY_PEER,
`CONNECTION_REFUSED or
`BROKEN_PIPE.
- `Client.send() prepares the request,
sends it to the server and returns the response.
`Client.send() raises a
`ClientDisconnectedError if
`Client.connect() is not called
first.
If `timeout` is given and no response
is returned in the given time, a
`ClientTimeoutError` is raised.
- Make sure to clean up with
`Client.disconnect() when the client
is no longer needed.
#### Request and response formatters
Note the `request` and `response` parameters in the
`Client` constructor. These are formatter
functions that will be called by
`Client.send()` to 1) prepare the data
before sending it to the server and 2) to clean up the server response
before returning it.
For example, if we send a raw request to the chunk server:
```python
>>> chunker = MBSP.Client(port=6061)
>>> print chunker.send('cat')
cat/NN/I-NP -utt-
```
The MBT server's response ends with a delimiter which we may want to
clean from the output.
This can be achieved by using a response formatter:
```python
>>> chunker = MBSP.Client(port=6061, response=lambda v: v.rstrip(' -utt-\n'))
>>> print chunker.send('cat')
cat/NN/I-NP
```
Note:
in reality, MBT's delimiter is a *utt* tag (so *utt* enclosed in \< and
\>) but in this example
we use *-utt-* instead to avoid the browser
thinking it's a HTML-tag.
### TiMBL and MBT client
MBSP has a `Timbl` and a `Mbt` class: subclasses of `Client`, with the right request and response formatters.
```python
timbl = MBSP.Timbl(host=LOCALHOST, port=6060, name=None, log=False, verbosity=[])
```
```python
mbt = MBSP.Mbt(host=LOCALHOST, port=6060, name=None, log=False)
```
TiMBL's output depends on the way the server *verbosity*
(`+v`) is configured. In MBSP it
typically it yields a category string:
"`CATEGORY {NP-SBJ} `". Depending on the
verbosity it can also yield different output. For example, verbosity
option `+v di+db `yields a **
category-distance-distribution string:
"CATEGORY {VP} DISTRIBUTION { VP 3.11506, n-VP
2.58232 } DISTANCE {2.02066} ".
The TiMBL client has an additional
`verbosity` parameter which can be used
to format the response:
```python
>>> timbl = MBSP.Timbl(port=6064, verbosity=['di','db'])
>>> print timbl.send('0 0 2 - eat VBP pizza NN with fork NN 1 0')
['VP', 2.0206599999999999, {'VP': 3.1150600000000002, 'n-VP': 2.5823200000000002}]
```
The server at port 6064 is the preposition server, which loads with
option `+v di+db`, so responses include
distance and distribution metrics. The TiMBL client in this example sets
the right verbosity options so that distance is correctly parsed as a
float, and distribution as a dictionary.
See the [TiMBL
manual](http://ilk.uvt.nl/downloads/pub/papers/Timbl_6.2_Manual.pdf) for
all verbosity options.
### Client batch requests
Since the servers are multithreaded, multiple request can be sent at the
same time to speed up the lookup process. MBSP's client module has a
`client.batch()` command to achieve this.
It takes a list of requests (i.e. instances) and a "client definition".
This definition is a tuple with the necessary parameters that allows the
`batch()` command to generate threaded
`Client` objects. The
`client.define()` command can be used to
generate the tuple.
```python
MBSP.client.batch(instances, client, timeout=None, retries=1)
```
```python
MBSP.client.define(client, host=LOCALHOST, port=6060, name=None, log=False)
```
For example:
```python
>>> from MBSP.client import batch, define, Mbt
>>> print batch(['cat', 'sat', 'mat'], client=define(Mbt, port=6061))
['cat/NN/I-NP', 'sat/VBN/I-VP', 'mat/NN/I-NP']
```
### Client request log
If `config.log=True`, all client requests
and server responses are logged. This is useful for evaluation. Also,
when logs are enabled, a cached request can be reused without having to
contact the server. The performance hit is negligible however – except
for very repetitive texts.
The log is a dictionary indexed by server name
(`CHUNK`,
`LEMMA`,
`RELATION` and
`PREPOSITION`).
Each individual server log in the dictionary is an ordered dictionary
that stores the last 1000 requests.
```python
MBSP.client.log
```
For example:
```python
>>> MBSP.config.log = True
>>> MBSP.parse('I eat pizza with a fork.')
>>> print MBSP.client.log[MBSP.RELATION]
{'-1 0 0 eat VBP - - - - - - - I PRP NP VBP VP ?': 'CATEGORY {NP-SBJ}\n',
'1 0 0 eat VBP - - - eat VBP VP - pizza NN NP NN PNP ?': 'CATEGORY {NP-OBJ}\n',
'2 0 0 eat VBP eat VBP VP pizza NN NP - fork NN PNP - - ?': 'CATEGORY {-}\n'}
```
Three requests were sent to the relation server to find the sentence
SBJ and OBJ.
More complex sentences require more lookup requests and more complex
request with more instance features require more time.
### Server
MBSP has a `Server` class that is used to
start TiMBL or MBT as a background process. A server can be queried by
creating a client. Typically, servers run locally on your own machine
(e.g. at "localhost") but they can also be configured to work with an
IP-address over a network. You can also create new servers from scratch
and work with their responses in Python code.
```python
server = MBSP.Server(
name,
host = LOCALHOST,
port = 6060,
process = TIMBL,
ping = None,
features = {})
```
```python
server.name # LEMMA
server.host # LOCALHOST
server.port # 6062
server.process # TIMBL
server.features # {'-f': '/models/em.data', '-m': 'M', '-w': 2, '-k-: 5}
server.ping # ('c = = = = = = = = = = = = = = = = = = = B ?\n',
# 'CATEGORY {ABB-X}\n')
```
```python
server.program # The shell command used to start the process.
server.pid # Process id, needed when stop() is called.
```
```python
server.started # True when ping request yields ping response.
```
```python
server.start(timeout=60)
server.stop()
```
- `Server.features` specifies the
training data file and various options (see the [TiMBL
manual](http://ilk.uvt.nl/downloads/pub/papers/Timbl_6.2_Manual.pdf)).` `
- `Server.start()` raises a
`ServerTimeoutError` when the server
doesn't start in the given time.
`Server.start()` will continually
check `Server.started` until it
returns True.
- `Server.started` determines if the
server is up and running by sending a ping request.
- `Server.ping` is a tuple with a
sample request and its desired (e.g. correct) response.
If it is None, `Server.started` just
sends "`x ?`" and accepts whatever
response.
Sending a sample request and validating the response is the only
safe way to figure out you are addressing the right server.
#### Stopping a server
When a server is stopped, the background process is killed. The
`Server.pid` process id is stored as a
temporary file so that MBSP can retrieve which servers are up and
running across Python sessions – you don't need to stop and restart a
server each time you run a new Python script.
If the process id is lost (for example, the system did a cleanup of
temporary files), the server background process will have to be killed
manually. The server module has a
`force_quit()` that terminates all
running TiMBL and MBT processes:
```python
MBSP.server.force_quit(processes=(TIMBL, MBT))
```
This only works on Unix-systems though. Otherwise, the TiMBL and MBT
processes will have to be terminated manually from the Windows Task
Manager or Mac OS X's Activity Monitor.
MBSP can also be [configured](/pages/MBSP#configuration) to
automatically stop its servers when Python exits.
-----
## Configuration
A number of settings in config.py can be adjusted. Most notably, IP
addresses for the [servers](/pages/MBSP#server)
(`CHUNK`,
`LEMMA`,
`RELATION`,
`PREPOSITION`) can be defined in the
`config.hosts` list, so MBSP can be
configured to work over a network.
By default, the servers will start automatically when MBSP is imported.
Servers are always started at localhost, i.e. the
`config.hosts` list contains the
addresses needed for the *clients*. So if your MBSP is set up to contact
servers over a network,
`config.autostart` should be set to False
so that no servers are started on your own machine. Servers can also be
configured to stop automatically when Python exits, by setting
`config.autostop`
to True.
The `config.log` option controls whether
client requests and the servers' responses are logged in the
`MBSP.client.log` dictionary.
### config.py options
-----
## Command-line interface
The parser can be run from the command-line by invoking mbsp.py inside
the MBSP library folder. The parsed output is printed in the terminal
window. Note that the servers will not be started automatically, so you
have to start them manually:
```
> cd MBSP
> python mbsp.py start
> python mbsp.py parse -f camelot.txt
> python mbsp.py parse -s "It's only a model."
> python mbsp.py parse xml -s 'It is a silly place.'
> python mbsp.py stop
```
### Command-line options
If no options are given a full parse is executed (i.e. tokenization,
tagging, chunking, relations, prepositions and lemmata). Otherwise, you
need to explicitly list every required option:
| | | |
| ------------------------------------ | -------------------------------------------- | ---------------------------------------------- |
| \-O | \--tokenize | Tokenize the input. |
| `-T | --tags ` | Parse part-of-speech tags. |
| `-C | --chunks ` | Parse chunks and PNP tags. |
| `-R | --relations` | Find verb/predicate relations. |
| `-A | --anchors` | Find PP-attachments. |
| `-L | --lemmata ` | Find word lemmata. |
| `-f | --file` | Input filename. |
| \-s | `--string ` | Input string. |
| `-e | --encoding` | Specify character encoding (utf-8 by default). |
| `-v ` | \--version | Current version of MBSP. |
Short options can be concatenated, e.g.:
python mbsp.py parse -OTL -f
camelot.txt
-----
## Extending MBSP
### Event framework
MBSP has a built-in event framework that offers a convenient way to
customize the module. For example, each completed step in the
[parser](/pages/MBSP#parser) fires an event. A user-defined event
handler can be assigned to an event, instead of messing around in the
source code.
Server event handlers take a `Server`
object as input. Events are fired when a server is registered in a
`Servers` group (see below), or when a
server is successfully started or stopped.
```python
def event_handler(server):
return None
```
```python
MBSP.events.server.on_register = event_handler
MBSP.events.server.on_start = event_handler
MBSP.events.server.on_stop = event_handler
```
Parser event handlers take a
`TokenString` object (see below) as input
and return the modified `TokenString`.
Events are fired when each step in the parsing process is completed (for
example, tokenization is done).
```python
def event_handler(tokenstring):
return tokenstring
```
```python
MBSP.events.parser.on_tokenize = event_handler
MBSP.events.parser.on_parse_tags_and_chunks = event_handler
MBSP.events.parser.on_parse_prepositions = event_handler
MBSP.events.parser.on_parse_relations = event_handler
MBSP.events.parser.on_parse_pp_attachments = event_handler
MBSP.events.parser.on_lemmatize = event_handler
```
### Customizing the parser
Suppose you need to correct some part-of-speech tags. Since the
part-of-speech tag influences the relation and anchor tag annotation,
you'd need to do this somewhere halfway down the parsing process. This
is possible by injecting an event handler – without modifying the source
code and thus keeping it easy to install package updates:
```python
>>> def my_tagger_chunker(tokenstring):
>>> T = tokenstring.split()
>>> i = T.tags.index(MBSP.PART_OF_SPEECH)
>>> for sentence in T:
>>> for token in sentence:
>>> print token[i] # modify POS-tag here
>>> return T.join()
>>>
>>> MBSP.events.parser.on_parse_tags_and_chunks = my_tagger_chunker
```
Since the output of the event handler is then processed further, it
needs to maintain the order of tags and it can't delete tags (they are
still needed by the parser to determine other tags).
### Customizing the parser output
The output of the `parse()` command is a
string of sentences in which each token has been annotated with the
requested tags. This string is in fact a
`TokenString` object that behaves as a
Python string, but with extra functionality. Most notably, it has a
`TokenString.split()` method that yields
a `TokenList` object: a list of
sentences, where each sentence is a list of tokens, in which each token
is a list of tags.
```python
tokenstring = MBSP.TokenString(string, tags=[WORD])
```
```python
tokenstring.tags # [WORD, POS, CHUNK, PNP, RELATION, ANCHOR, LEMMA]
tokenstring.split(sep=TOKENS) # Yields a TokenList.
tokenstring.copy()
```
```python
tokenlist = MBSP.TokenList(sentences, tags)
```
```python
tokenlist.tags
tokenlist.tags.insert(i, tag, values=None)
tokenlist.tags.append(tag, values=None)
tokenlist.tags.remove(tag)
tokenlist.tags.pop(i)
tokenlist.tags.swap(tag1, tag2)
```
```python
tokenlist.join() # Yields a TokenString.
tokenlist.copy()
tokenlist.reduce(tags=[])
tokenlist.filter(
word = None,
tag = None,
chunk = None,
relation = None,
anchor = None,
lemma = None)
```
The `TokenString` facilitates the
conversion from a tagged string to a list – where tokens can then be
manipulated. The `TokenList.tags` list
not only stores the order of tags in each token, it also keeps the
tokens in synch. If you delete an item from
`TokenList.tags`, all tokens in each
sentence are automatically modified as well. In the same way,
`TokenList.tags.append()` has an
additional `values` parameter where you
can enter the values of the new tag for each token in each sentence
( by default,
O).
For example:
```python
>>> s = MBSP.parse('I eat pizza')
>>> s = s.split()
>>> s.tags.append('semantic', values=[['person', 'action', 'food']])
>>> s = s.join()
>>> print s
I/PRP/I-NP/O/NP-SBJ-1/O/i/person eat/VBP/I-VP/O/VP-1/O/eat/action
pizza/NN/I-NP/O/NP-OBJ-1/O/pizza/food
```
When you create a [parse tree](/pages/MBSP#tree) from this output, the
*semantic* tag will end up in
`Word.custom_tags`.
If you want to insert a tag from one
`TokenString` into another, you can use
`TokenString1.tags.pop()`. It extracts
(and removes) the tag value from each token. You can use this return
value for the `values` parameter in the
other `TokenString2.tags.append()`.
`TokenList.reduce()` returns a copy with
the given tags removed from all tokens.
TokenList.filter() returns a list of all tokens that match the
given criteria. Wildcards can be used at the head, tail or middle of a
constraint:
```python
>>> s = MBSP.parse('I eat pizzas with a fork.')
>>> print s.split().filter(tag="NN*")
[[u'pizzas', u'NNS', u'I-NP', u'O', u'NP-OBJ-1', u'O', u'pizza'],
[u'fork', u'NN', u'I-NP', u'I-PNP', u'O', u'P1', u'fork']]
```
### Customizing the servers
MBSP's servers are grouped in a `Servers`
object: a Python list with some additional functionality:
- `Servers` can automatically start and
stop a [server](/pages/MBSP#server) (i.e.
`Server` object) that is added to the
group.
- `Servers` will check that all servers
in the group use a different port and a different name.
- ` `When a server is added, its
`Server.group` attribute will refer
to the `Servers` object.
- When a server is added, the `on_register` event will be fired.
```python
servers = MBSP.Servers(start=False, stop=False)
```
```python
servers.append(server)
```
```python
servers.[server_name] # Retrieve a server by name.
```
```python
servers.started # True when all servers in the group have started.
servers.start(timeout=60)
servers.stop()
```
The correct way to install a custom server is to add it to the module's
`active_servers` group to ensure that it
is started and stopped together with the other servers (e.g. when
`MBSP.stop()` is called).
Assume we have a tagger trained for biomedical use that overrides the
default part-of-speech tags:
```python
>>> import MBSP
>>> MBSP.active_servers.appe
MBSP is a text analysis system based on the [TiMBL](http://ilk.uvt.nl/timbl/) and [MBT](http://ilk.uvt.nl/mbt/) memory based learning applications developed at CLiPS and [ILK](http://ilk.uvt.nl/). It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment. The general English version of MBSP has been trained on data from the Wall Street Journal corpus. ![MBSP\_schema](/media/MBSP_schema.gif) ## Download
MBSP for Python (1.4) | download (.zip, 24MB)
Reference: Daelemans, W., & Van den Bosch, A. (2005). SHA256 checksum of the .zip: |
Tweet |
Option | Default | Description |
`MODULE` | `os.path.dirname(os.path.abspath(__file__))` | MBSP folder path. |
`LOCALHOST ` | `'localhost' ` | Localhost IP address. |
hosts | `[LOCALHOST, LOCALHOST, LOCALHOST, LOCALHOST]` | Server IP addresses. |
`ports` | `[6061, 6062, 6063, 6064] ` | Server ports. |
`autostart ` | `True ` | Automatically start servers? |
`autostop ` | `False ` | Automatically stop servers? |
`log ` | `False ` | Log server requests? |
`verbose` | `True ` | Server startup message? |
`paths ` | {'timbl': MODULE+'/timbl/Timbl'), 'mbt': MODULE+'/mbt/Mbt'), 'mblem': MODULE+'/mblem/mblem_english_bmt'), 'models': MODULE+'/models')} |
Binaries and training data. |
`threading` | `False` | See installation instructions. |
`encoding ` | `'utf-8' ` | Default character encoding. |
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。