资源说明:Transition-based parsing of word lattices
Transition-based parser for word lattices ==================================== by Benoit Favre2011-08-29 This program is a dependency parser using the transition-based framework by Nivre et al. that can process word lattices. For each path in the lattice, it produces the greedy 1-best sequence of transition, resulting in a parse tree. The output is represented as a minimal hypergraph containing all the parses. License: GPL v3 Dependences: java 1.5+ News ==== 2011-08-29: correct hypergraph output from tree-merger 2011-08-19: uniformized trainer and decoder, simplified data structures 2011-07-14: can parse macaon lattices in fr/en; can output a factorized tree (not a graph) 2011-07-13: factorize output using a subtree representation as hash keys 2011-07-12: lazy loading of the model (ModelOnDisk) 2011-07-08: level of functionality: can parse a lattice and output all parses need to come up with a storage methodology for the output Usage ===== train: java LatticeParser --train projective-trees-conll05 model predict: java LatticeParser model < input evaluate: java LatticeParser --eval model test-trees-conll05 To train a model, you need to provide a projective dependency treebank (non-projective trees can be projectivized with the malt parser). Training requires a lot of memory and a fair amount of time, depending on the size of the treebank. The parser uses liblinear to train a model that predicts parsing labeled shift-reduce transitions. The model is saved in a fast-loading binary format documented in FORMAT. For parsing lattices, you need a model trained with your parser and a lattice representation of your input. See Lattice file format for details. The output is a hypergraph representation of the input which includes arc ids to refer to the input. Arc ids are computed by topological sort. How does it work? ================= See doc.pdf (that you can generate with "pdflatex doc.tex") Ideas ===== * c implementation * incremental features * implement in malt paper: * present algorithm * sparse representation? * minimal tree automaton? * what about arc eager parser? * subtrees: explicitly factor tree during parse (because we have all subtrees): use input/chilren/label as hash key incremental: a given tree is already factorized * same for common ancestor tree * tree automata literature Lattice file format (input) =========================== The file format is defined as a list of arcs, plus optional (ignored) end states. word tag Hypergraph file format (output) =============================== Same as input format with additional -OR- label for alternation nodes in the hypergraph. id:word:tag label -OR- Notes ===== WARNING: liblinear fills the scores[] table with a different order than the label order. (file a bug report) How do we include scores? how many arcs? how many states can we process efficiently the resulting trees? => pattern matching => pruning paths => all parses through a word path; all words through a parse Projectivize FTB using malt =========================== java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.train.conll05 -o data/ftb6.train.conll05.projective -pp head java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.dev.conll05 -o data/ftb6.dev.conll05.projective -pp head java -jar malt-1.5.2/malt.jar -c pproj -m proj -i data/ftb6.test.conll05 -o data/ftb6.test.conll05.projective -pp head Examples ======== echo "Jean regarde l'homme qui mange une glace avec des jumelles." | txt2macaon | maca_segmenter | maca_tokenizer | maca_lexer | maca_tagger | macaon2htk -p | perl htk2fsm.perl | java -jar lattice-parser.jar ftb.model.binary | python2 fsm_draw.py echo "Jean likes to eat potatoes." | /data/demo-macaon/translation/maca_translate --text --model en-fr --prune 20 | maca_lexer -t -n -1 | maca_tagger -n 20 | macaon2htk -p | perl htk2fsm.perl | java -jar lattice-parser.jar ftb.model.binary | python view-hypergraph.py
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。