PatternSim
==========

A tool for calculating semantic similarity between words from a text corpus based on lexico-syntactic patterns.

- Currently, the tool consists of two separate programs -- *patternsim* and *patternsim-rank* (see below).
- This tool implements the extraction method described in these papers:
  - Panchenko A., Morozova O., Naets H. “A Semantic Similarity Measure Based on Lexico-Syntactic Patterns.” In Proceedings of the 11th Conference on Natural Language Processing (KONVENS 2012), Vienna (Austria), 2012
    - http://www.oegai.at/konvens2012/proceedings/23_panchenko12p/
  - Kristina Sabirova, Artem Lukanin. Automatic Extraction of Hypernyms and Hyponyms from Russian Texts // Supplementary Proceedings of the 3rd International Conference on Analysis of Images, Social Networks and Texts (AIST 2014) / Ed. by D. I. Ignatov, M. Y. Khachay, A. Panchenko, N. Konstantinova, R. Yavorsky, D. Ustalov. Vol. 1197: Supplementary Proceedings of AIST 2014. CEUR-WS.org, 2014. pp. 35-40.
    - http://ceur-ws.org/Vol-1197/paper6.pdf
- A demo of the extraction results provided with this method can be accessed here: http://serelex.cental.be/
- Related repositories:
  - Source code of the demo system: https://github.com/PomanoB/lsse
  - An evaluation framework for semantic similarity measures: https://github.com/alexanderpanchenko/sim-eval

License
-------

LGPLv3: http://www.gnu.de/documents/lgpl-3.0.en.html

patternsim
----------

A tool for extracting raw extraction counts with lexico-syntactic patterns.

**Requirements**

- Perl 5.14.x or higher
- Unitex 3.0beta (http://www-igm.univ-mlv.fr/~unitex/)

**Installation on Ubuntu 12.04**

1. Install Unitex 3.0beta (http://www-igm.univ-mlv.fr/~unitex/zips/Unitex3.0beta.zip)
2. Install cpanm: "sudo cpan App::cpanminus"
3. Install all dependencies: "sudo cpanm --installdeps ."

**Quick Start**

Use *./rerank.sh* to rerank relations with the default formula. The script also serves as a usage example for patternsim-rank.

**Synopsis**

    patternsim [options] [corpus_file(s) ...]

**Options**

    Usage: patternsim [options] [corpus_file(s) ...]

    Mandatory options:
      --unitex            Unitex main directory
      --output (-o)       output directory

    Options:
      --vocabulary (-v)   input vocabulary file
      --workers (-w)      number of workers
      --language (-l)     language
      --list-languages    list all available languages
      --verbose           verbose mode
      --help              brief help message
      --man               full documentation

Options in detail:

- --unitex *unitex_main_directory*: Specify the Unitex main directory if you want to use your own Unitex installation (overrides the patternsim configuration file).
- --output -o *output_directory*: Specify the output directory.
- --vocabulary --vocab -v *vocabulary_file*: Specify the UTF-8 input vocabulary file (one word per line).
- --workers -w *number_of_workers*: Specify the number of parallel workers. Workers extract semantic relations in parallel. A good number of workers is the number of CPU cores minus 1.
- --language -l *language_id*: Specify the current language.
- --list-languages: Show all available languages (language_id and full name).
- --verbose: Activates verbose mode and explains what is being done. Output is written to stderr.
- --help -h: Prints a brief help message and exits.
- --man: Prints the manual page and exits.

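The worker recommendation above (CPU cores minus 1) can be sketched in a small Python wrapper that assembles the command line. This is only an illustration: the helper names `default_workers` and `build_patternsim_command` are hypothetical and not part of PatternSim, and the Unitex path is the one from the example in this README. Only the flags documented above are used.

```python
import os


def default_workers():
    """Return CPU cores minus 1, but never less than 1 (hypothetical helper)."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cores - 1)


def build_patternsim_command(unitex_dir, output_dir, corpus_files,
                             vocabulary=None, workers=None, language=None):
    """Assemble an argument list for patternsim from the documented flags."""
    cmd = ["./patternsim", "--unitex", unitex_dir, "--output", output_dir]
    if vocabulary is not None:
        cmd += ["--vocabulary", vocabulary]
    # Fall back to the "cores minus 1" rule when no worker count is given.
    count = workers if workers is not None else default_workers()
    cmd += ["--workers", str(count)]
    if language is not None:
        cmd += ["--language", language]
    cmd += list(corpus_files)
    return cmd


command = build_patternsim_command(
    "/home/user/Unitex3.0beta", "output", ["corpus.txt"],
    vocabulary="vocabulary.txt",
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the patternsim directory. Valid values for `language` are whatever `--list-languages` reports, so none is hard-coded here.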
**Example**

    ./patternsim --unitex /home/user/Unitex3.0beta -v vocabulary.txt -o output corpus.txt

This command writes a set of files to the directory "./output":

- *conc-freq.csv* -- a frequency list derived from a set of extraction concordances
- *corpus-freq.csv* -- a frequency list derived from the input corpus "corpus.txt"
- *pairs.csv* -- similarity matrix containing raw extraction counts between all single words
- *pairs-np.csv* -- similarity matrix containing raw extraction counts between all noun phrases
- *pairs-voc.csv* -- similarity matrix containing raw extraction counts between terms from the input vocabulary "vocabulary.txt"

The files *conc-freq.csv* and *corpus-freq.csv* are CSV files in the following format:

```
word;frequency\n
```

The files *pairs.csv*, *pairs-np.csv* and *pairs-voc.csv* are CSV files in the following format:

```
target-word;relatum-word;e-syno;e-cohypo;e-hyper-hypo;e-hyper;e-hypo;e-all;e1;e2;e3;e4;e5;e6;e7;e8;e9;e10;e11;e12;e13;e14;e15;e16;e17\n
```

Here *target-word* and *relatum-word* are words, *e-all* is the number of extractions between *target-word* and *relatum-word* with all 17 patterns, and *ei* is the number of extractions between *target-word* and *relatum-word* with the *i*-th pattern (see the papers referenced above for details). Thus *e-all* = sum_*i* (*ei*). *e-syno*, *e-cohypo*, *e-hyper-hypo*, *e-hyper* and *e-hypo* are the numbers of specific relations extracted between the terms (synonyms, co-hyponyms, hypernyms+hyponyms, hypernyms and hyponyms, respectively).

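The row layout described above can be read with a few lines of Python. This is a minimal sketch under assumptions the README does not confirm: that the files have no header row, that all counts are integers, and that the field names used here (`e_syno`, `patterns`, and so on) are just illustrative labels. The sample row is invented for the demonstration.

```python
import csv
import io

# Column labels for the eight leading fields (illustrative names).
PAIR_FIELDS = ["target", "relatum", "e_syno", "e_cohypo",
               "e_hyper_hypo", "e_hyper", "e_hypo", "e_all"]
NUM_PATTERNS = 17


def parse_pair_row(row):
    """Turn one semicolon-separated row into a dict with typed values."""
    expected = len(PAIR_FIELDS) + NUM_PATTERNS
    if len(row) != expected:
        raise ValueError(f"expected {expected} fields, got {len(row)}")
    record = {"target": row[0], "relatum": row[1]}
    for name, value in zip(PAIR_FIELDS[2:], row[2:8]):
        record[name] = int(value)  # assumption: counts are integers
    record["patterns"] = [int(value) for value in row[8:]]  # e1..e17
    return record


def read_pairs(handle):
    """Read all non-empty rows from an open pairs file."""
    reader = csv.reader(handle, delimiter=";")
    return [parse_pair_row(row) for row in reader if row]


def e_all_is_consistent(record):
    """Check the documented identity e-all = sum of e1..e17."""
    return record["e_all"] == sum(record["patterns"])


# Invented sample row: 3 extractions in total, from patterns 1 and 2.
sample = "fruit;apple;0;0;3;2;1;3;1;2" + ";0" * 15 + "\n"
records = read_pairs(io.StringIO(sample))
print(records[0]["target"], records[0]["relatum"], e_all_is_consistent(records[0]))
```

For a real file, replace the `io.StringIO` object with `open("output/pairs.csv", encoding="utf-8", newline="")`.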
**Corpus**

Here are some corpora which you may use with this tool:

- Some Wikipedia articles: http://cental.fltr.ucl.ac.be/team/~panchenko/patternsim/corpus/
- For even bigger corpora use ukWaC and WaCkypedia: http://wacky.sslmit.unibo.it/doku.php?id=corpora
- Use a DBPedia dump of Wikipedia: http://wiki.dbpedia.org/Downloads
- Use a corpus of your own

**Russian morphological dictionary**

The Russian dictionary in this repository is an extract of the Russian computational morphological dictionary developed at CIS, Munich. This extract contains about 15% of the original dictionary (the most frequent lemmata). The whole dictionary contains 140,000 simple entries (= 2.7 million distinct forms), 166,000 simple proper nouns (= 900,000 distinct forms) and 1800 compound words.

If you want to use the full version of the lexicon, please contact:

    Sebastian Nagel
    CIS
    Oettingenstr. 67
    80538 München
    Germany
    wastl@cis.uni-muenchen.de
    http://www.cis.uni-muenchen.de

For additional information see: Nagel, Sebastian 2002: Formenbildung im Russischen. Formale Beschreibung und Automatisierung für das CISLEX-Wörterbuchsystem (http://www.cis.uni-muenchen.de/~wastl/pub/ruslex.pdf)

For a short description (in German), see http://www.cis.uni-muenchen.de/~wastl/pub/ruslexUnitex.pdf

rank
----

Reranks the semantic similarity scores between words extracted with patternsim. Directory -- "rank".

**Synopsis**

    patternsim-rank [*options*]

**System Requirements**

- Windows -- Microsoft .NET framework 4.0 or higher (http://www.microsoft.com/net).
- Linux or Mac OSX -- Mono 2.0 or higher (http://www.go-mono.com/mono-downloads/). For instance, for Ubuntu 12.04 use "sudo apt-get install mono-runtime".
- At least 4 GB of RAM is recommended.

**Binaries**

Binaries are readily available in the bin folder. On Unix-based systems you may use "./patternsim-rank" or "./patternsim-rank.exe". On Windows, use "patternsim-rank.exe".

**Testing**

1. Download the test data: http://cental.fltr.ucl.ac.be/team/~panchenko/sim-eval/patternsim-rank-data.tgz
2. Save the archive to the "rank" directory.
3. Extract the data (tar xzf patternsim-rank-data.tgz). The directory "data" should appear.
4. Run the tests.sh script. It writes its output to the data/output folder.

**Recompilation**

1. Open patternsim-rank.sln with MonoDevelop or Visual Studio.
2. Build the solution.

**Options**

*p, pairs* Required. A UTF-8 encoded CSV file produced by the PatternSim program, in the format:

```
target;relatum;syno;cohypo;hyper_hypo;hyper;hypo;sum;pattern;pattern2;pattern3;pattern4;pattern5;pattern6;pattern7;pattern8;pattern9;pattern10;pattern11;pattern12;pattern13;pattern14;pattern15;pattern16;pattern17
```

This file must contain symmetric relations between words (PatternSim generates them by default). If there exists a relation 'target;relatum;type;sim', then there should exist one and only one relation 'relatum;target;type;sim' in the same file.

*o, output* Required. A UTF-8 encoded CSV file 'target;relatum;sim', where 'sim' is the similarity score between 'target' and 'relatum'. This file is sorted by 'target' and then by 'sim'.

*c, corpusfreq* Required. A UTF-8 encoded CSV file 'word;freq' with the frequencies of words.

*t, type* Required. Type of reranking:

1. Efreq, no reranking, transform scores to the interval [0;1].
2. Efreq-Rfreq, reranking by frequency of relations to other words. Uses option 'alpha'.
3. Efreq-Rnum, reranking by number of relations to other words. Uses option 'beta'.
4. Efreq-Cfreq, reranking by word frequency. Uses option 'corpusfreq'.
5. Efreq-Rnum-Cfreq, reranking by number of relations to other words and by word frequency. Uses options 'beta' and 'corpusfreq'.
6. Efreq-Rnum-Cfreq-Pnum, reranking by number of relations to other words, by word frequency and by the number of different patterns that extracted the relation. Uses options 'corpusfreq', 'patterns', 'beta' and 'sqrt'.

*a, alpha* Expected number of relations per word, default -- 15.

*b, beta* Minimum number of extractions which establish a relation between words, default -- 2.

*s, sqrt* Whether to take the square root of the number of different patterns, default -- true.
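
The *pairs* option requires symmetric relations: every 'target;relatum' row needs exactly one matching 'relatum;target' row. A pre-flight check for this can be sketched in Python as below. The function name `find_asymmetric_pairs` is hypothetical, and the sketch only checks that the rows are present once each. It does not compare the counts stored in the two rows.

```python
from collections import Counter


def find_asymmetric_pairs(pairs):
    """Return (target, relatum) pairs that break the symmetry requirement.

    A pair is reported when it occurs more than once, or when its
    reversed pair does not occur exactly once.
    """
    counts = Counter(pairs)
    problems = []
    for (target, relatum), seen in counts.items():
        reverse_seen = counts.get((relatum, target), 0)
        if seen != 1 or reverse_seen != 1:
            problems.append((target, relatum))
    return sorted(problems)


# Invented sample data: ("fruit", "pear") has no reversed counterpart.
sample_pairs = [
    ("fruit", "apple"),
    ("apple", "fruit"),
    ("fruit", "pear"),
]
print(find_asymmetric_pairs(sample_pairs))
```

The input is a list of the first two columns of each row in the pairs file. An empty result means the file satisfies the presence part of the requirement.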