wikipedia-miner-fork-rev89
Fork of Wikipedia Miner from Revision 89 (current trunk as of August 21, 2010)

Wikipedia Miner Readme

Requirements

To run Wikipedia Miner, you will need plenty of hard-drive space and around 3 GB of memory. On top of that, you will need...

  • Write access to a MySql Server
  • Java (1.5 or above)
  • MySql Connector/J—Java API for connecting to MySql databases.
  • Trove—Java API for efficient sets, hashtables, etc.
  • Weka—an open-source workbench of machine learning algorithms for data mining tasks.

If you only need Wikipedia's structure rather than its full textual content, then you can save a lot of time by using one of our pre-summarized dumps (available from SourceForge; see step 2 below). Otherwise, you will need:

  • A copy of Wikipedia's content (one of the pages-articles.xml.bz2 files from here)
  • Perl
  • Parse::MediaWikiDump—Perl tools for processing MediaWiki dump files.

If you want to host your own Wikipedia Miner web services, then you will also need Apache Tomcat (see the Web Services section below).

Installation

  1. Set up a mysql server

    This should be configured for fairly large MyISAM databases. You will need a database to store the Wikipedia data, and an account that has write access to it.
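
    If you prefer to do this from Java, here is a minimal sketch that creates the database and checks that your account can write to it, using MySql Connector/J. The database name (wikipedia) and the account details are placeholders; substitute your own, and create the account itself however you normally administer MySql.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;

      public class SetupDatabase {
          public static void main(String[] args) throws Exception {
              // load the Connector/J driver (needed on older JVMs)
              Class.forName("com.mysql.jdbc.Driver");

              // placeholder connection details - substitute your own
              Connection connection = DriverManager.getConnection(
                      "jdbc:mysql://localhost/", "wm_user", "wm_password");
              Statement statement = connection.createStatement();

              // create the database that will hold the Wikipedia data
              statement.executeUpdate("CREATE DATABASE IF NOT EXISTS wikipedia");

              // quick check that the account can write to it
              statement.executeUpdate("CREATE TABLE wikipedia.write_test (id INT)");
              statement.executeUpdate("DROP TABLE wikipedia.write_test");

              statement.close();
              connection.close();
              System.out.println("Database created and writable.");
          }
      }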

  2. If you are happy using the versions of Wikipedia that we have preprocessed and made available, then download one from SourceForge, and skip to step 6. Otherwise...

  3. Download and uncompress a wikipedia xml dump

    We want the file with current versions of article content, the one that ends with pages-articles.xml. Uncompress it and put it in a folder of its own, on a drive where you have plenty of space.

  4. Tweak the extraction script for your version of Wikipedia

    This is only necessary if you are not working with the en (English) Wikipedia. Open extraction/extractWikipediaData.pl in a text editor. Under the section called "tweaking for different versions of wikipedia", modify the following variables:

    • @disambig_templates—an array of template names that are used to identify disambiguation pages
    • @disambig_categories—an array of category names that disambiguation pages belong to. This is only needed because many editors categorize these pages directly, instead of using the appropriate template
    • $root_category—the name of the root category, the one from which all other categories descend.

    WARNING: These scripts have only been tested on en and simple Wikipedias. If you get a language version working, then please post the details up on the sourceforge forum.

  5. Extract csv summaries from the xml dump

    The perl scripts for this are in the extraction directory.

    1. The main script, extractWikipediaData.pl, does everything that most users will need (more on what it doesn't do below). To run it, call perl extractWikipediaData.pl <dir>, where <dir> is the directory where you put the xml dump.

      You can supply an extra flag, -noContent, if you just want the structure of how pages link to each other rather than their full textual content. The full content takes up a huge amount of space, and you can do a lot (finding topics, navigating links, identifying how topics relate to each other, etc.) without it.

      WARNING: this will keep the computer busy for a day or two; you can check out the forums to see how long it has taken other people. If you do need to halt the process then don't worry, it will pick up more-or-less where it left off.

      If the script seems to stall, particularly when summarizing anchors or the incoming links to each page, then chances are you have run out of memory. There is another flag, -passes, which splits the data up so that it fits. The default is -passes 2; try something higher and see how it goes.

    2. The only csv file that the above script will not extract is anchor_occurance.csv, which compares how often anchor terms are used as links, and how often they occur in plain text. Chances are you won't want this—it's mainly useful for identifying how likely terms are to correspond to topics, so that topics can be recognized when they occur in plain text.

      This takes a very long time to calculate, longer than extracting all of the other files, but fortunately it is easily parallelized. The following scripts are included so that you can throw multiple computers (or processors) at the problem, if you have them.

      • splitData.pl <dir> <n>

        splits the data found in <dir> into <n> separate files. The files are saved as split_0.csv, split_1.csv, etc. within the provided directory.

      • extractAnchorOccurances.pl

        calculates anchor occurrences from a split file. The directory must contain one of the split files produced by splitData.pl, and the anchor.csv file created by extractWikipediaData.pl. The results are saved to anchor_occurance_<n>.csv.

      • mergeAnchorOccurances.pl <dir> <n>

        merges all of the intermediate results. The directory <dir> must contain all of the separate anchor_occurance_<n>.csv files. The result is saved to <dir>/anchor_occurance.csv.

  6. Import the extracted data into MySQL

    The easiest way to do this is via java—just create an instance of WikipediaDatabase with the details of the database you created earlier, and call the loadData() method with the directory containing the extracted csv files. This will do the work of creating all of the tables and loading the data into them, and will even give you information on how long it's taking. At worst this should take a few hours.

    Here is some code to get you going:

    // connect to database
    Wikipedia wikipedia = new Wikipedia(mysql_server, databaseName, mysql_username, mysql_password) ;

    // load csv files
    File dataDirectory = new File("path/to/csv/directory/") ;
    wikipedia.getDatabase().loadData(dataDirectory, false) ;

    // prepare text processors
    wikipedia.getDatabase().prepareForTextProcessor(new CaseFolder()) ;

    // cache definitions (only worth doing if you will be using them a lot - will take a day or so)
    wikipedia.getDatabase().summarizeDefinitions() ;

    NOTE: You need the MySql Connector/J, Trove, and Wikipedia-Miner jar files in the build path to compile and run the Java code. You may also need to increase the memory available to the Java virtual machine, using the -Xmx flag.

    NOTE: The format of the csv files has changed recently. If you are using an older summarized dump you may need to run the patchWikipediaData.pl script (in the extraction directory). It takes the same parameters as the extractWikipediaData.pl script discussed above.

  7. Delete unneeded files

    Don't delete everything. Some of the csv files will be needed for caching data to memory, because it is faster to do that from file than from the database. So keep the following files:

    • page.csv
    • categorylink.csv
    • pagelink_out.csv
    • pagelink_in.csv
    • anchor_summary.csv
    • anchor_occurance.csv
    • generality.csv

    You can delete all of the others. It might be worth zipping the original xml dump up and keeping it, though, because the dumps don't seem to be archived anywhere for more than a few months.

  8. Start developing!

    Hopefully the JavaDoc will be clear enough to get you going. Also have a look at the main methods for each of the main classes (Wikipedia, Article, Anchor, Category, etc) for demos on how to use them.
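
    As a rough illustration, here is the kind of thing you can do once the data is loaded. The getArticleByTitle and getRelatednessTo calls below (and the package names in the imports) are recalled from the toolkit's API and may differ slightly in this revision, so treat this as a sketch and check the JavaDoc for the exact signatures.

      import org.wikipedia.miner.model.Article;
      import org.wikipedia.miner.model.Wikipedia;

      public class Demo {
          public static void main(String[] args) throws Exception {
              // connect using the same database details as in step 6
              Wikipedia wikipedia = new Wikipedia("localhost", "wikipedia", "wm_user", "wm_password");

              // look up two articles by their titles (method name assumed; see the JavaDoc)
              Article kiwi = wikipedia.getArticleByTitle("Kiwi");
              Article takahe = wikipedia.getArticleByTitle("Takahe");

              // measure how strongly the two topics relate to each other
              // (method name assumed; see the JavaDoc)
              System.out.println(kiwi.getRelatednessTo(takahe));
          }
      }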

    Pop into the SourceForge forum if you have any trouble.

Web Services

If you need help on using any of the Wikipedia Miner web services, append &help to any of the service URLs.

If you want to host the Wikipedia Miner web services yourself, you will need to:

  1. Get Apache Tomcat up and running

    This is beyond the scope of this document, but there are lots of tutorials out there.

  2. Ensure that the Wikipedia Miner web directory is accessible via Tomcat

    This is the web directory within your Wikipedia Miner installation. Again, there are lots of tutorials out there to explain how to do this.

  3. Gather the appropriate jar files

    Place the jar files for Wikipedia Miner, Weka, Trove and MySql Connector/J into web/WEB-INF/lib/.

  4. Configure web.xml

    Edit web/WEB-INF/web.xml to specify the following context parameters:

    • server_path

      The url of the web directory (as it would be typed into a web browser)

    • service_name

      The name of the service (typically service, unless you change it)

    • mysql_server

      The name of the server hosting the mysql database (typically localhost)

    • mysql_database

      The name of the database in which Wikipedia Miner's data has been stored.

    • mysql_user

      The name of a mysql user who has read access to the database (you can leave this out if anonymous access is allowed)

    • mysql_password

      The password for the mysql user (you can leave this out if anonymous access is allowed)

    • data_directory

      The directory containing csv files extracted from a Wikipedia dump.

    • xslt_directory

      The directory containing xslt files for transforming xml responses into readable html (web/xsl).

    If you want users to be able to wikify URLs (the wikify service) or retrieve images from FreeBase (the define service) then you will have to tell Wikipedia Miner how to connect to the internet.

    • proxy_host
    • proxy_port
    • proxy_user
    • proxy_password

    If you want users to be able to wikify anything, then you will have to tell Wikipedia miner where to find models for disambiguation and link detection.

    • wikifier_disambigModel

      You can create one of these by building and saving a Disambiguator classifier (see the JavaDoc), or just use the one provided in models/disambig.model.

    • wikifier_linkModel

      You can create one of these by building and saving a LinkDetector classifier (see the JavaDoc), or just use the one provided in models/linkDetect.model.

    • stopword_file

      A file containing words to be ignored when detecting links (one word per line).

With Tomcat running, you should now be able to navigate to server_path in a web browser and see a page listing all of the available web services.

Note: You will probably have to allocate more memory to Tomcat than usual, because these web services have to cache Wikipedia's skeleton structure into memory. That's about 3 GB for the current English-language version. To allocate this, modify the CATALINA_OPTS environment variable to include "-Xmx3G".

Tables and Summary Files

This section describes the summaries (csv files and database tables) that Wikipedia Miner produces, in case you want to use them directly. A short example of querying them appears after the list.

  • page

    lists all of the valid pages in the dump, along with their titles and types

    • id (Integer)
    • title (String)
    • type (1=ARTICLE, 2=CATEGORY, 3=REDIRECT, 4=DISAMBIGUATION)
  • redirect

    associates all redirect pages with their targets

    • from_id (Integer)
    • to_id (Integer)
  • categorylink

    associates all categories with their child categories and articles.

    • parent_id (Integer)
    • child_id (Integer)
  • pagelink

    associates all articles with the other articles they link to. Redirects are resolved wherever possible.

    • from_id (Integer)
    • to_id (Integer)
  • anchor

    associates the text used within links to the pages that they link to.

    • text (String, the anchor text found within the link)
    • to_id (Integer)
    • count (Integer, number of times this anchor is used to link to this destination)
  • disambiguation

    associates disambiguation pages with the senses listed on them.

    • from_id (Integer, the id of the disambiguation page)
    • to_id (Integer, the id of the sense page)
    • index (Integer, the position or index of this sense. Items mentioned earlier tend to be more obvious or well known senses of the term)
    • scope_note (String, the text used to explain how this sense relates to the ambiguous term)
  • equivalence

    associates categories and articles which correspond to the same concept

    • cat_id (Integer)
    • art_id (Integer)
  • generality

    lists the minimum distance (or depth) between a category or article and the root of Wikipedia's category structure. Small values indicate general topics, large values indicate highly specific ones.

    • id (Integer)
    • depth (Integer)
  • translation

    associates an article with all of the language links listed within it.

    • id (Integer)
    • lang (String, en, fr, jp, etc)
    • text (String, generally a translation of the article's title, in the given language)
  • content

    associates an article with its content, in the original markup

    • id (Integer)
    • content (String, in raw mediawiki markup)
  • stats

    provides some summary statistics of the wikipedia dump

    • article_count (Integer)
    • category_count (Integer)
    • redirect_count (Integer)
    • disambiguation_count (Integer)
  • anchor_occurance

    lists the number of articles in which an anchor occurs as a link, versus the number of articles in which it occurs at all.

    • anchorText (String)
    • link_count (Integer)
    • occ_count (Integer)
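
Since the data ends up in ordinary MySql tables, you can also query it directly, without going through the Java API. Here is a sketch that looks up the destinations of an anchor, ordered by how often the anchor is used to link to them. The table and column names are taken from the field lists above; the tables actually created by WikipediaDatabase may name their columns slightly differently, so check with DESCRIBE before relying on this.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;

  public class AnchorQuery {
      public static void main(String[] args) throws Exception {
          // load the Connector/J driver (needed on older JVMs)
          Class.forName("com.mysql.jdbc.Driver");

          // placeholder connection details - substitute your own
          Connection connection = DriverManager.getConnection(
                  "jdbc:mysql://localhost/wikipedia", "wm_user", "wm_password");

          // join the anchor table to the page table to resolve destination titles;
          // column names assumed from the field lists above
          PreparedStatement query = connection.prepareStatement(
                  "SELECT page.title, anchor.count " +
                  "FROM anchor JOIN page ON anchor.to_id = page.id " +
                  "WHERE anchor.text = ? ORDER BY anchor.count DESC");
          query.setString(1, "kiwi");

          ResultSet results = query.executeQuery();
          while (results.next())
              System.out.println(results.getString(1) + "\t" + results.getInt(2));

          connection.close();
      }
  }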

Licence

The Wikipedia Miner toolkit is open-source software, distributed under the terms of the GNU General Public License. It comes with ABSOLUTELY NO WARRANTY.

