python-goose
文件大小: unknow
源码售价: 5 个金币 积分规则     积分充值
资源说明:Html Content / Article Extractor, web scrapping lib in Python
Python-Goose - Article Extractor |Build Status|
===============================================

Intro
-----

Goose was originally an article extractor written in Java that has most
recently (Aug2011) been converted to a `scala project `_.

This is a complete rewrite in Python. The aim of the software is to
take any news article or article-type web page and not only extract what
is the main body of the article but also all meta data and most probable
image candidate.

Goose will try to extract the following information:

-  Main text of an article
-  Main image of article
-  Any YouTube/Vimeo movies embedded in article
-  Meta Description
-  Meta tags

The Python version was rewritten by:

-  Xavier Grangier

Licensing
---------

If you find Goose useful or have issues please drop me a line. I'd love
to hear how you're using it or what features should be improved.

Goose is licensed by Gravity.com under the Apache 2.0 license; see the
LICENSE file for more details.

Setup
-----

::

    mkvirtualenv --no-site-packages goose
    git clone https://github.com/grangier/python-goose.git
    cd python-goose
    pip install -r requirements.txt
    python setup.py install

Take it for a spin
------------------

::

    >>> from goose import Goose
    >>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
    >>> g = Goose()
    >>> article = g.extract(url=url)
    >>> article.title
    u'Occupy London loses eviction fight'
    >>> article.meta_description
    "Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
    >>> article.cleaned_text[:150]
    (CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
    >>> article.top_image.src
    http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg

Configuration
-------------

There are two ways to pass configuration to goose. The first one is to
pass goose a Configuration() object. The second one is to pass a
configuration dict.

For instance, if you want to change the userAgent used by Goose just
pass:

::

    >>> g = Goose({'browser_user_agent': 'Mozilla'})

Switching parsers : Goose can now be used with lxml html parser or lxml
soup parser. By default the html parser is used. If you want to use the
soup parser pass it in the configuration dict :

::

    >>> g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})

Goose is now language aware
---------------------------

For example, scraping a Spanish content page with correct meta language
tags:

::

    >>> from goose import Goose
    >>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
    >>> g = Goose()
    >>> article = g.extract(url=url)
    >>> article.title
    u'Las listas de espera se agravan'
    >>> article.cleaned_text[:150]
    u'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay m\xe1s ciudad'

Some pages don't have correct meta language tags, you can force it using
configuration :

::

    >>> from goose import Goose
    >>> url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'
    >>> g = Goose({'use_meta_language': False, 'target_language':'es'})
    >>> article = g.extract(url=url)
    >>> article.cleaned_text[:150]
    u'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kil\xf3metros de Lyon, a Izaskun Lesaka y '

Passing {'use\_meta\_language': False, 'target\_language':'es'} will
forcibly select Spanish.


Video extraction
----------------

::

    >>> import goose
    >>> url = 'http://www.liberation.fr/politiques/2013/08/12/journee-de-jeux-pour-ayrault-dans-les-jardins-de-matignon_924350'
    >>> g = goose.Goose({'target_language':'fr'})
    >>> article = g.extract(url=url)
    >>> article.movies
    []
    >>> article.movies[0].src
    'http://sa.kewego.com/embed/vp/?language_code=fr&playerKey=1764a824c13c&configKey=dcc707ec373f&suffix=&sig=9bc77afb496s&autostart=false'
    >>> article.movies[0].embed_code
    '