Ruby Readability
================

Ruby Readability is a tool for extracting the primary readable content of a
webpage. It is a Ruby port of arc90's readability project.

Build Status
------------

[![Ruby](https://github.com/cantino/ruby-readability/actions/workflows/ruby.yml/badge.svg?branch=master)](https://github.com/cantino/ruby-readability/actions/workflows/ruby.yml)

Install
-------

Command line:

    (sudo) gem install ruby-readability

Bundler:

    gem "ruby-readability", :require => 'readability'


Example
-------

    require 'rubygems'
    require 'readability'
    require 'open-uri'

    # URI#read is provided by open-uri; Kernel#open no longer accepts URLs on Ruby 3.0+.
    source = URI.parse('http://lab.arc90.com/experiments/readability/').read
    puts Readability::Document.new(source).content
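
The example above hands the fetched source straight to `Readability::Document.new`. When the server declares a character set, it may be worth passing it along through the `:encoding` option listed under Options. The following is only a minimal sketch of that idea: `charset_from_content_type` and `readability_options` are hypothetical helpers written for illustration, not part of ruby-readability, and the header value is a made-up sample.

```ruby
# Hypothetical helper (not part of ruby-readability): extract the charset
# from a Content-Type header value such as "text/html; charset=ISO-8859-1".
# Returns nil when no charset is declared.
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end

# Hypothetical helper: build an options hash for Readability::Document.new.
# Only :encoding is set, and only when the header actually declares a charset.
def readability_options(content_type)
  charset = charset_from_content_type(content_type)
  charset ? { :encoding => charset } : {}
end

# Sample header value, assumed for illustration.
options = readability_options('text/html; charset=ISO-8859-1')

# With the gem installed and a page source in hand, the hash could be used as:
#   Readability::Document.new(source, options).content
```

When no charset is declared the hash stays empty, so the library's own default behaviour applies.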


Options
-------

You may provide options to `Readability::Document.new`, including:

* `:tags`: the base whitelist of tags to sanitize, defaults to `%w[div p]`;
* `:remove_empty_nodes`: remove `<p>` tags that have no text content; also
  removes `<p>` tags that contain only images;
* `:attributes`: whitelist of allowed attributes;
* `:debug`: provide debugging output, defaults to `false`;
* `:encoding`: if the page is of a known encoding, you can specify it; if left
  unspecified, the encoding will be guessed (only in Ruby 1.9.x). If you wish
  to disable guessing, supply `:do_not_guess_encoding => true`;
* `:html_headers`: in Ruby 1.9.x these will be passed to the
  `guess_html_encoding` gem to aid with guessing the HTML encoding;
* `:ignore_image_format`: for use with `#images`. For example:
  `:ignore_image_format => ["gif", "png"]`;
* `:min_image_height`: set a minimum image height for `#images`;
* `:min_image_width`: set a minimum image width for `#images`;
* `:blacklist` and `:whitelist` allow you to explicitly scope to, or remove,
  CSS selectors.


Command Line Tool
-----------------

Readability comes with a command-line tool for experimentation in
`bin/readability`.

    Usage: readability [options] URL
        -d, --debug                      Show debug output
        -i, --images                     Keep images and links
        -h, --help                       Show this message


Images
------

You can get a list of images in the content area with `Document#images`. This
feature requires that the `fastimage` gem be installed.

    rbody = Readability::Document.new(body, :tags => %w[div p img a], :attributes => %w[src href], :remove_empty_nodes => false)
    rbody.images


Related Projects
----------------

* [readability.cr](https://github.com/joenas/readability.cr) - Port of
  ruby-readability's port of arc90's readability project to Crystal
* [newspaper](https://github.com/codelucas/newspaper) is an advanced news
  extraction, article extraction, and content curation library for Python.


Potential Issues
----------------

If you're on a Mac and are getting segmentation faults, consider updating your
version of `libxml2`. Version 2.7.8 of `libxml2`, installed with `brew`, worked
for me:

    gem install nokogiri -- --with-xml2-include=/usr/local/Cellar/libxml2/2.7.8/include/libxml2 --with-xml2-lib=/usr/local/Cellar/libxml2/2.7.8/lib --with-xslt-dir=/usr/local/Cellar/libxslt/1.1.26

Or if you're using bundler and Rails 3, you can run this command to make
bundler always globally build `nokogiri` this way:

    bundle config build.nokogiri -- --with-xml2-include=/usr/local/Cellar/libxml2/2.7.8/include/libxml2 --with-xml2-lib=/usr/local/Cellar/libxml2/2.7.8/lib --with-xslt-dir=/usr/local/Cellar/libxslt/1.1.26


License
-------

This code is under the Apache License 2.0.

Ruby port by cantino, starrhorne, libc, and iterationlabs.

Special thanks to fizx and marcosinger.

