# Java web crawler [![Build Status](https://secure.travis-ci.org/sitespeedio/crawler.png?branch=master)](http://travis-ci.org/sitespeedio/crawler)

Simple Java (1.6) crawler that crawls web pages on one and the same domain. If your page is redirected to another domain, that page is not picked up, EXCEPT if it is the first URL that is tested.

Basically you can do this:
- Crawl from a start point, defining the depth of the crawl, and decide to crawl only a specific path
- Output all working URLs
- Output the data to a CSV file, separated into working (200 response code) and non-working URLs
- Output the data to two text files, one with working URLs and one with non-working ones, each URL on its own line
- Output URLs whose HTML contains a keyword
- Experimental support for verifying that the assets on a page work
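To make the level and path options concrete before the usage listings below, here is a small, self-contained sketch of a breadth-first, depth-limited, same-domain crawl with follow/not-follow path filters. This only illustrates the semantics of -l, -p and -np; it is not the project's implementation, and the start URL, depth and timeouts are placeholder values.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlSketch {

    // Crude href extraction; a real crawler would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=[\"']([^\"'#]+)[\"']");

    public static void main(String[] args) throws Exception {
        URL start = new URL("http://soulislove.com/"); // -u
        int levels = 2;                                // -l
        String followPath = "/";                       // -p, e.g. "/tagg/"
        String notFollowPath = null;                   // -np

        Set<URL> visited = new HashSet<URL>();
        Set<URL> current = new LinkedHashSet<URL>();
        current.add(start);

        // One iteration per crawl level, breadth first.
        for (int level = 0; level < levels && !current.isEmpty(); level++) {
            Set<URL> next = new LinkedHashSet<URL>();
            for (URL url : current) {
                if (!visited.add(url)) continue; // already crawled
                String html = fetch(url);
                if (html == null) continue;      // non-200 counts as non-working
                System.out.println(url);
                Matcher m = HREF.matcher(html);
                while (m.find()) {
                    URL link;
                    try { link = new URL(url, m.group(1)); } catch (Exception e) { continue; }
                    boolean sameDomain = start.getHost().equals(link.getHost());
                    boolean onPath = link.getPath().startsWith(followPath);
                    boolean excluded = notFollowPath != null && link.getPath().startsWith(notFollowPath);
                    if (sameDomain && onPath && !excluded) next.add(link);
                }
            }
            current = next;
        }
    }

    // Returns the body for 200 responses, null for everything else.
    private static String fetch(URL url) {
        try {
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setConnectTimeout(5000);
            con.setReadTimeout(5000);
            if (con.getResponseCode() != 200) return null;
            BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), "UTF-8"));
            StringBuilder body = new StringBuilder();
            for (String line; (line = in.readLine()) != null;) body.append(line).append('\n');
            in.close();
            return body.toString();
        } catch (Exception e) {
            return null;
        }
    }
}
```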
Crawl and output the working URLs to system out:

```
usage: CrawlToSystemOut [-l <arg>] [-np <arg>] [-p <arg>] -u <arg> [-v <arg>]
 -l,--level <arg>            how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath <arg>   no URLs on this path will be crawled [optional]
 -p,--followPath <arg>       stay on this path when crawling [optional]
 -u,--url <arg>              the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify <arg>           verify that all links return 200, default is true [optional]
 -rh,--requestHeaders <arg>  the request headers, in the form header1:value1@header2:value2 [optional]
```
You can choose to output the crawled list to two plain text files, one with the working URLs and one with the non-working:

```
usage: CrawlToFile [-ef <arg>] [-f <arg>] [-l <arg>] [-np <arg>] [-p <arg>] -u <arg> [-v <arg>] [-ve <arg>]
 -ef,--errorfilename <arg>   the name of the error output file, default name is errorurls.txt [optional]
 -f,--filename <arg>         the name of the output file, default name is urls.txt [optional]
 -l,--level <arg>            how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath <arg>   no URLs on this path will be crawled [optional]
 -p,--followPath <arg>       stay on this path when crawling [optional]
 -u,--url <arg>              the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify <arg>           verify that all links return 200, default is true [optional]
 -ve,--verbose <arg>         verbose logging, default is false [optional]
 -rh,--requestHeaders <arg>  the request headers, in the form header1:value1@header2:value2 [optional]
```
You can choose to output the result in a CSV file, separating the URLs into working and non-working:

```
usage: CrawlToCsv [-f <arg>] [-l <arg>] [-np <arg>] [-p <arg>] -u <arg> [-v <arg>]
 -f,--filename <arg>         the name of the CSV output file, default name is result.csv [optional]
 -l,--level <arg>            how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath <arg>   no URLs on this path will be crawled [optional]
 -p,--followPath <arg>       stay on this path when crawling [optional]
 -u,--url <arg>              the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify <arg>           verify that all links return 200, default is true [optional]
 -rh,--requestHeaders <arg>  the request headers, in the form header1:value1@header2:value2 [optional]
```
Crawl and output the URLs whose HTML contains a specific keyword:

```
usage: CrawlToPlainTxtOnlyMatching -k <arg> [-l <arg>] [-np <arg>] [-p <arg>] -u <arg> [-v <arg>]
 -k,--keyword <arg>          the keyword to search for in the page [required]
 -l,--level <arg>            how deep the crawl should be done, default is 1 [optional]
 -np,--notFollowPath <arg>   no URLs on this path will be crawled [optional]
 -p,--followPath <arg>       stay on this path when crawling [optional]
 -u,--url <arg>              the page that is the start point of the crawl, example http://mydomain.com/mypage
 -v,--verify <arg>           verify that all links return 200, default is true [optional]
 -rh,--requestHeaders <arg>  the request headers, in the form header1:value1@header2:value2 [optional]
```
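All the runners above accept the same -rh string. As a reference for the documented header1:value1@header2:value2 format, here is a tiny parsing sketch (not the project's own parser; the header names in the example are placeholders):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RequestHeaderParser {

    // Parses "header1:value1@header2:value2" into an ordered map.
    public static Map<String, String> parse(String rh) {
        Map<String, String> headers = new LinkedHashMap<String, String>();
        for (String pair : rh.split("@")) {
            String[] kv = pair.split(":", 2); // split on the first ':' only
            if (kv.length == 2) headers.put(kv[0].trim(), kv[1].trim());
        }
        return headers;
    }

    public static void main(String[] args) {
        // Prints {Accept-Language=sv-SE, X-Debug=true}
        System.out.println(parse("Accept-Language:sv-SE@X-Debug:true"));
    }
}
```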
## Configuration

There is also configuration that you either set in the crawler.properties file or override by adding it as a system property. The defaults are:

```
## Override these properties by setting a system property
com.soulgalore.crawler.nrofhttpthreads=5
com.soulgalore.crawler.threadsinworkingpool=5
com.soulgalore.crawler.http.socket.timeout=5000
com.soulgalore.crawler.http.connection.timeout=5000
# Auth like:
# soulislove.com:80:username:password,...
com.soulgalore.crawler.auth=
# Proxy properties, if you are behind a proxy.
## The host, in this special format: http:proxy.soulgalore.com:80
com.soulgalore.crawler.proxy=
```

The location of the crawler.properties file can be set with the system property com.soulgalore.crawler.propertydir.
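For example, to override a couple of these for a single run, pass them as -D system properties before -jar; the property names are the documented ones, while the values and URL here are placeholders:

```
java -Dcom.soulgalore.crawler.nrofhttpthreads=10 -Dcom.soulgalore.crawler.http.socket.timeout=10000 -jar crawler-1.5.11-full.jar -u http://soulislove.com
```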
## Examples

Check out the project and compile your own full jar (all dependencies included):

```
git clone git@github.com:soulgalore/crawler.git
```

or add it to Maven, if you want to include the crawler in your project:
```
<dependency>
 <groupId>com.soulgalore</groupId>
 <artifactId>crawler</artifactId>
 <version>1.5.11</version>
</dependency>
```
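If you depend on the artifact, one low-risk way to start a crawl from your own code is to call the same runner classes the command-line examples below use, since they are ordinary main-method entry points. This is only a sketch built on what this README documents (the class names and flags); the crawler may expose a richer programmatic API in its source, and a CLI main may call System.exit when it finishes.

```java
public class CrawlFromCode {
    public static void main(String[] args) throws Exception {
        // Same flags as on the command line; see the usage listing above.
        com.soulgalore.crawler.run.CrawlToFile.main(new String[] {
                "-u", "http://soulislove.com",
                "-l", "2",
                "-f", "workingurls.txt",
                "-ef", "nonworkingurls.txt"
        });
    }
}
```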
Running from the jar, fetching two levels deep and only fetching URLs that contain "/tagg/":

```
java -jar crawler-1.5.11-full.jar -u http://soulislove.com -l 2 -p /tagg/
```
Running from the jar, adding basic auth:

```
java -jar -Dcom.soulgalore.crawler.auth=soulgalore.com:80:peter:secret crawler-1.5.11-full.jar -u http://soulislove.com
```
Running from the jar, outputting the URLs in a CSV file:

```
java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToCsv -u http://soulislove.com
```
Running from the jar, outputting the URLs into two text files, workingurls.txt and nonworkingurls.txt:

```
java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToFile -u http://soulislove.com -f workingurls.txt -ef nonworkingurls.txt
```
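The keyword runner documented above can be invoked the same way; the keyword "sitespeed" here is just a placeholder:

```
java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlToPlainTxtOnlyMatching -u http://soulislove.com -k sitespeed
```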
Running from the jar, verifying that assets are ok:

```
java -cp crawler-1.5.11-full.jar com.soulgalore.crawler.run.CrawlAndVerifyAssets -u http://www.peterhedenskog.com
```

## License

Copyright 2014 Peter Hedenskog

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/soulgalore/crawler/trend.png)](https://bitdeli.com/free "Bitdeli Badge")