A directional, lightweight crawler for scraping product info.
*******************
  Bee -- 小蜜蜂
*******************

:Author: ted@tcui.org
:Created: 2010/03/24
:Source: README

.. sectnum::
.. contents:: Table of Contents
   :depth: 2

If you want to try the Bee (小蜜蜂) as quickly as possible, jump straight to `Running the Bee`_.

=============
Overview
=============

Bee (小蜜蜂) is a small, flexible directional (focused) crawler. It is highly
configurable and extensible. Compared with a general-purpose web crawler, a
directional crawler has to solve the following problems:

- Targeting (定向) -- focus on certain areas of a site: you do not want to
  retrieve every single page of the target site. Most likely, you only want to
  visit product catalog pages and product detail pages.

- Crawling path (爬行路径) -- crawling depth and crawling path: since you want
  to harvest complete product catalogs, the crawler needs to "page" through all
  catalog pages. In other words, the crawler can go very deep into the site.
  On modern shopping sites, the catalog pages are actually backed by "search",
  which means there are hundreds of ways to represent the same catalog. The
  crawler could get into infinite loops if the crawling path is not controlled
  carefully. The crawling process in Bee is controlled by the Seeker class.
  Its sub-class RuleBasedSeeker is a fairly generic abstraction: you can define
  your crawling paths by feeding it some simple rules. In extreme cases, you
  may need to sub-class it to implement more sophisticated crawling controls.

- Content extraction (内容萃取) -- details extraction: unfortunately, there is
  no good generic way to extract structured data from web pages. In the Bee
  package, the base class Miner does not really do anything; developers need to
  write a dedicated Miner for each site. The good news is that web sites do not
  change their layouts very often, and if you write the Miner carefully it can
  even tolerate minor changes. For example, all Taobao stores can share the
  same Miner class. Many techniques can be used to extract information from web
  pages, for example DOM, XPath, regex, CSS selectors, etc.

Since the Bee is a crawler, it also covers the functions required by any good
general-purpose crawler:

- concurrency control: controlled by the number of worker threads, configurable
  in the Job rules.
- rate control: a simple "pause" can be configured in the Job rules.
- re-visit control: do not revisit a page if it has been visited within the
  past N seconds.
- HTTP related: USER_AGENT, proxy, authentication, GET/POST. These are handled
  by the Fetcher; its sub-class SimpleHTTPFetcher is good enough for most
  HTTP GET accesses.
- encoding: you can rely on BeautifulSoup to detect the correct encoding, but
  it often gets Chinese sites wrong, so it is better to specify the encoding in
  the Job rule configuration. For example, okaybuy.com.cn is a nasty one: it
  declares its encoding as "zh-CN" and its pages contain invalid characters.
  I had to patch BeautifulSoup and force "gb2312" for it.
- detecting site changes: this is actually more difficult than it sounds. For
  small sites, we can afford to fully re-crawl them daily. For larger sites,
  the typical strategy is to run multiple crawlers: one or a few frequent,
  small, shallow crawling jobs plus one less frequent full-crawl job.

================
Running the Bee
================

Environment Requirements
==========================

Python
--------

The Bee is developed and tested with Python 2.6.4 on Mac OS X (BSD). It should
run on any OS with a recent version of Python. The following instructions are
for Linux-like environments; shell commands should be adjusted accordingly when
running under Windows.
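Before going further, a quick sanity check such as the snippet below can
confirm that the interpreter and the bundled HTML parser are usable. This is
not part of the package, just an illustrative sketch; it assumes the bundled
BeautifulSoup module sits at the top level of the package directory and that
PYTHONPATH has already been set up as described below::

    import sys
    print "Python:", sys.version             # the Bee is developed with 2.6.4
    # Adjust this import if the bundled copy lives elsewhere in the package.
    from BeautifulSoup import BeautifulSoup
    print "BeautifulSoup imported OK"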
All command scripts were created with the executable (+x) permission, so they
can be started directly from the command line::

    ./taobao-crawler.py

You can also specify the Python interpreter explicitly::

    python taobao-crawler.py

Python can be obtained from http://python.org

Python is a very friendly language. Its syntax is very close to the pseudo code
that you have seen in many computer books, so you can understand it perfectly
without formally learning it. I started writing Python programs before reading
any tutorials; as a matter of fact, I have never finished reading my "Learning
Python" book :-) So, don't be scared.

First of all, please unpack and copy the package to your disk. Assume you
dropped the package into the $HOME/bee directory.

PYTHONPATH
-------------

Then you need to set up the environment variable PYTHONPATH::

    export PYTHONPATH=$HOME/bee

Now you can verify your installation by running one of the tests::

    $ export PYTHONPATH=$HOME/bee
    $ cd ~/bee/tests
    $ ./bee-test-1.py

If you see some output without any errors, you are good to go. Don't worry
about the meaning of the messages yet.

Start One Crawler
------------------

Now let's try to crawl one small Taobao store::

    $ cd ~/bee/crawlers/
    $ ./taobao-crawler.py carephilly
    2010-03-24 16:08:22,807 [INFO] [bee.py:1055] Starting up the crawling job
    2010-03-24 16:08:22,809 [INFO] [bee.py:752] Worker started
    2010-03-24 16:08:22,810 [INFO] [bee.py:752] Worker started
    2010-03-24 16:08:22,810 [INFO] [bee.py:1058] Waiting job to be done
    2010-03-24 16:08:22,810 [INFO] [bee.py:1062] idle_cnt: 0, Workers: 2, Tasks:(0, pending:1), Links:(0, succ:0, fail:0), Output: 0
    2010-03-24 16:08:25,811 [INFO] [bee.py:1062] idle_cnt: 0, Workers: 2, Tasks:(1, pending:0), Links:(1, succ:0, fail:0), Output: 0
    2010-03-24 16:08:28,811 [INFO] [bee.py:1062] idle_cnt: 1, Workers: 2, Tasks:(4, pending:64), Links:(3, succ:1, fail:0), Output: 0
    2010-03-24 16:08:31,825 [INFO] [bee.py:1062] idle_cnt: 0, Workers: 2, Tasks:(4, pending:64), Links:(3, succ:1, fail:0), Output: 0
    2010-03-24 16:08:34,826 [INFO] [bee.py:1062] idle_cnt: 0, Workers: 2, Tasks:(4, pending:64), Links:(3, succ:3, fail:0), Output: 0
    2010-03-24 16:08:37,830 [INFO] [bee.py:1062] idle_cnt: 0, Workers: 2, Tasks:(10, pending:130), Links:(5, succ:3, fail:0), Output: 1
    2010-03-24 16:08:40,853 [INFO] [bee.py:1062] idle_cnt: 0, Workers: 2, Tasks:(10, pending:130), Links:(5, succ:4, fail:0), Output: 1
    2010-03-24 16:08:43,911 [INFO] [bee.py:1062] idle_cnt: 0, Workers: 2, Tasks:(10, pending:130), Links:(5, succ:5, fail:0), Output: 2

It is now crawling the 'carephilly' (卡芙琳) store. When the job ends, it will
show::

    2010-03-24 16:18:01,631 [INFO] [bee.py:1062] idle_cnt: 2, Workers: 2, Tasks:(958, pending:0), Links:(132, succ:132, fail:0), Output: 120
    2010-03-24 16:18:01,631 [INFO] [bee.py:1066] idle_cnt 3 reached max, prepare to stop
    2010-03-24 16:18:01,631 [INFO] [bee.py:1076] preparing to stop job
    2010-03-24 16:18:01,662 [INFO] [bee.py:767] Worker ended
    2010-03-24 16:18:01,702 [INFO] [bee.py:767] Worker ended
    2010-03-24 16:18:01,703 [INFO] [bee.py:1079] Job is done

Accessing Taobao from my laptop is really slow; it took about 10 minutes to
finish downloading the 120 products with 2 worker threads.
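The final status line reports "Output: 120". A quick way to double-check the
result is to count the JSON lines in the product file the job just wrote. The
snippet below is a standalone helper, not part of the Bee; it assumes the
one-JSON-object-per-line format described in the "Observing progress --
xxx.prod.json" section below::

    import json

    count = 0
    last = None
    for line in open("taobao.carephilly.prod.json"):
        line = line.strip()
        if not line:
            continue
        last = json.loads(line)     # one ascii-safe JSON object per line
        count += 1
    print count, "products extracted"
    if last:
        print "last product price:", last.get("prod_price")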
Observing Progress -- job status line
---------------------------------------

Every 3 seconds, the crawler prints a status line to show its progress::

    idle_cnt: 0, Workers: 2, Tasks:(10, pending:130), Links:(5, succ:5, fail:0), Output: 2

- idle_cnt: becomes non-zero when the task queue becomes empty
- Workers: the number of worker threads
- Tasks: crawling task status (Workers work on these tasks :-)

  - the first number is the number of tasks that have been finished
  - pending: the number of tasks waiting to be done

- Links: the status of link accesses

  - the first number is the number of link access attempts (mostly HTTP GET)
  - succ: the number of times the content of a link was downloaded successfully
  - fail: the number of times the crawler had difficulty opening a link

- Output: the number of product descriptions that have been extracted since the
  job started

While the crawler is running, besides the aforementioned status log lines, you
can also observe the progress in other ways.

Observing progress -- xxx.prod.json
-------------------------------------

First, you can tail the xxx.prod.json file to watch which products have been
found by the crawler. There is one ascii-safe JSON string line for each
product. It is more readable with the help of the "print_json.py" script::

    $ tail -f taobao.carephilly.prod.json | ./print_json.py
    Line 1:
    {
      "prod_price": "188.00",
      "shipping_cost": 0.0,
      "url": "http://item.taobao.com/auction/item_detail-0db2-d6a34a97fc369709c69ccc0bacb80d2e.htm",
      "prod_img_s_3": "http://img01.taobaocdn.com/imgextra/i1/91892992/T244XaXblXXXXXXXXX_!!91892992.jpg_40x40.jpg",
      "sku_url": "http://item.taobao.com/spu-89741729-4453893344-1.htm",
      "prod_img_s_4": "http://img01.taobaocdn.com/imgextra/i1/91892992/T26NXaXbdXXXXXXXXX_!!91892992.jpg_40x40.jpg",
      "prod_details": [
        {
          "attr_name": "货号",
          "attr_value": "货号:CL9905-2"
        },
        {
          "attr_name": "品牌",
          "attr_value": "品牌:Caerphilly/卡芙琳"
        },
        ...

Observing Progress -- xxx.link.db and xxx.task.db
----------------------------------------------------

The crawler also creates xxx.link.db and xxx.task.db. They are sqlite3 database
files. BTW, sqlite is used in every Firefox browser and iPod, so it is nothing
new, and it is pretty handy for small applications:
http://www.sqlite.org/docs.html

The two databases are used to preserve job status, so you can stop and restart
the crawler without losing progress.

To tail the latest tasks::

    sqlite3 taobao.carephilly.task.db "select * from tasks order by ROWID desc limit 5"

The tasks table becomes empty when all crawling tasks are finished.

To tail the latest links that have been discovered::

    sqlite3 taobao.carephilly.link.db "select * from links order by ROWID desc limit 5"

To start the other crawlers, use the following commands::

    ./taobao-crawler.py carephilly
    ./taobao-crawler.py maizhongfs.mall
    ./taobao-crawler.py gracegift
    ./okaybuy-crawler.py
    ./paixie-crawler.py

'paixie' has 11677 shoes. It takes a couple of hours with 2 workers running
from the US; it might be faster when running from China. You can also modify
the configuration file to increase the number of workers, but please be polite
and do not bring down their site.

Re-run the crawler
====================

The default policy allows catalog pages to be re-crawled no sooner than every
4 hours; for product detail pages, the limit is 24 hours. If you restart the
crawler right after one crawling pass, it will not do anything and will quit
pretty soon. If you want to force it to re-crawl from scratch, remove the
xxx.db files owned by the crawler.
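Conceptually, the re-visit policy boils down to a timestamp comparison along
the lines of the sketch below (illustrative only, not the actual LinkDB code in
bee.py). Deleting the db files discards the recorded last-visit timestamps,
which is why removing them forces a full re-crawl::

    import time

    def should_visit(last_visit_ts, revisit_interval):
        # last_visit_ts: seconds-since-epoch of the previous successful visit, or None
        # revisit_interval: e.g. 14400 (4h) for catalog pages, 86400 (24h) for detail pages
        if last_visit_ts is None:
            return True                  # never visited before
        return time.time() - last_visit_ts >= revisit_interval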
There are two db files for each crawler: one for the link db and one for the
task queue.

Logging
===========

Feel free to put the crawler in the background and pipe the log to a log file.
You can also alter the crawler script to point the log to a file and adjust the
logging level.

================================================
Notes for Configuration and Future Developments
================================================

Integration
==============

Currently, the default JSONDumper outputs the product descriptions as JSON
strings in text files. There are two ways to integrate the crawler with the
other parts of a system:

1. Subclass the Output class and write to the target storage directly.
2. "tail" the json file and pipe it to other applications that update the
   target storage.

The second solution is probably easier to test and debug. One thing to be aware
of is that one product can be output multiple times, depending on the re-crawl
policy. The Bee does not detect page changes; it simply re-crawls a page when
its revisit interval has expired. The downstream applications are responsible
for determining whether to insert new products or update existing ones.

Writing new crawlers
=====================

Writing a new crawler is not difficult if you are familiar with the HTML DOM,
regexes and Python.

Step 1: determine crawling paths and Seeker Rules
--------------------------------------------------

Typically, most sites can be crawled by "search all" or "search new goods".
Take Taobao as an example: I picked "所有宝贝" (all items), because the stores
are small and each store only has up to a few hundred products::

    http://gracegift.taobao.com/?price1=&price2=&search=y&pageNum=1&scid=0&keyword=&orderType=_time&old_starts=&categoryp=&pidvid=&viewType=list&isNew=&ends=

Then you determine the regexes for the "next pages" and the product detail
pages, and write a simple test script to try them out, for example
taobao-test-seeker.py.

paixie.net is pretty easy; its website is clean. okaybuy.com.cn is a little
more difficult: it has non-shoe products and multiple search entry links on the
page. I finally tightened the regex down to focus only on code=001000000 and
odrby=new. The initial seed is::

    http://www.okaybuy.com.cn/list.php?code=001000000&odrby=new&curpage=1

The seeker rules are::

    [
     ['^http://www.okaybuy.com.cn/list.php\?code=001000000&odrby=new&curpage=\d+',
      200, 14400, "simple_http_get", ["okaybuy_cat_seeker"], [], False,],
     ['^http://www.okaybuy.com.cn/com/\d+.html$',
      201, 86400, "simple_http_get", [], ["okaybuy_item_miner"], False,],
    ]

- The first rule means: only follow "search shoes - order by new products -
  page N".

  - 200: allows the seeker to go down 200 hops deep. It is actually too high;
    since each search page has links to the next 10 pages, 200 hops can exhaust
    2000 search result pages.
  - 14400: do not revisit the same search result page within 4 hours.
  - "simple_http_get": the name of the Fetcher to be used in the next step.
  - ["okaybuy_cat_seeker"]: the next step runs only one Seeker,
    "okaybuy_cat_seeker", on the page.
  - []: no Miner will be invoked; we do not want to run a Miner on the search
    result pages.
  - False: continue evaluating the next Seeker rule.

- The second rule defines how to generate the next crawl tasks for the product
  detail pages.

  - 201: it can go up to 201 hops deep. It should be at least 1 hop deeper than
    the maximum seeker hops.
  - 86400: only revisit a product detail page every 24 hours.
  - []: no Seeker will be invoked on the product detail pages.
  - ["okaybuy_item_miner"]: only one Miner will be invoked.
  - False: continue evaluating the next rule.

Step 2: Develop Miner
-----------------------

A Miner has to be developed for each site :-( Typically you download the page
to view its source. One thing to be aware of is that the SimpleHTTPFetcher
class does not execute JavaScript. You can use bee-test-0.py to dump the page;
alternatively, you can use "wget" or "curl".

The Firefox browser has a plug-in called Firebug
(https://addons.mozilla.org/en-US/firefox/addon/1843). It helps you find the
corresponding HTML source for any given element on the page, and it is very
useful for analyzing how to extract the information.

Then you can start writing the Miner class by copying one of the existing Miner
classes. The OkaybuyItemMiner in okaybuy.py is one example. It is also
recommended to write a simple test script to try it out; you can start from an
empty Miner class and then add one attribute at a time. See, for example,
tests-okaybuy-seeker.py.

BTW, the Miner.extract() interface allows you to extract multiple products from
one page.

Step 3: write the Job Rule
----------------------------

The job rule glues all the pieces together. I used a simple method,
gen_job_rules(), in each crawler to construct Job rules. You could actually
write them as an external JSON file and load it when starting the crawling job.
Typically, you can make a copy of an existing one and then update "encoding",
"seed_url" and "seeker_rules".

Components in Job Rule
``````````````````````````

The "class_name" parameter defines which implementation class is used for a
given component. "params" defines optional parameters that are passed into the
constructor of that class.

The whole crawling job is like a car: you can use different kinds of tires as
long as they fit the interfaces. For example, the Bee provides two
implementations of TaskQueue: DBTaskQueue and MemTaskQueue. MemTaskQueue is
faster than DBTaskQueue; however, if you stop the crawler process in the
middle, MemTaskQueue loses all pending tasks that are still in the queue. In
contrast, DBTaskQueue uses a sqlite database to persist the contents of the
task queue, which allows you to stop and restart the crawler at any point in
time. Similarly, you could implement a TaskQueue based on a MySQL server or
some kind of message queue; then you could actually run the crawling job from
multiple machines.

Seed Tasks
`````````````

Seed tasks are inserted into the task queue when the job starts. You can
describe multiple seed tasks in that section, but in most cases you only need
one. At the other extreme, if you only want to grab a few specific pages, you
can define all of them as seed tasks and use a very small max_hop value in the
seeker rules, or only define Miner tasks.

Step 4: Put Them Together
----------------------------

Create a new package file, such as taobao.py; put your Job Rule template and
Miner class in it; then write a simple start-up script, such as
crawlers/taobao-crawler.py.

Step 5: Test
--------------

Run it and observe whether there are any errors. You may need to adjust your
Miner class and Seeker rules. You can also lower the logging level to DEBUG to
see more detailed information. You can also observe xxx.prod.json, xxx.link.db
and xxx.task.db, but most importantly, watch the job status log line. The first
thing you want to avoid is the crawler visiting a lot of links but finding only
very few products. If the number of pending tasks grows too rapidly, that is
not a good sign either.
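One simple health check while tuning the rules is to watch how the pending-task
backlog evolves over time. A tiny helper like the one below works as a sketch;
it is not part of the Bee and relies only on the task db file and the "tasks"
table name shown in the earlier sqlite3 examples::

    import sqlite3
    import time

    def pending_tasks(task_db):
        # The task queue is persisted in the "tasks" table of xxx.task.db.
        conn = sqlite3.connect(task_db)
        try:
            (n,) = conn.execute("select count(*) from tasks").fetchone()
            return n
        finally:
            conn.close()

    # Print the backlog every 10 seconds; stop with Ctrl-C.
    while True:
        print time.strftime("%H:%M:%S"), pending_tasks("taobao.carephilly.task.db")
        time.sleep(10)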
The best way to debug is to stop the crawler and check the contents of link.db
and task.db.

Once you have finalized your design, you can increase the number of workers
based on the size of the site. Then reset link.db and task.db, start your
crawler, and have fun :-)

============================
Design Notes for the Bee
============================

The structure of the Bee is quite simple. It is a self-programming automaton:
it keeps generating new tasks for itself as it crawls.

TaskDescription
====================

A task is described by a dictionary::

    {
        "url": url,
        "revisit_interval": revisit_interval,
        "fetcher": next_fetcher,
        "seekers": next_seekers,
        "miners": next_miners,
        "hop": hop,
    }

It is stored as a JSON string in the task queue. Tasks are created either by
you in the Job Rule as the seed tasks, or by the Seekers. Tasks are executed by
Workers.

TaskQueue
============

The TaskQueue holds TaskDescriptions. There is only one TaskQueue instance per
crawling job. If the TaskQueue instance is accessible from multiple machines,
the crawling job can be distributed onto multiple machines.

- Workers pop tasks from its head.
- Seekers generate new tasks to make the crawler go further into the site.
- Workers push new tasks into the TaskQueue on behalf of the Seekers.
- Workers can also re-queue tasks under certain error conditions.

There are two implementations: MemTaskQueue and DBTaskQueue.

Worker
========

The Worker is the driving force of the Bee.

- It pops tasks from the head of the TaskQueue.
- It checks the LinkDB to determine whether to visit/revisit the given url.
- It handles the error-defer-retry logic by re-queuing tasks.
- It calls the Fetcher to retrieve the page.
- It feeds the Page object to all Seekers and all Miners that are listed in the
  task description.
- It pushes new tasks generated by the Seekers into the TaskQueue.
- It writes products extracted by the Miners into the Output.
- It keeps looping until stop() is invoked.

A Job can start multiple Worker instances on one or multiple machines.

Fetcher
=========

The Fetcher is responsible for accessing the Internet. For a given url, it
returns a Page object. The Bee provides one Fetcher implementation,
SimpleHTTPFetcher.

Page
=====

A Page holds the data that has been retrieved from the Internet. The Bee
provides one implementation, HTMLPage.

- url: the final url that the HTTP request landed on.
- data: the raw data from the Internet.
- soup: the data parsed by BeautifulSoup.

LinkDB
========

The LinkDB holds the status and access history of links. The Bee provides one
implementation, SqliteLinkDB.

Seeker
========

The Seeker is responsible for finding new paths and generating new tasks:
tasks to crawl further into the site, or tasks to extract products. The Bee
provides one implementation, RuleBasedSeeker, which is driven by seeker rules.
The format of the seeker rules has been described in previous chapters. While
the RuleBasedSeeker is already very powerful, you may still need to develop new
Seeker implementations for very difficult sites.

Miner
=======

The Miner extracts structured information from a page. It has to be customized
for each site.

Output
========

The Output is the channel where the Worker pushes the final products. The Bee
provides one implementation, JSONDumper.

Job
=====

- It reads the Job Rule.
- It initializes all data structures:

  - TaskQueue
  - LinkDB
  - Output
  - FetcherFactory
  - SeekerFactory
  - MinerFactory

- It feeds the seed tasks into the TaskQueue.
- It creates Workers and starts them in worker threads.
- It then waits for the job to be done, which means:

  - the TaskQueue is empty
  - no more new tasks are coming

- Meanwhile, it prints status lines to the logging device when it has the
  chance.

BeautifulSoup
================

BeautifulSoup is a wonderful package for parsing HTML/XML. It can be found at
http://www.crummy.com/software/BeautifulSoup/documentation.html

A copy is included in the Bee package; it is a patched version that tolerates
encoding errors.
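If BeautifulSoup's encoding detection still gets a Chinese page wrong, the
usual workaround outside the Bee (inside the Bee, the "encoding" setting in the
Job rule serves the same purpose) is to force the encoding when parsing, using
BeautifulSoup 3's fromEncoding argument. Below is a standalone example, using
okaybuy.com.cn as mentioned in the Overview; the import assumes the standard
BeautifulSoup 3 module name::

    import urllib2
    from BeautifulSoup import BeautifulSoup

    html = urllib2.urlopen("http://www.okaybuy.com.cn/").read()
    # Override BeautifulSoup's own guess, which is often wrong for gb2312/gbk pages.
    soup = BeautifulSoup(html, fromEncoding="gb2312")
    print soup.originalEncoding    # the encoding the parser actually used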