资源说明:A tool for parsing Australian government websites into a contact database
This project is an effort to parse the contents of Australian parliament websites into a database suitable for mail merging, filtering, etc. Information for users --------------------- To use this tool, you will need to the following components installed: * python 2.6 * python-httplib2 * python-beautifulsoup (for some parsers) While this tool is designed to work on any operating system, it has only been tested on Linux. If you run into compatibility problems, please file a bug report as described below. To check the command line arguments for this tool run: ./polilist --help More comprehensive information will be added to this README file as the tool matures. If you come across a bug or would like a feature added please request this at https://github.com/Smattr/polilist/issues. Information for developers -------------------------- For now, while this project is in its early stages this README file will contain a basic overview of the proposed design of this system. Please follow these guidelines during development: - Unless you have a very good reason, always use the hardcache provider during development. If not, you will quickly upset website admins. - Don't commit uncommented code. I don't want to have to have to divine the meaning of your code. - If you are making a major architectural change, please speak to me about it before committing. - If you can see problems in the design that will cause issues down the road, please flag them with me as soon as possible. - If you come across an issue that you can't (or don't want to) fix, please raise it at https://github.com/Smattr/polilist/issues. - If possible, make sure your changes don't break any of the tests (ensure tests/runtests executes correctly). Along with these, follow general DVCS-project politeness (informative commit messages, <80 characters in first line of commit message, don't create remote heads, don't commit code that doesn't compile, etc.). If you're about to do something that you think might upset people, ask them before pushing. Design ------ Apologies for the confusing manner in which this is described. It's not very clear in my head right now. +-----------+ ++-|Base Parser| +----+ || +-----------+ +-------------+<-------|Main|------+ +--------+ |Base Provider| +----+ +---->|Parser 1| +-------------+ | | +--------+ / \ | | | / \ | | +--------+ / \ | +---->|Parser 2| +----------+ +----------+ | +--------+ |Provider 1| |Provider 2| | +----------+ +----------+ | | \ / +-----------+ |Base Output| +-----------+ / \ / \ / \ +----------+ +-------------+ |CSV Output| |SQLite Output| +----------+ +-------------+ The main logic of this tool has a provider (an object with the ability to retrieve documents over HTTP). The reason for having more than one type of provider is to allow different providers to implement different caching policies; in particular to allow a provider with a very lazy cache coherence protocol for use in testing so as not to overload websites with unnecessary HTTP requests. Main has a set of parsers, each specific to a given webpage. Each parser encapsulates the format of a page by providing a function to retrieve the contacts on that page (or pages linked from there). In this way, Main should just be able to iterate through its known parsers and call each one. Main has a single Output that it uses when it comes time to write out the contact information it has discovered. Each Output defines a format for the file(s) it writes. That should all be as clear as mud now. Testing ------- The directory tests contains all the unit and regression tests for this project. To run them, execute the script runtests in this directory. Note that running the tests may cause some data to be downloaded and you should expect some tests to fail if you do not have an available connection to the internet. If you look at runtests, you will see that it just executes every executable file in the subdirectories. To create a new test just create an executable file (can be any interpreted language) in one of the subdirectories. To decide where your new test should live, refer to the following list: - tests/integration Tests that only interact with the project through its external interfaces. These are tests designed to exercise the project functionality that is exposed to end users. - tests/regression Tests that prove the absence of some previously observed bug. The purpose of these tests is to ensure a bug that has been fixed in the past is not reintroduced by new changes. When closing a bug you should also commit a regression test that validates the fix. - tests/unit Tests that probe a feature of a specific module in isolation. These do not have to run through the same setup as the main program logic. Links ----- Directory of NSW local councils: http://www.dlg.nsw.gov.au/dlg/dlghome/dlg_localgovdirectory.asp
本源码包内暂不包含可直接显示的源代码文件,请下载源码包。