UPGRADE
- WWWOFFLE - World Wide Web Offline Explorer - Version 2.4a
- =========================================================
- WHAT?
- -----
- The format of the cache that WWWOFFLE uses to store the web pages has changed in
- version 2.x compared to the previous versions. If you have used WWWOFFLE
- version 1.x then you *MUST* upgrade the existing cache before you can use the
- new version of the program.
- HOW?
- ----
- *** READ ALL THIS SECTION BEFORE DOING ANYTHING ELSE ***
- When you compile WWWOFFLE there is another program called 'upgrade-cache' that
- is also compiled. You need to run this program to convert the cache from the
- old format to the new one.
- There are several ways to perform the upgrade. The following applies to all of
- them.
- In each case you must run upgrade-cache, which takes one argument: the name of
- the cache directory in use (usually /var/spool/wwwoffle). While it runs, the
- program prints informational and warning messages; these may be useful.
- Option 1 - Be reckless
- Run 'upgrade-cache /var/spool/wwwoffle', watch the messages go flashing by and
- hope that it works.
- Option 2 - Be brave
- With sh/bash run 'upgrade-cache /var/spool/wwwoffle > upgrade.log 2>&1'
- or with csh/tcsh run 'upgrade-cache /var/spool/wwwoffle >& upgrade.log'
- read the messages and check the warnings.
- Option 3 - Be safe
- Backup the cache first then follow option 2.
- With GNU tar I suggest that you use the --atime-preserve option so that the
- access times of the files in the cache are not modified by performing the
- backup. The index and purge options in WWWOFFLE use these so it is important.
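For readers who script their backups, the effect that --atime-preserve aims for can be sketched in Python. This is not how GNU tar implements the option. It is a minimal illustration that records each regular file's access and modification times, writes the archive, and then puts the times back. The function name and the archive location are hypothetical, and the archive should be written outside the cache directory.

```python
import os
import tarfile


def backup_preserving_atime(cache_dir, archive_path):
    """Archive cache_dir, then restore each file's original timestamps.

    Returns the number of files whose timestamps were restored.
    """
    # Record the timestamps before any file content is read.
    saved = {}
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # symbolic links are archived as links, never read
            st = os.stat(path)
            saved[path] = (st.st_atime_ns, st.st_mtime_ns)

    # Reading the files for the archive may update their access times.
    top_name = os.path.basename(os.path.normpath(cache_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(cache_dir, arcname=top_name)

    # Put the recorded access and modification times back.
    for path, times in saved.items():
        os.utime(path, ns=times)
    return len(saved)
```

A call such as backup_preserving_atime("/var/spool/wwwoffle", "/tmp/wwwoffle-backup.tar.gz") would be made before running upgrade-cache, with the program not running so that no files change during the backup.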
- When it finishes, the host-named directories in /var/spool/wwwoffle are gone,
- moved into a new sub-directory called http. The outgoing directory and this
- http directory are the only directories that should be left.
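The expected result can be checked with a few lines of Python. The sketch below assumes only what is stated above, namely that http and outgoing are the only directories that should remain. The function name is hypothetical.

```python
import os

# The only directories that should remain after a successful upgrade.
EXPECTED_DIRS = {"http", "outgoing"}


def leftover_directories(cache_dir):
    """Return the names of directories in cache_dir that were not converted."""
    return sorted(
        entry.name
        for entry in os.scandir(cache_dir)
        if entry.is_dir(follow_symlinks=False) and entry.name not in EXPECTED_DIRS
    )
```

An empty list from leftover_directories("/var/spool/wwwoffle") means the directory layout matches the description. Any names returned are old host directories that still contain something and are worth comparing against the warnings in the log.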
- If there is a warning message then you should decide what needs doing. The
- cause could be any of the following:
- That upgrade-cache was run by a user without write permissions.
- That one or more files were changed while the program was running.
- That there is a spare file in one of the host directories that needs deleting.
- That there is a symbolic link that does not point anywhere.
- If the upgrade-cache program crashes then that is a bug - tell me.
- If you are left with many files or directories and the warnings are unclear then
- this may be a bug - tell me.
- If there are only a small number of spare files or directories then just delete
- them; you probably won't notice that they are missing.
- WHY?
- ----
- The existing scheme for naming the files in the cache had some problems; the
- new one is better.
- 0) It was designed for my personal use, which did not involve many stored
- web-pages and did not visit any pages with unusual names.
- You could say that the hacks that I implemented to get it working as I wrote
- it were not well enough thought out. But at the time I wrote it I wanted to
- get it working as soon as possible and did not write it with the future
- growth in mind. The scheme as implemented has not caused any problems for me
- personally.
- 1) It was possible for a web-page that has several possible arguments to be
- stored incorrectly.
- This is because for each page that has arguments a hash value is computed
- from the arguments to provide a unique filename. The reason for this failing
- is that I used a hash function that I made up on the spot, giving a 32-bit
- hashed value. This seemed to be sufficient for 4 billion sub-pages with the
- same path name for each host and path combination. As it turned out the hash
- function was not strong enough and the number of possibilities was much
- smaller.
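The effect described here can be illustrated with a toy example. The hash function actually used in version 1.x is not shown in this file, so weak_hash below is a hypothetical stand-in (a plain sum of byte values), chosen only because it has a 32-bit result but maps many different argument strings to the same value. zlib.crc32 is used as a better-mixed 32-bit function for comparison. It is not claimed to be what WWWOFFLE uses.

```python
import zlib


def weak_hash(args):
    """Hypothetical ad hoc hash: sum of the byte values, truncated to 32 bits."""
    return sum(args.encode("ascii")) & 0xFFFFFFFF


def distinct_values(hash_func, inputs):
    """Count how many different hash values the inputs produce."""
    return len({hash_func(item) for item in inputs})


# 100 different argument strings for the same page.
queries = ["a=%d&b=%d" % (i, j) for i in range(10) for j in range(10)]

# The sum depends only on i + j, so just 19 distinct values are produced.
print(distinct_values(weak_hash, queries))

# A better-mixed 32-bit function for comparison.
print(distinct_values(lambda q: zlib.crc32(q.encode("ascii")), queries))
```

With the weak function, "a=1&b=2" and "a=2&b=1" would share a file name, so one page would overwrite the other in the cache. This is the kind of failure described above.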
- 2) There was no provision for any protocol other than http.
- Very quickly the idea of doing ftp as well came to my mind, but it could not be
- implemented easily or cleanly with the old scheme.
- 3) The outgoing directory was inefficient for large numbers of files.
- An increasing sequence of numbers was used, resulting in slow access. This was
- fixed in version 1.2x, but there could still be many requests for the same URL
- in the directory. Now a unique name based on a hash is used so that only one
- request for each page is stored.
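A minimal sketch of why a hash-based name stores each request only once follows. The hash function and file naming that WWWOFFLE actually uses are not given in this file, so SHA-256 in hexadecimal, the "O" prefix and the function name are all assumptions made for illustration.

```python
import hashlib
import os


def queue_request(outgoing_dir, url):
    """Store a request for url in outgoing_dir and return the file path used."""
    # The file name is derived from the URL alone (assumed scheme, see above).
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(outgoing_dir, "O" + digest)
    # A repeated request for the same URL overwrites the same file.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(url + "\n")
    return path
```

Because the name depends only on the URL, requesting the same page twice leaves one file in the directory, whereas a numbered sequence would leave two.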
- 4) Bad characters and url-encoded URLs caused problems.
- Some URLs that had unusual characters, including URL-encoded sequences, caused
- problems. The URLs http://www.foo.com/~bar and http://www.foo.com/%7Ebar refer
- to the same page but could be stored in different files.
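One way to make both spellings map to the same cache entry is to bring every URL to a canonical form before naming the file. The sketch below is not taken from the WWWOFFLE source. It follows the general rule that percent-escapes of unreserved characters (letters, digits, "-", ".", "_", "~") can be decoded safely, and leaves all other escapes in place with upper-case hex digits. The function name is hypothetical.

```python
import re
import string

# Characters that never need percent-encoding in a URL.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def canonical_url(url):
    """Decode escapes of unreserved characters; upper-case all other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, url)


print(canonical_url("http://www.foo.com/%7Ebar"))  # http://www.foo.com/~bar
```

Escapes such as %2F are deliberately left encoded, because decoding them would change the meaning of the path.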
- 5) It is now a neater design with no special cases.
- Previously only files with arguments needed hashing; now all of them use a
- hash, which simplifies the logic. The format of the outgoing directory is the
- same as the other directories.
- 6) There are more possibilities for future expansion.
- It is now possible to consider adding more files to the cache to store extra
- information about a URL, for example a password. It is obvious now that this
- would be another file with the same hash value but a different prefix.
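The idea of sibling files that share a hash but differ in prefix can be sketched as follows. Only the protocol sub-directory (http) and the per-host directories come from the description above. The hash function (SHA-256 in hexadecimal), the prefix letters "D" and "P", and the function name are assumptions made for illustration, not the names WWWOFFLE uses.

```python
import hashlib
import os
from urllib.parse import urlsplit


def cache_file(cache_dir, url, prefix):
    """Build the path of one cache file for url: protocol/host/prefix + hash."""
    parts = urlsplit(url)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, parts.scheme, parts.hostname, prefix + digest)


url = "http://www.foo.com/~bar"
page_file = cache_file("/var/spool/wwwoffle", url, "D")      # hypothetical: page data
password_file = cache_file("/var/spool/wwwoffle", url, "P")  # hypothetical: password
```

Both paths lie in the same host directory and end in the same hash, so extra information about a URL can be found from the page file's name alone, and a new protocol only needs a new top-level directory.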