gb_io.w
上传用户:rrhhcc
上传日期:2015-12-11
资源大小:54129k
文件大小:21k
- % This file is part of the Stanford GraphBase (c) Stanford University 1993
- @i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES!
- deftitle{GB_,IO}
- @* Introduction. This is {sc GB_,IO}, the input/output module used
- by all GraphBase routines to access data~files. It doesn't actually do
- any output; but somehow `input/output' sounds like a more useful title
- than just `input'.
- All files of GraphBase data are designed to produce identical results on
- almost all existing computers and operating systems. Each line of each file
- contains at most 79 characters. Each character is either a blank or a
- digit or an uppercase letter or a lowercase letter or a standard punctuation
- mark. Blank characters at the end of each line are ``invisible''; that is,
- they have no perceivable effect. Hence identical results will be obtained on
- record-oriented systems that pad every line with blanks.
- The data is carefully sum-checked so that defective input files have little
- chance of being accepted.
- @ Changes might be needed when these routines are ported to different
- systems. Sections of the program that are most likely to require such changes
- are listed under `system dependencies' in the index.
- A validation program is provided so that installers can tell if {sc GB_,IO}
- is working properly. To make the test, simply run .{test_io}.
- @(test_io.c@>=
- #include "gb_io.h"
- /* all users of {sc GB_,IO} should include this header file */
- #define exit_test(m) /* we invoke this macro if something goes wrong */
- {@+fprintf(stderr,"%s!n(Error code = %ld)n",m,io_errors);@+return -1;@+}
- @t2@>@/
- int main()
- {
- @<Test the |gb_open| routine; exit if there's trouble@>;
- @<Test the sample data lines; exit if there's trouble@>;
- @<Test the |gb_close| routine; exit if there's trouble@>;
- printf("OK, the gb_io routines seem to work!n");
- return 0;
- }
- @ The external variable |io_errors| mentioned in the previous section
- will be set nonzero if any anomalies are detected. Errors won't occur
- in normal use of GraphBase programs, so no attempt has been made to
- provide a user-friendly way to decode the nonzero values that
- |io_errors| might assume. Information is simply gathered in binary
- form; system wizards who might need to do a bit of troubleshooting
- should be able to decode |io_errors| without great pain.
- @d cant_open_file 0x1 /* bit set in |io_errors| if |fopen| fails */
- @d cant_close_file 0x2 /* bit set if |fclose| fails */
- @d bad_first_line 0x4 /* bit set if the data file's first line isn't legit */
- @d bad_second_line 0x8 /* bit set if the second line doesn't pass muster */
- @d bad_third_line 0x10 /* bit set if the third line is awry */
- @d bad_fourth_line 0x20 /* guess when this bit is set */
- @d file_ended_prematurely 0x40 /* bit set if |fgets| fails */
- @d missing_newline 0x80 /* bit set if line is too long or |'n'| is missing */
- @d wrong_number_of_lines 0x100 /* bit set if the line count is wrong */
- @d wrong_checksum 0x200 /* bit set if the checksum is wrong */
- @d no_file_open 0x400 /* bit set if user tries to close an unopened file */
- @d bad_last_line 0x800 /* bit set if final line has incorrect form */
- @ The CEE/ code for {sc GB_,IO} doesn't have a main routine; it's just a
- bunch of subroutines to be incorporated into programs at a higher level
- via the system loading routine. Here is the general outline of .{gb_io.c}:
- @p
- @<Header files to include@>@;
- @h
- @<External declarations@>@;
- @<Private declarations@>@;
- @<Internal functions@>@;
- @<External functions@>
- @ Every external variable is declared twice in this .{CWEB} file:
- once for {sc GB_,IO} itself (the ``real'' declaration for storage
- allocation purposes) and once in .{gb_io.h} (for cross-references
- by {sc GB_,IO} users).
- @<External declarations@>=
- long io_errors; /* record of anomalies noted by {sc GB_,IO} routines */
- @ @(gb_io.h@>=
- @<Header...@>@;
- extern long io_errors;
- /* record of anomalies noted by {sc GB_,IO} routines */
- @ We will stick to standard CEE/-type input conventions. We'll also have
- occasion to use some of the standard string operations.
- @<Header...@>=
- #include <stdio.h>
- #ifdef SYSV
- #include <string.h>
- #else
- #include <strings.h>
- #endif
- @* Inputting a line. The {sc GB_,IO} routines get their input from
- an array called |buffer|. This array is internal to {sc
- GB_,IO}---its contents are hidden from user programs. We make it 81
- characters long, since the data is supposed to have at most 79
- characters per line, followed by newline and null.
- @<Private...@>=
- static char buffer[81]; /* the current line of input */
- static char *cur_pos=buffer; /* the current character of interest */
- static FILE *cur_file; /* current file, or |NULL| if none is open */
- @ Here's a basic subroutine to fill the |buffer|. The main feature of interest
- is the removal of trailing blanks. We assume that |cur_file| is open.
- Notice that a line of 79 characters (followed by |'n'|) will just fit into
- the buffer, and will cause no errors. A line of 80 characters will
- be split into two lines and the |missing_newline|
- message will occur, because of the way |fgets| is defined. A |missing_newline|
- error will also occur if the file ends in the middle of a line, or if
- a null character (|' '|) occurs within a line.
- @<Internal...@>=
- static void fill_buf()
- {@+register char *p;
- if (!fgets(buffer,sizeof(buffer),cur_file)) {
- io_errors |= file_ended_prematurely; buffer[0]=more_data=0;
- }
- for (p=buffer; *p; p++) ; /* advance to first null character */
- if (p--==buffer || *p!='n') {
- io_errors |= missing_newline; p++;
- }
- while (--p>=buffer && *p==' ') ; /* move back over trailing blanks */
- *++p='n'; *++p=0; /* newline and null are always present at end of line */
- cur_pos=buffer; /* get ready to read |buffer[0]| */
- }
- @* Checksums. Each data file has a ``magic number,'' which is defined to be
- $$biggl(sum_l 2^l c_lbiggr) bmod p,.$$
- Here $p$ is a large prime number, and $c_l$ denotes the internal code
- corresponding to the $l$th-from-last
- data character read (including newlines but not nulls).
- The ``internal codes'' $c_l$ are computed in a system-independent way:
- Each character |c| in the actual encoding scheme being used has a
- corresponding |icode|, which is the same on all systems. For example,
- the |icode| of |'0'| is zero, regardless of whether |'0'| is actually
- represented in ASCII or EBCDIC or some other scheme. (We assume that
- every modern computer system is capable of printing at least 95
- different characters, including a blank space.)
- We will accept a data file as error-free if it has the correct number of
- lines and ends with the proper magic number.
- @<Private...@>=
- static char icode[256]; /* mapping of characters to internal codes */
- static long checksum_prime=(1L<<30)-83;
- /* large prime such that $2p+|unexpected_char|$ won't overflow */
- static long magic; /* current checksum value */
- static long line_no; /* current line number in file */
- static long final_magic; /* desired final magic number */
- static long tot_lines; /* total number of data lines */
- static char more_data; /* is there data still waiting to be read? */
- @ The |icode| mapping is defined by a single string, |imap|, such that
- character |imap[k]| has |icode| value~|k|. There are 96 characters
- in |imap|, namely the 94 standard visible ASCII codes plus space
- and newline. If EBCDIC code is used instead of ASCII, the
- cents sign rlap{.{kern.05em/}}.c should take the place of single-left-quote
- .{char``}, and .{char5}~should take the place of/~.{char`~}.
- All characters that don't appear in |imap| are given the same |icode|
- value, called |unexpected_char|. Such characters should be avoided in
- GraphBase files whenever possible. (If they do appear, they can still
- get into a user's data, but we don't distinguish them from each other
- for checksumming purposes.)
- The |icode| table actually plays a dual role, because we've rigged it so that
- codes 0--15 come from the characters |"0123456789ABCDEF"|. This facilitates
- conversion of decimal and hexadecimal data. We can also use it for
- radices higher than 16.
- @d unexpected_char 127 /* default |icode| value */
- @<Private...@>=
- static char *imap="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
- abcdefghijklmnopqrstuvwxyz_^~&@@,;.:?!%#$+-*/|\<=>()[]{}`'" n";
- @ Users of {sc GB_,IO} can look at the |imap|, but they can't change it.
- @<External fun...@>=
- char imap_chr(d)
- long d;
- {
- return d<0 || d>strlen(imap)? ' ': imap[d];
- }
- @#
- long imap_ord(c)
- char c;
- {
- @<Make sure that |icode| has been initialized@>;
- return (c<0||c>255)? unexpected_char: icode[c];
- }
- @ @(gb_io.h@>=
- #define unexpected_char @tquad@> 127
- extern char imap_chr(); /* the character that maps to a given character */
- extern long imap_ord(); /* the ordinal number of a given character */
- @ @<Make sure that |icode| has been initialized@>=
- if (!icode['1']) icode_setup();
- @ @<Internal...@>=
- static void icode_setup()
- {@+register long k;
- register char *p;
- for (k=0;k<256;k++) icode[k]=unexpected_char;
- for (p=imap,k=0; *p; p++,k++) icode[*p]=k;
- }
- @ Now we're ready to specify some external subroutines that do
- input. Calling |gb_newline()| will read the next line of
- data into |buffer| and update the magic number accordingly.
- @(gb_io.h@>=
- extern void gb_newline(); /* advance to next line of the data file */
- extern long new_checksum(); /* compute change in magic number */
- @ Users can compute checksums as |gb_newline| does, but they can't
- change the (private) value of |magic|.
- @<External f...@>=
- long new_checksum(s,old_checksum)
- char *s; /* a string */
- long old_checksum;
- {@+register long a=old_checksum;
- register char*p;
- for (p=s; *p; p++)
- a=(a+a+imap_ord(*p)) % checksum_prime;
- return a;
- }
- @ The magic checksum is not affected by lines that begin with .*.
- @<External f...@>=
- void gb_newline()
- {
- if (++line_no>tot_lines) more_data=0;
- if (more_data) {
- fill_buf();
- if (buffer[0]!='*')
- magic=new_checksum(buffer,magic);
- }
- }
- @ Another simple routine allows a user to read (but not write) the
- variable |more_data|.
- @(gb_io.h@>=
- extern long gb_eof(); /* has the data all been read? */
- @ @<External f...@>=
- long gb_eof() { return !more_data; }
- @* Parsing a line. The user can input characters from the buffer in several
- ways. First, there's a basic |gb_char()| routine, which returns
- a single character. The character is |'n'| if the last character on the
- line has already been read (and it continues to be |'n'| until the user calls
- |gb_newline|).
- The current position in the line, |cur_pos|, always advances when |gb_char|
- is called, unless |cur_pos| was already at the end of the line.
- There's also a |gb_backup()| routine, which moves |cur_pos| one place
- to the left unless it was already at the beginning.
- @(gb_io.h@>=
- extern char gb_char(); /* get next character of current line, or |'n'| */
- extern void gb_backup(); /* move back ready to scan a character again */
- @ @<External f...@>=
- char gb_char()
- {
- if (*cur_pos) return (*cur_pos++);
- return 'n';
- }
- @#
- void gb_backup()
- {
- if (cur_pos>buffer)
- cur_pos--;
- }
- @ There are two ways to read numerical data. The first, |gb_digit(d)|,
- expects to read a single character in radix~|d|, using |icode| values
- to specify digits greater than~9. (Thus, for example, |'A'| represents
- the hexadecimal digit for decimal~10.)
- If the next character is a valid |d|-git,
- |cur_pos| moves to the next character and the numerical value is returned.
- Otherwise |cur_pos| stays in the same place and $-1$ is returned.
- The second routine, |gb_number(d)|, reads characters and forms an
- unsigned radix-|d| number until the first non-digit is encountered.
- The resulting number is returned; it is zero if no digits were found.
- No errors are possible with this routine, because it uses
- |unsigned long| arithmetic.
- @(gb_io.h@>=
- extern long gb_digit(); /* |gb_digit(d)| reads a digit between 0 and |d-1| */
- extern unsigned long gb_number(); /* |gb_number(d)| reads a radix-|d| number */
- @ The value of |d| should be at most 127, if users want their programs to be
- portable, because CEE/ does not treat larger |char| values in a
- well-defined manner. In most applications, |d| is of course either 10 or 16.
- @<External f...@>=
- long gb_digit(d)
- char d;
- {
- icode[0]=d; /* make sure |' '| is a nondigit */
- if (imap_ord(*cur_pos)<d) return icode[*cur_pos++];
- return -1;
- }
- @#
- unsigned long gb_number(d)
- char d;
- {@+register unsigned long a=0;
- icode[0]=d; /* make sure |' '| is a nondigit */
- while (imap_ord(*cur_pos)<d)
- a=a*d+icode[*cur_pos++];
- return a;
- }
- @ The final subroutine for fetching data is |gb_string(p,c)|, which
- stores a null-terminated string into locations starting at~|p|.
- The string starts at |cur_pos| and ends just before the first appearance
- of character |c|. If |c=='n'|, the string will stop at the end of the line.
- If |c| doesn't appear in the buffer at or after |cur_pos|, the last character
- of the string will be the |'n'| that is always inserted at the end
- of a line, unless the entire line has already been read. (If the entire
- line has previously been read, the empty string is always returned.)
- After the string has been copied, |cur_pos| advances past it.
- In order to use this routine safely, the user should first check that
- there is room to store up to 81 characters beginning at location~|p|.
- A suitable place to put the result, called |str_buf|, is provided
- for the user's convenience.
- The location following the stored string is returned. Thus, if the
- stored string has length~|l| (not counting the null character that is
- stored at the end), the value returned will be |p+l+1|.
- @(gb_io.h@>=
- #define STR_BUF_LENGTH 160
- extern char str_buf[]; /* safe place to receive output of |gb_string| */
- extern char *gb_string(); /* |gb_string(p,c)| reads a string delimited by |c|
- into bytes starting at |p| */
- @ @d STR_BUF_LENGTH 160
- @<External f...@>=
- char str_buf[STR_BUF_LENGTH]; /* users can put strings here if they wish */
- char *gb_string(p,c)
- char *p; /* where to put the result */
- char c; /* character following the string */
- {
- while (*cur_pos && *cur_pos!=c)
- *p++=*cur_pos++;
- *p++=0;
- return p;
- }
- @ Here's how we test those routines in .{test_io}: The first line of test
- data consists of 79 characters, beginning with 64 zeroes and ending with
- `.{123456789ABCDEF}'. The second line is completely blank. The third
- and final line says `.{Oops:(intentional mistake)}'.
- @<Test the sample data lines...@>=
- if (gb_number(10)!=123456789)
- io_errors |= 1L<<20; /* decimal number not working */
- if (gb_digit(16)!=10)
- io_errors |= 1L<<21; /* we missed the .A following the decimal number */
- gb_backup();@+ gb_backup(); /* get set to read `.{9A}' again */
- if (gb_number(16)!=0x9ABCDEF)
- io_errors |= 1L<<22; /* hexadecimal number not working */
- gb_newline(); /* now we should be scanning a blank line */
- if (gb_char()!='n')
- io_errors |= 1L<<23; /* newline not inserted at end */
- if (gb_char()!='n')
- io_errors |= 1L<<24; /* newline not implied after end */
- if (gb_number(60)!=0)
- io_errors |= 1L<<25; /* number should stop at null character */
- {@+char temp[100];
- if (gb_string(temp,'n')!=temp+1)
- io_errors |= 1L<<26; /* string should be null after end of line */
- gb_newline();
- if (gb_string(temp,':')!=temp+5 || strcmp(temp,"Oops"))
- io_errors |= 1L<<27; /* string not read properly */
- }
- if (io_errors)
- exit_test("Sorry, it failed. Look at the error code for clues");
- if (gb_digit(10)!=-1) exit_test("Digit error not detected");
- if (gb_char()!=':')
- io_errors |= 1L<<28; /* lost synch after |gb_string| and |gb_digit| */
- if (gb_eof())
- io_errors |= 1L<<29; /* premature end-of-file indication */
- gb_newline();
- if (!gb_eof())
- io_errors |= 1L<<30; /* postmature end-of-file indication */
- @* Opening a file. The call |gb_raw_open("foo")| will open file |"foo"| and
- initialize the checksumming process. If the file cannot be opened,
- |io_errors| will be set to |cant_open_file|, otherwise
- |io_errors| will be initialized to zero.
- The call |gb_open("foo")| is a stronger version of |gb_raw_open|, which
- is used for standard GraphBase data files like |"words.dat"| to make
- doubly sure that they have not been corrupted. It returns the current value
- of |io_errors|, which will be nonzero if any problems were detected
- at the beginning of the file.
- @<Test the |gb_open| routine...@>=
- if (gb_open("test.dat")!=0)
- exit_test("Can't open test.dat");
- @ @d gb_raw_open gb_r_open /* abbreviation for Procrustean external linkage */
- @(gb_io.h@>=
- #define gb_raw_open gb_r_open
- extern void gb_raw_open(); /* open a file for GraphBase input */
- extern long gb_open(); /* open a GraphBase data file; return 0 if OK */
- @ @<External f...@>=
- void gb_raw_open(f)
- char *f;
- {
- @<Make sure that |icode|...@>;
- @<Try to open |f|@>;
- if (cur_file) {
- io_errors=0;
- more_data=1;
- line_no=magic=0;
- tot_lines=0x7fffffff; /* allow ``infinitely many'' lines */
- fill_buf();
- }@+else io_errors=cant_open_file;
- }
- @ Here's a possibly system-dependent part of the code: We try first to
- open the data file by using the file name itself as the path name;
- failing that, we try to prefix the file name with the name of the
- standard directory for GraphBase data, if the program has been compiled
- with |DATA_DIRECTORY| defined.
- @^system dependencies@>
- @<Try to open |f|@>=
- cur_file=fopen(f,"r");
- @^system dependencies@>
- #ifdef DATA_DIRECTORY
- if (!cur_file && (strlen(DATA_DIRECTORY)+strlen(f)<STR_BUF_LENGTH)) {
- sprintf(str_buf,"%s%s",DATA_DIRECTORY,f);
- cur_file=fopen(str_buf,"r");
- }
- #endif
- @ @<External f...@>=
- long gb_open(f)
- char *f;
- {
- strncpy(file_name,f,sizeof(file_name)-1);
- /* save the name for use by |gb_close| */
- gb_raw_open(f);
- if (cur_file) {
- @<Check the first line; return if unsuccessful@>;
- @<Check the second line; return if unsuccessful@>;
- @<Check the third line; return if unsuccessful@>;
- @<Check the fourth line; return if unsuccessful@>;
- gb_newline(); /* the first line of real data is now in the buffer */
- }
- return io_errors;
- }
- @ @<Private...@>=
- static char file_name[20]; /* name of the data file, without a prefix */
- @ The first four lines of a typical data file should look something like this:
- $$halign{hskip5em.{#}hfillcr
- * File "words.dat" from the Stanford GraphBase (C) 1993 Stanford Universitycr
- * A database of English five-letter wordscr
- * This file may be freely copied but please do not change it in any way!cr
- * (Checksum parameters 5757,526296596)cr}$$
- We actually verify only that the first four lines of a data file named |"foo"|
- begin respectively with the characters
- $$halign{hskip5em.{#}hfillcr
- * File "foo"cr
- *cr
- *cr
- * (Checksum parameters $l$,$m$)cr}$$
- where $l$ and $m$ are decimal numbers. The values of $l$ and~$m$
- are stored away as |tot_lines| and |final_magic|, to be matched at the
- end of the file.
- @<Check the first line...@>=
- sprintf(str_buf,"* File "%s"",f);
- if (strncmp(buffer,str_buf,strlen(str_buf)))
- return (io_errors |= bad_first_line);
- @ @<Check the second line...@>=
- fill_buf();
- if (*buffer!='*') return (io_errors |= bad_second_line);
- @ @<Check the third line...@>=
- fill_buf();
- if (*buffer!='*') return (io_errors |= bad_third_line);
- @ @<Check the fourth line; return if unsuccessful@>=
- fill_buf();
- if (strncmp(buffer,"* (Checksum parameters ",23))
- return (io_errors |= bad_fourth_line);
- cur_pos +=23;
- tot_lines=gb_number(10);
- if (gb_char()!=',')
- return (io_errors |= bad_fourth_line);
- final_magic=gb_number(10);
- if (gb_char()!=')')
- return (io_errors |= bad_fourth_line);
- @* Closing a file. After all data has been input, or should have been input,
- we check that the file was open and that it had the correct number of
- lines, the correct magic number, and a correct final line. The
- subroutine |gb_close|, like |gb_open|, returns the value of
- |io_errors|, which will be nonzero if at least one problem was noticed.
- @<Test the |gb_close| routine; exit if there's trouble@>=
- if (gb_close()!=0)
- exit_test("Bad checksum, or difficulty closing the file");
- @ @<External f...@>=
- long gb_close()
- {
- if (!cur_file)
- return (io_errors |= no_file_open);
- fill_buf();
- sprintf(str_buf,"* End of file "%s"",file_name);
- if (strncmp(buffer,str_buf,strlen(str_buf)))
- io_errors |= bad_last_line;
- more_data=buffer[0]=0;
- /* now the {sc GB_,IO} routines are effectively shut down */
- /* we have |cur_pos=buffer| */
- if (fclose(cur_file)!=0)
- return (io_errors |= cant_close_file);
- cur_file=NULL;
- if (line_no!=tot_lines+1)
- return (io_errors |= wrong_number_of_lines);
- if (magic!=final_magic)
- return (io_errors |= wrong_checksum);
- return io_errors;
- }
- @ There is also a less paranoid routine, |gb_raw_close|, that
- closes user-generated files. It simply closes the current file, if any,
- and returns the value of the |magic| checksum.
- Example: The |restore_graph| subroutine in {sc GB_,SAVE} uses
- |gb_raw_open| and |gb_raw_close| to provide system-independent input
- that is almost as foolproof as the reading of standard GraphBase data.
- @ @d gb_raw_close gb_r_close /* for Procrustean external linkage */
- @(gb_io.h@>=
- #define gb_raw_close gb_r_close
- extern long gb_close(); /* close a GraphBase data file; return 0 if OK */
- extern long gb_raw_close(); /* close file and return the checksum */
- @ @<External f...@>=
- long gb_raw_close()
- {
- if (cur_file) {
- fclose(cur_file);
- more_data=buffer[0]=0;
- cur_pos=buffer;
- cur_file=NULL;
- }
- return magic;
- }
- @* Index. Here is a list that shows where the identifiers of this program are
- defined and used.