Dumb Full Text Search
=====================

This is a full-text indexing and search engine for my email, in about
300 lines of code.  It can produce results for many (most?) queries on
my 12-gigabyte mailbox in under two seconds on my 700MHz laptop.
(Which sometimes reduces its clock rate to 550MHz.)  I just updated
the index after not doing so for three months, and it took about three
hours to index the two new gigabytes of email.  It manages to index
about 24 megabytes per minute, which is a bit faster than I
can download my email over my internet connection.

From [Earl](http://earl.strain.at/space/comments-2008-12-23):

> 8m18s to index a 1.3 GB mbox on my dual-core 2.6GHz notebook. The
> resulting index is 1.2GB according to `du` and spread over 27
> segments.
> 
> Without any post-processing, most queries take between 6 to 10
> seconds wall clock on the same machine, as measured by `time fts
> mbox query > /dev/null`. Query time doesn't change much if I store
> the index in memory (on a ramfs), but I've 4GB memory here, so the
> index was in a fs cache even without ramfs.

Perhaps unsurprisingly, that’s about 6× faster at indexing than my
machine.  I have some hope that he’ll get better query performance
with some changes I’ve made since then.

Feature list
------------

- full-text email search
- incremental reindexing
- fielded search on arbitrary mail headers
- partial-word search
- multi-term search
- extreme internal simplicity
- all in Python except for a bash script

Applicability
-------------

Obviously such a simple system is somewhat limited in its
applicability.  Here are some things that must be true in order to use
it:

1.  The entire corpus to be searched is in a single big Berkeley mbox
    file.
2.  As with `grep` or `mboxgrep`, the objective is to get all of the
    results, without ranking them in terms of relevance.
3.  The corpus size is in the tens-of-megabytes to tens-of-gigabytes
    range.
4.  The mbox file is updated only by appending onto the end.  (This is
    a reasonable assumption given #1 and #3.)
5.  Given #3, incremental reindexing is desirable.  Given #4, it is
    feasible.
6.  You have enough disk space for an index that is the same size as
    the original mbox file.
7.  The mail and the index are stored on a spinning-rust
    electromechanical disk, on which non-sequential reads imply a
    delay of 5–20 milliseconds, not a solid-state Flash disk (“SSD”)
    where the corresponding delay is around 0.1 milliseconds.  (It
    should still work fine on a Flash disk but it could be a lot more
    efficient.)
8.  You don’t care about full-text search finding words in attachments
    or base64-encoded message bodies, or partially-encoded words in
    quoted-printable-encoded message bodies, or encoded message headers.
9.  You’re on a Unix system, such as Linux or Mac OS X.
10. You don’t mind manually administering the index.
11. The index is being used to answer interactive queries on a
    single-user machine with a single-user corpus, so the thing to
    optimize for is query latency when the index is on disk, not, say,
    query throughput or query latency with an in-RAM index.

Architecture
------------

It stores a bunch of index segments in an index directory; each one is
a sorted list of postings.  Off to the side of each index segment,
there is a “skip file” which is smaller, about 1400 times smaller in
my case, although that’s a tunable parameter; it contains about one
out of every 1400 or so postings from the index segment, along with
the position where it can be found in the index segment.  Index
segments are generated at some manageable size and then merged into
new, larger index segments later on.

If this design sounds familiar, it’s probably because it’s exactly
like Lucene.

So, to find a term in a 2.0-gibibyte index segment, we start reading
through the skip file (1.3 mebibytes in my example case), and stop as
soon as we find something later than the term, on average a little
over halfway through.  At that point we open the index segment, seek
to the last posting mentioned in the skip file before the term we’re
seeking, and then read sequentially through the index file until we
find all the postings we’re looking for.

Most terms have relatively few postings — in fact, the vast majority
have only a single posting, although those tend not to get searched
for — so we will usually have to read less than 1400 postings from the
index segment, maybe a bit over 700 on average.

Now, actually, the above is a bit of a lie.  Because this skip file is
so large, it has a skip file of its own, which is only 1219 bytes.  So
rather than plow through the 1.3 mebibytes of the skip file looking
for the term, we read through (about half of) the 29 entries in the
skip file’s skip file, then read through up to 1400 or so entries
somewhere in the middle of the main skip file, and then read through
up to 1400 or so entries in the index segment itself.  So that’s three
seeks and a bit over 1400 entries of I/O, totaling about 50 kilobytes.
My laptop disk transfers almost 9 megabytes per second, so that’s
about 6ms of I/O bandwidth.

So to look up a single rare term in a single index segment is 6ms of
data transfer, 30ms of seeking, and processing 1400 lines in Python,
so somewhere around 30ms.  To look up N rare terms in M index files is
then N×M×30ms.  Right now I have 10 index segments.
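
In code, the per-segment lookup might look something like the sketch
below.  This is not the actual `skiplook.py`; it assumes both skip
levels exist and that each skip entry is a posting followed by the
byte position of that posting in the next file down.

    import os

    # A rough sketch of the lookup described above, not the actual
    # skiplook.py.
    def skip_name(segment_path, levels=1):
        """Path of the skip file (levels=1) or the skip file's skip file
        (levels=2) for an index segment."""
        head, tail = os.path.split(segment_path)
        return os.path.join(head, '.' + 'skip.' * levels + tail)

    def last_offset_at_or_before(path, term, start=0):
        """Scan a skip file from `start`; return the position recorded
        in the last entry whose key sorts at or before `term`."""
        offset = 0
        with open(path) as f:
            f.seek(start)
            for line in f:
                key, pos = line.rsplit(None, 1)
                if key > term:
                    break
                offset = int(pos)
        return offset

    def postings(segment_path, term):
        """Yield the postings in one index segment that begin with `term`."""
        pos = last_offset_at_or_before(skip_name(segment_path, 2), term)       # ~29 entries
        pos = last_offset_at_or_before(skip_name(segment_path, 1), term, pos)  # up to ~1400 entries
        with open(segment_path) as segment:
            segment.seek(pos)
            for line in segment:                                               # up to ~1400 postings
                if line.startswith(term):
                    yield line.rstrip('\n')
                elif line > term:
                    break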

Once the posting lists have been fetched, they are intersected to get
a list of search hits, which are messages that are then fetched from
the mailbox file.
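
The intersection is the hash join mentioned under “Bugs” below; in
spirit it amounts to something like this sketch (illustrative names,
not the program’s actual code):

    # Illustrative sketch of intersecting posting lists by mailbox offset;
    # the last whitespace-separated field of each posting is the offset.
    def matching_offsets(posting_lists):
        """Return, sorted, the byte offsets present in every posting list."""
        surviving = None
        for plist in posting_lists:
            offsets = set(int(p.rsplit(None, 1)[1]) for p in plist)
            surviving = offsets if surviving is None else surviving & offsets
        return sorted(surviving or ())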

Looking up common terms is more expensive, because the number of
postings to fetch and process can far exceed the ~1400 postings we
would otherwise sift through just to locate them.  For example,
something like 0.7% of my index
files consist of postings for the word “from”, which occurs in every
mail message.  In the 2-gibibyte segment I was using as an example,
there are presumably about 600 000 postings for the word “from”,
consuming 14 mebibytes.  Reading 14 mebibytes from disk is going to
take almost two seconds by itself, let alone the time to parse them.

The index segments are plain ASCII text files (as long as the corpus
is plain ASCII text), with one posting per line: first the lowercase
version of the word itself, then the name of the header field in which
the word was found (if any), then the byte offset in the mailbox where
the message containing the posting begins.  Here are some example
postings from one of my index segments:

    headers 98697592
    headers 99125817
    headers x-asf-spam-status 47638180
    headhunter 59154903
    headhunter 65973658
    headhunters 110711591

This means that, for example, the message in my mailbox starting at
byte 47638180 contains the word “headers” in the message header
“X-ASF-Spam-Status”, probably as part of the phrase
“tests=MISSING_HEADERS”.
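
For illustration, postings in this format could be produced for one
message roughly as follows; the tokenization rule and the function
name are assumptions, not the actual `indexmail.py`:

    import re

    WORD = re.compile(r'[a-z0-9]+')   # assumed tokenization rule

    # Illustrative only; the real indexmail.py's tokenization and header
    # handling may differ.  `offset` is the byte at which the message
    # begins in the mbox file.
    def postings_for_message(offset, headers, body):
        """Yield posting lines for one message; `headers` is a list of
        (name, value) pairs and `body` is the message body text."""
        for name, value in headers:
            for word in WORD.findall(value.lower()):
                yield '%s %s %d' % (word, name.lower(), offset)
        for word in WORD.findall(body.lower()):
            yield '%s %d' % (word, offset)

Sorting the resulting lines byte-wise (as `LANG=C sort` does) yields
an index segment.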

Usage
-----

To generate new index segments and skip files for a mailbox:

    $ indexmail.py ~/mail/mailbox

To find all messages containing a word beginning with “curt”:

    $ fts ~/mail/mailbox curt | less

To find all messages containing the word “curt”:

    $ fts ~/mail/mailbox 'curt ' | less

To find all messages containing the word “curt” in the “From:” header:

    $ fts ~/mail/mailbox 'curt from ' | less

To find all messages containing the words “Curt” and “Martindale” in
the “From:” header:

    $ fts ~/mail/mailbox 'curt from ' 'martindale from ' | less

But you have to manually administer the index to get anything
approaching reasonable performance.

The index is in `~/mail/mailbox.idx`.  New index segments are created
in this directory with names like “.new-index.segment.6244.1” and
atomically renamed to “index-segment.6244.1” once they have been
generated.  Skip files for them, which are generated automatically,
should be in files with the same name, but with
“.skip.” at the beginning of their names, like
“.skip.index-segment.6244.1”.  Sometimes skip files have skip files 
of their own, with names like “.skip.skip.index-segment.6244.1”.

If you have a whole bunch of index segments and it’s slowing down your
searches, you should merge them.  Any file in the index directory
whose name doesn’t begin with a “.” will be used as an index segment.

    $ LANG=C sort -m ~/mail/mailbox.idx/index-segment.* \
                  -o ~/mail/mailbox.idx/.new.merged-segment-serno
    $ mv ~/mail/mailbox.idx/.new.merged-segment-serno \
         ~/mail/mailbox.idx/merged-segment-serno
    $ rm ~/mail/mailbox.idx/index-segment.*

In the third step, make sure you’re only deleting the
`index-segment.*` files that you merged together in the first step.

Then you need to generate skip files for the new merged segments.
To generate all needed skip files
automatically:

    $ buildskips.py 50000 ~/mail/mailbox.idx/*

If you want to build a single skip file by hand (though I don’t think
this is necessary any more):

    $ sparse.py 50000 ~/mail/mailbox.idx/merged-segment-serno > \
                      ~/mail/mailbox.idx/.skip.merged-segment-serno

The number “50000” is a tunable parameter called “chunksize”.  50000
is a reasonable value.  I used to use 200000, but 50000 gives me much
better performance.  12500 gives about the same performance as 50000
(more disk, less CPU) but takes a lot longer to build the skipfiles.
50000 is the `chunksize` that `indexmail.py` uses when it initially
generates skipfiles.

There’s a file in the index directory called `.indexed-up-to` that
tells how much of the mailbox file has been indexed.
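
A minimal sketch of how that enables incremental reindexing, given the
append-only assumption from the “Applicability” list (hypothetical
helper names; the real `indexmail.py` may do this differently):

    import os

    # Hypothetical helpers; the real indexmail.py may differ in detail.
    def unindexed_range(mailbox_path):
        """Return (start, end) byte offsets of the not-yet-indexed tail
        of the mbox, relying on the mbox only ever being appended to."""
        upto = os.path.join(mailbox_path + '.idx', '.indexed-up-to')
        try:
            with open(upto) as f:
                start = int(f.read())
        except IOError:
            start = 0
        return start, os.path.getsize(mailbox_path)

    def record_indexed_up_to(mailbox_path, offset):
        """Record that everything before `offset` is now covered by some
        index segment (presumably best written only after the new segment
        has been atomically renamed into place)."""
        with open(os.path.join(mailbox_path + '.idx', '.indexed-up-to'), 'w') as f:
            f.write('%d\n' % offset)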

Tuning
------

Chunksize is the approximate number of index segment bytes to go
between skip file entries.  It should be in the neighborhood of the
product of your disk’s access latency (seek time plus rotational
latency) and its read bandwidth.  So if your access latency is 8ms and
your read bandwidth is 9MB/s, it should be in the neighborhood of 8 ×
9000 = 72000.  Making it larger will give you smaller skip files;
making it smaller will give you larger skip files.  If your skip files
get too big, you can make skip files `.skip.skip.index-segment.6244.1`
for the skip files, and so on ad infinitum.  But every level of skip
files costs a seek.
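
As illustrative arithmetic only, the rule of thumb above is just:

    # Illustrative arithmetic for the chunksize rule of thumb.
    access_latency_ms = 8                # seek time plus rotational latency
    read_bandwidth_bytes_per_ms = 9000   # about 9 MB/s
    chunksize = access_latency_ms * read_bandwidth_bytes_per_ms   # 72000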

Merging all of your index segments into a single big index segment
will always give you better query performance, but it takes a long
time and can take a lot of disk space for the intermediate results.
Trying to merge too many segments at once could in theory make the
merge slower, but I think that problem only appears once the number of
segments approaches (RAM size / chunksize); these days that means you
should be able to merge tens of thousands of index segments at once,
so it’s probably a problem of the past.

Probably the best merging policy is something resembling the
following:

- Delay merging while you’re actively indexing.
- Merge whenever some index segment and all of the smaller index
  segments total more than N times the size of that index segment.  A
  good value for N might be in the range of 4–10.
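
A hedged sketch of that second rule, not the actual `merge.py`:

    # Hedged sketch of the merge rule above, not the actual merge.py.
    def suggest_merge(segment_sizes, N=5):
        """Given index-segment sizes in bytes, return the sizes that
        should be merged together (largest first), or [] if no merge
        is warranted yet."""
        for size in sorted(segment_sizes, reverse=True):
            mergeable = [s for s in segment_sizes if s <= size]
            if sum(mergeable) > N * size:
                return sorted(mergeable, reverse=True)
        return []

With N=5, for example, six 50MB segments would trigger a merge
(300MB > 250MB), but a 400MB segment alongside a single 50MB segment
would not.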

I think this keeps the number of index segments to a maximum of
something like 1 + (N-1) * log(indexsize / min_index_segment_size) /
log(N).  “min_index_segment_size” is normally about 50MB.  If
“indexsize” is 100GB, that ratio is 2000; with an N of 4, the ratio of
the logarithms is under 6, so the maximum number of index segments is
19.  With an N of 10, the maximum increases to 37.  In either case,
the usual number of index segments is much smaller.
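
That arithmetic can be double-checked in a few lines, rounding the
ratio of the logarithms up to a whole number as above:

    from math import ceil, log

    # Checking the bound above: 100GB index, 50MB minimum segment size.
    def max_segments(index_size, min_segment_size, N):
        ratio = log(index_size / float(min_segment_size)) / log(N)
        return 1 + (N - 1) * int(ceil(ratio))

    # max_segments(100e9, 50e6, 4)  == 19
    # max_segments(100e9, 50e6, 10) == 37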

The tradeoff here is that a smaller N means you spend more time
merging, inversely proportional to log(N).  In a 100GB index with N=4,
each posting will pass through about 6 merges on the way to the final
index, for a final I/O cost of 1100GB.  If N=10, it need only pass
through about 3 merges.

The program `merge.py` examines a set of index segments and suggests
such a merge, using N=5, if possible.

A smarter merging policy might perform smaller merges more eagerly,
bigger merges more reluctantly, and perform merges more eagerly when
the total number of index segments is large.

Bugs
----

- It doesn’t automatically merge indices.
- It doesn’t do any kind of query optimization to avoid fetching
  millions of useless postings for common words.  This limits its use
  to corpuses of a few million documents at most.
- Given the index structure, at least for full-word searches or
  full-word fielded searches, it would be straightforward to use a
  sort-merge join to produce results incrementally.  Instead it uses a
  hash join because that’s like 15 lines of Python and doesn’t require
  making a duplicate non-fielded posting of words that occur in
  headers.
- It should generate skip files as it writes the index segment, rather
  than requiring another read of the index segment to do that.
- It should use `gzip` or something to reduce the size of the index
  files.  Gzipping a 129MB segment on my machine takes about 135 CPU
  seconds, compressing it to 33MB, but gunzipping it takes only about
  6 CPU seconds.  So the 270kiB we retrieved in 30ms in the
  “Architecture” scenario would have been compressed to 69kiB, which
  would have taken 13ms to decompress (minus 2ms of saved disk
  bandwidth), slowing down searches by about a 3:4 ratio in exchange
  for using a quarter of the disk space.
- Index-segment merges are “big-bang” affairs, using a lot of
  temporary disk space and a lot of time all at once, and this problem
  gets worse as corpus sizes get bigger.  If index segments output
  from a merge were partitioned into independently-mergeable subfiles
  by key-space subdivision, this work could be spread over many
  smaller, independent merges instead.
- There’s a bug in `skiplook.py` that will prevent successful
  retrieval of the first posting in an index segment.
- It doesn’t have an option to return results from the part of the
  mailbox that hasn’t been indexed yet.
- The information about which part of the mailbox has been indexed is
  stored separately from the index segments themselves.  This means
  you can’t, for example, delete a corrupt index segment in order to
  recreate it; you can’t generate index segments on a fast machine and
  transmit them to a slow machine; you can’t partition index-segment
  generation across a cluster with any kind of reasonable failure
  handling; you can’t detect whether coverage is duplicated between a
  merged segment and some input segment you failed to delete.  This
  coverage information ought to be stored in the index segments
  themselves.

Provenance
----------

The awesomely simple data format is due to Dave Long.

The overall structure that allows it to be efficient is Lucene’s,
which is Doug Cutting’s design, although of course Lucene differs from
this software by not being stupid.

Kragen Javier Sitaker wrote the software in 2008 and 2009.  It is dedicated to
the public domain.

Misc
----

Document made with Markdown.



