semi_index
==========

This library implements the algorithm described in [*Semi-Indexing
Semi-Structured Data in Tiny Space*](http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf).

The aim is to speed up the processing of large collections of JSON
documents when only a small subset of the values in each document is
needed at each scan. An example is given by MapReduce jobs on
semi-structured logging data, where usually two or three values are
needed but the average document size can be in the hundreds of
kilobytes. Another example is the creation of indexes in document
databases such as CouchDB and MongoDB.

The main idea is, given a collection of JSON documents, to generate an
auxiliary file that encodes the structure of the documents, so that in
the parsing phase the parser can be pointed directly at the positions
where the values are, without having to parse the whole document. Such
an auxiliary file is called a *semi-index*.

This auxiliary file can be created once and used in all subsequent
scans. By using *succinct data structures* the size of the auxiliary
file can be kept relatively small compared to the size of the
collection.

The reasons for the speed-up are two-fold:

* Since the structure of the document is known, the parser can be
  pointed to the position in the file where the values we are
  interested in are located. This way a considerable amount of I/O can
  be saved. 
* For the same reason, there is no need to perform a full parsing of
  the document into a document tree. This reduces both CPU time and
  memory usage.
  
  
Depending on the characteristics of the documents and how the
semi-index is used, the speedups reported in the paper range between
2x and 12x.
  
Usage example: `json_select`
----------------------------

An example application is provided with the library: `json_select`. 

`json_select` scans a collection of JSON documents given as lines of a
text file, extracts a set of given attributes and outputs them as a
collection of JSON lists.

For example consider the file `example.json`, which is a sample from
Wikipedia edits log.

    $ head -3 example.json 
    {"comment": "content was: '[[Media:Example.og[http://www.example.com link title][http://www.example.com link title]''Italic text'''''Bold text'''jjhkjhkjhkjhkjhjggghg]]'", "logtitle": "Vivian Blaine", "timestamp": "2004-12-23T03:20:32Z", "action": "delete", "params": "", "contributor": {"username": "Slowking Man", "id": 56299}, "type": "delete", "id": 1}
    {"comment": "{{GFDL}} {{cc-by-sa-2.0}}", "logtitle": "File:Mini Christmas tree.png", "timestamp": "2004-12-23T03:24:26Z", "action": "upload", "params": "", "contributor": {"username": "Fredrik", "id": 26675}, "type": "upload", "id": 2}
    {"comment": "content was: 'Daniel Li is an amazing human being.'", "logtitle": "Daniel Li", "timestamp": "2004-12-23T03:27:51Z", "action": "delete", "params": "", "contributor": {"username": "Slowking Man", "id": 56299}, "type": "delete", "id": 3}

If we want to extract the attributes `logtitle` and `contributor.id`
from each document, we can use the following command:

    $ ./json_select naive_parse_stream logtitle,contributor.id < example.json | head -3
    ["Vivian Blaine",56299]
    ["File:Mini Christmas tree.png",26675]
    ["Daniel Li",56299]
    
`naive_parse_stream` works by loading each document and parsing it
with [JsonCpp](http://jsoncpp.sourceforge.net/).
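
For reference, the core of this strategy fits in a few lines. The
following is an illustrative sketch built on JsonCpp's classic
`Json::Reader`/`Json::FastWriter` API, not the actual `json_select`
source; the hard-coded attribute paths stand in for the command-line
argument:

    #include <iostream>
    #include <string>
    #include <json/json.h>  // JsonCpp

    int main() {
        Json::Reader reader;
        Json::FastWriter writer;
        std::string line;
        // Each input line is a document: parse it in full, then keep only
        // the two requested attributes. This full parse is exactly the
        // cost that the semi-index avoids.
        while (std::getline(std::cin, line)) {
            Json::Value doc;
            if (!reader.parse(line, doc)) continue;  // skip malformed lines
            Json::Value out(Json::arrayValue);
            out.append(doc["logtitle"]);
            out.append(doc["contributor"]["id"]);
            std::cout << writer.write(out);  // write() appends a newline
        }
        return 0;
    }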

To show the speed-up that can be obtained by using a semi-index, it is
better to use a bigger file. We use `wp_history.json` from the
[datasets used in the paper](http://www.di.unipi.it/~ottavian/json_datasets.tar.bz2).

We get the id, title and timestamp of the last contribution from each
page.

    $ sudo ./drop_caches.sh && time ./json_select naive_parse_stream id,title,revision[-1].timestamp < wp_history.json > /dev/null

    real	1m0.408s
    user	0m34.630s
    sys	0m3.260s

`drop_caches.sh` drops all the kernel page caches to ensure that the
file is read from disk each time. The script supports only Linux.

Now we create a semi-index on the file with the following command:

    $ sudo ./drop_caches.sh && time ./json_select si_save wp_history.json.si < wp_history.json > /dev/null
    [ ... DEBUG OUTPUT ... ]
    real	0m37.710s
    user	0m8.890s
    sys	0m2.880s

Note that the time needed to create the semi-index is less than the
time needed for a scan+parse!

For this very low-density file the semi-index is negligibly small
compared to the raw collection:

    2.8G	wp_history.json
    9.7M	wp_history.json.si

For more typical files, the overhead is around 10%.

We can now extract the same attributes as before but taking advantage
of the semi-index.

    $ sudo ./drop_caches.sh && time ./json_select saved_si_parse_mapped wp_history.json wp_history.json.si id,title,revision[-1].timestamp > /dev/null

    real	0m11.282s
    user	0m0.030s
    sys 	0m0.380s

Using the semi-index, the extraction is almost 6 times faster than
normal parsing. With a compressor that supports random access on the
JSON file, further speedups are possible thanks to the reduced I/O. See
the source code of `json_select` for the details.

How it works
------------

Semi-indexing is a technique that can be applied to most text
serialization formats (such as XML). The generic scheme is described
in full detail in the paper. Here we give a simple explanation of how
it works when applied to JSON.

The JSON semi-index consists of two binary strings, `pos` and
`bp`, which encode respectively the positions of the structural
elements of JSON (i.e. `{}[]:,`) and the structure of the parse
tree, encoded as a sequence of balanced parentheses.

        {"a": 1, "b": {"l": [1, null], "v": true}}
    pos 110000100100001100001100100000010000100000
    bp  (()()()(()(()())()()))

Note that each `1` in the `pos` string is associated with two
consecutive parentheses in the `bp` string.
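
To make the construction concrete, here is a simplified builder that
reproduces the `pos` and `bp` strings above. This is only a sketch, not
the library's streaming implementation: it assumes a single well-formed
document with no escaped quotes and performs no error handling.

    #include <cctype>
    #include <iostream>
    #include <string>

    // Mark the start of every node in pos and bracket every subtree in bp.
    struct Builder {
        const std::string& s;
        std::string pos, bp;

        explicit Builder(const std::string& json)
            : s(json), pos(json.size(), '0') {}

        size_t skip_ws(size_t i) {
            while (i < s.size() && std::isspace((unsigned char)s[i])) ++i;
            return i;
        }

        // A string (key or scalar value) is a leaf node.
        size_t parse_string(size_t i) {
            pos[i] = '1'; bp += "()";
            return s.find('"', i + 1) + 1;  // index past the closing quote
        }

        size_t parse_value(size_t i) {
            if (s[i] == '"') return parse_string(i);
            pos[i] = '1'; bp += '(';
            if (s[i] == '{') {
                i = skip_ws(i + 1);
                while (s[i] != '}') {
                    i = skip_ws(parse_string(i));  // key node
                    i = skip_ws(i + 1);            // skip the ':'
                    i = skip_ws(parse_value(i));   // value node
                    if (s[i] == ',') i = skip_ws(i + 1);
                }
                ++i;
            } else if (s[i] == '[') {
                i = skip_ws(i + 1);
                while (s[i] != ']') {
                    i = skip_ws(parse_value(i));
                    if (s[i] == ',') i = skip_ws(i + 1);
                }
                ++i;
            } else {  // number, true, false or null
                while (i < s.size() && s[i] != ',' && s[i] != ']' &&
                       s[i] != '}' && !std::isspace((unsigned char)s[i]))
                    ++i;
            }
            bp += ')';
            return i;
        }
    };

    int main() {
        std::string json = "{\"a\": 1, \"b\": {\"l\": [1, null], \"v\": true}}";
        Builder b(json);
        b.parse_value(0);
        // Prints the pos and bp strings shown above.
        std::cout << b.pos << "\n" << b.bp << std::endl;
        return 0;
    }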

One property of `pos` is that it is very sparse, because usually keys
and values are at least a few characters long, hence it is very
compressible. Using an Elias-Fano encoding, a space close to the
information-theoretic optimum can be reached while allowing
efficient random access and powerful operations such as `select`,
which gives the position of the `i`-th `1` bit in the string.
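
As a toy illustration of the encoding (the library relies on the
optimized implementation in the bundled `succinct` submodule; this
sketch trades all efficiency for brevity):

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Toy Elias-Fano: stores n sorted positions from a universe of size u
    // in roughly n * (2 + log2(u / n)) bits. Each value is split into its
    // l low bits, stored verbatim, and its high bits, stored in unary.
    struct EliasFano {
        int l;
        std::vector<bool> high;     // unary-coded upper parts
        std::vector<uint64_t> low;  // fixed-width lower parts

        EliasFano(const std::vector<uint64_t>& v, uint64_t u) {
            size_t n = v.size();
            l = u > n ? (int)std::log2((double)u / n) : 0;
            high.assign(n + (u >> l) + 1, false);
            for (size_t i = 0; i < n; ++i) {
                low.push_back(v[i] & ((1ULL << l) - 1));
                high[(v[i] >> l) + i] = true;
            }
        }

        // select(i): position of the i-th 1 bit of the original bitmap.
        // A linear scan here; real implementations answer it in O(1).
        uint64_t select(size_t i) const {
            for (size_t p = 0, seen = 0; p < high.size(); ++p)
                if (high[p] && seen++ == i)
                    return ((uint64_t)(p - i) << l) | low[i];
            return ~0ULL;
        }
    };

    // E.g., for the `pos` bitmap of the example above:
    //   EliasFano ef({0, 1, 6, 9, 14, 15, 20, 21, 24, 31, 36}, 42);
    //   ef.select(4) == 14  -- the offset of the inner '{'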

On the other hand `bp`, which is very small (its length is twice the
number of ones in `pos`), can be augmented with data structures that
enable quick tree-like navigation. These structures take only a
negligible amount of space compared to the string itself.

Navigation in the JSON object is performed by navigating the tree
represented by `bp` and then retrieving the keys/values from the JSON
document by pointing at the positions obtained from `pos`.
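
Putting the two parts together, looking up a node reduces to a handful
of bitmap operations. Below is a self-contained sketch that locates the
object stored under the key `"b"` in the example above, with naive
linear scans standing in for the succinct rank/select and tree
operations:

    #include <cstddef>
    #include <iostream>
    #include <string>

    // Naive stand-ins for the succinct operations.
    size_t find_close(const std::string& bp, size_t i) {  // match for '(' at i
        for (int d = 0; i < bp.size(); ++i)
            if ((d += bp[i] == '(' ? 1 : -1) == 0) return i;
        return bp.size();
    }
    size_t rank_open(const std::string& bp, size_t i) {   // '(' count before i
        size_t r = 0;
        for (size_t j = 0; j < i; ++j) r += bp[j] == '(';
        return r;
    }
    size_t select1(const std::string& pos, size_t k) {    // k-th '1' in pos
        for (size_t p = 0, seen = 0; p < pos.size(); ++p)
            if (pos[p] == '1' && seen++ == k) return p;
        return pos.size();
    }

    int main() {
        std::string json = "{\"a\": 1, \"b\": {\"l\": [1, null], \"v\": true}}";
        std::string pos  = "110000100100001100001100100000010000100000";
        std::string bp   = "(()()()(()(()())()()))";

        // Walk the root's children in bp: the first child opens at index 1,
        // and each next sibling opens right after its sibling's match.
        size_t node = 1;
        for (int skipped = 0; skipped < 3; ++skipped)  // skip "a", 1, "b"
            node = find_close(bp, node) + 1;

        // The k-th '(' in bp (k-th node in preorder) corresponds to the
        // k-th '1' in pos, whose position is the node's byte offset.
        size_t offset = select1(pos, rank_open(bp, node));
        std::cout << "value of \"b\" starts at offset " << offset  // 14
                  << ": '" << json[offset] << "'" << std::endl;    // '{'
        return 0;
    }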

How to build the code
---------------------

### Dependencies ###

The following dependencies have to be installed to compile the library.

* CMake >= 2.6, for the build system
* zlib
* Boost >= 1.42

Also, the library `succinct` has to be downloaded as a git submodule,
so the following two commands have to be executed *before running
cmake*:

    $ git submodule init
    $ git submodule update

`json_select` and the performance tests also depend on MongoDB and
JsonCpp, but their sources are included in the source tree so they
don't need to be installed.

### Supported systems ###

The code has been developed and tested mainly on Linux, but it has
been tested also on Mac OS X and Windows 7.

The code has been tested only on x86-64. Compiling it on 32-bit
architectures would probably require some work.

### Building on Unix ###

The project uses CMake. To build it on Unix systems it should be
sufficient to do the following:

    $ cmake . -DCMAKE_BUILD_TYPE=Release
    $ make

It is also advised to perform a `make test`, which runs the unit tests.

### Building on Windows ###

On Windows, Boost and zlib are not installed in default locations, so
it is necessary to set some environment variables to allow the build
system to find them.

* For Boost `BOOST_ROOT` must be set to the directory which contains
  the `boost` include directory.
* For zlib `CMAKE_PREFIX_PATH` must be set to the directory that
  contains `zlib.h`.
* Both the directories that contain the Boost and zlib DLLs must be
  added to `PATH` so that the executables can find them.

Once the environment variables are set, the quickest way to build the code is
by using NMake (instead of the default Visual Studio). Run the
following commands in a Visual Studio x64 Command Prompt:

    $ cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release .
    $ nmake
    $ nmake test

