libgist User Manual

Table of Contents


This document explains how to use the libgist interfaces to create, manage and search indices using existing libgist extensions (B-tree and R-tree extensions are included in the distribution). Extensions implement the customizable parts of the GiST index structure and are used to "teach" a GiST how to behave like a particular access method. If you need to write your own extension, please refer to the Guide to writing libgist extensions for the details. Libgist for Application Programmers in this User Manual explains, in general terms, how to use the methods of the core libgist application-oriented classes.  Using the B-Tree Extension explains the specific details of the B-tree extension and gives an example of how to use it, Using the R-Tree Extension does the same for the R-tree extension and Summary of Classes and Header Files contains a list of the classes and their header files.

Libgist for Application Programmers

The central class for libgist application programmers is gist_m, which is used to create and interact with GiST index objects. Libgist stores each GiST index in its own operating system file. Objects of class gist_m, when connected to an index file, allow the application to manipulate and query the corresponding indices through their class methods. The method gist_m::create(filename, extension) creates an empty index file with the given name and connects the gist_m object to that file. All subsequent operations on the index object use the given extension. The index can be destroyed by simply removing the corresponding file. An index object is connected to a previously created index file with gist_m::open(filename, extension). The extension provided as a parameter must be the same as the one that was previously used to create and update the index. Analogously to gist_m::create(), the given extension is used for all subsequent operations on the index. While an index object is connected to an index file, the most frequently accessed nodes of the index are buffered in main memory to improve performance. To make sure that changes are written back to the index file, two procedures are provided: gist_m::flush() saves the changes to disk, and gist_m::close() additionally removes the connection between the gist_m object and the index file.

A GiST index supports the standard index update mechanisms, insertion and deletion of keys. A new (key, data pointer) pair is inserted into the index with gist_m::insert(key, keylen, data, datalen). The key and data parameters point to memory locations that hold the key and data pointer, the format of which is defined by the extension used with the index object at hand. The libgist core classes themselves do not interpret either datum, but they need to know their sizes in bytes for purposes of storage management, which explains the keylen and datalen parameters. Index entries are deleted with gist_m::remove(query), which requires a query (more on these in the next paragraph) that describes the keys of the target entries. Libgist does not support single-entry deletion other than through a user-constructed query that will retrieve a single entry, because the equality selection predicate required to locate the entry is not part of the standard GiST interface.

Libgist supports searches on index trees through the well-known iterator interface (or cursor interface, to put it in SQL-compliant terms). Before running a query, the application must initialize a gist_cursor_t object with the query it wants to run (gist_m::fetch_init(cursor, query)). As with the keys in the index entries, the format of a query and its interpretation is defined by the particular extension used. The (key, data pointer) pairs from the index entries which match the query are returned to the application through gist_m::fetch(key, keylen, data, datalen, eof). The key and data parameters must point to application-allocated storage space that is large enough to hold whatever is in the index entry. The actual size of the key and data pointer in bytes is returned through keylen and datalen. The maximum sizes are usually dictated by the particular extension that is used for the index tree (but cannot exceed gist_p::max_tup_sz, the maximum tuple size dictated by the libgist implementation). The output parameter eof is set to true after the last matching entry has been fetched (the gist_m::fetch() call that retrieves the last matching item will still have eof set to false).

There are a couple of ancillary interface procedures that make debugging new extensions a little easier. Firstly, it is possible to check that the bounding predicate on each page contains every key on the page; gist_m::check() performs this check and reports any violations to stderr. Secondly, you can inspect the content of an index tree with gist_m::dump(pageno) by dumping the entire tree (pageno = 0) or selected pages (pageno = desired page). The report on stdout includes page statistics (how many slots used, how many bytes free etc.), the bounding predicate and the keys and child/data pointers of all the entries.

Using the B-Tree Extension

The B-tree extension available as part of the libgist distribution implements standard B+-trees, which maintain entries in sorted order on each index page and compress the key interval BPs (used as the keys in entries on internal nodes) to their left boundary. Node entries are sorted on their keys and entries with duplicate keys are further sorted on their data pointers.

The extension has two classes: bt_query_t, which defines the query operators implemented for B-trees (the usual =, <, <=, >, >= and a range operator like SQL BETWEEN), and bt_ext_t, which is the class for B-tree extension objects (there's a third class, gist_ubt_ext_t, which implements "unordered" B-trees in the purist GiST tradition; it is only provided as an example of how to write extensions, not as an alternative to ordered B-trees, since unordered B-trees are strictly inferior to regular B+-trees in terms of performance).

Currently, the B-tree extension contains two (statically defined) bt_ext_t instances: one for integers (bt_int_ext) and one for strings (bt_str_ext). The data pointers expected by either extension object are 4-byte integers. Here's a short example that creates an index stored in file "btree-file", inserts two integers with integer data pointers and then fetches all entries in the interval [4, 6].

gist_m bt_index;
(void) bt_index.create("btree-file", &bt_int_ext);
int key1 = 5;
int data1 = 1;
int key2 = 3;
int data2 = 2;
(void) bt_index.insert((void *) &key1, sizeof(int), (void *) &data1, sizeof(int));
(void) bt_index.insert((void *) &key2, sizeof(int), (void *) &data2, sizeof(int));
bt_query_t q(bt_query_t::bt_betw, new int(4), new int(6));
gist_cursor_t cursor;
bt_index.fetch_init(cursor, &q);
bool eof = false;
int key, keysz, data, datasz;
while (!eof) {
    (void) bt_index.fetch(cursor, key, keysz, data, datasz, eof);
    if (!eof) // process key and data...
Creating a new B-tree extension in the form of a bt_ext_t object for a particular datatype requires a few datatype-specific procedures (click here to see the bt_ext_t constructor):

Using the R-Tree Extension

The R-tree extension available with libgist implements standard R-trees as described in the original paper by A. Guttman; no attempt at R*-tree-style reinsertion is made. Its splitting algorithm is quadratic (first, it finds the most distant items, then it clusters the remaining entries on the page around the two distant seeds).

At the heart of the R-tree extension is the extension class, rt_ext_t, of which there are two extension objects: rt_point_ext for point data and rt_rect_ext for rectangle data. Both of these extensions can handle data of arbitrary dimensionality. Points are stored on the page as arrays of doubles, where the length of the array corresponds to the dimensionality of the point. Rectangles are stored as arrays of intervals, which are each composed of two double boundaries. The data pointers are 4-byte integers for either of these extension objects. The class rt_query_t defines the operators supoorted by the R-tree extension, which are: contained(), contains(), equal(), overlaps(). There are two ancillary classes, rt_point and rt_rect, which are used internally as the in-memory representation of points and rectangles and which also function as parameters to queries (see the example further down the page). Creating an R-tree extension object for datatypes other than points and rectangles is not possible in the current implementation.

Here's an example that inserts two 2-dimensional points into an R-tree, runs a query on them and afterwards deletes the retrieved items.

gist_m rt_index;
rt_index.create("rtree-file", &rt_point_ext);
double key1[] = {50.0, 60.0};
int data1 = 1;
double key2[] = {10.0, 20.0};
int data2 = 2;
rt_index.insert((void *) key1, sizeof(key1), (void *) &data1, sizeof(data1));
rt_index.insert((void *) key2, sizeof(key2), (void *) &data2, sizeof(data2));
double rect1[] = {5.0, 15.0, 15.0, 25.0};
rt_rect r1(2, rect1);
rt_query q(rt_query_t::rt_overlap, rt_query_t::rt_rectarg, (void *) &r1);
gist_cursor_t cursor;
rt_index.fetch_init(cursor, &q);
bool eof = false;
rt_point key(2);
int keysz, datasz, data;
while (!eof) {
    (void) rt_index.fetch((void *) key.coord, keysz, (void *) &data, datasz);
    if (!eof) // process key and data...

Summary of Classes and Header Files

Class Header File Description 
gist_m gist.h GiST index class
gist_cursor_t gist.h GiST query cursor class
bt_ext_t gist_btree.h B-tree extension class
rt_ext_t gist_rtree.h R-tree extension class

Appendix A: gist.h

class gist_cursor_t;
class gist_lstk;
class gist_ustk;
class gist_p;
class gist_ext_t;

class gist_m {



    // create an empty Gist
    rc_t                        create(
        const char*                 filename,
        gist_ext_t*                 ext);

    // open an existing Gist
    rc_t                        open(
        const char*                 filename,
        gist_ext_t*                 ext);

    rc_t                        close();
    rc_t                        flush(); 

    // insert a single key
    rc_t                        insert(
        const void*                 key, 
        const int                   keyLen,
        const void*                 dataPtr,
        const int                   dataPtrLen);

    // remove all keys matching the query
    rc_t                        remove(
        const void*                 query);

    // initialize the cursor
    rc_t                        fetch_init(
        gist_cursor_t&              cursor,
        const void*                 query);

    // fetch the next data item
    rc_t                        fetch(
        gist_cursor_t&              cursor,
        void*                       key,
        smsize_t&                   keyLen,
        void*                       dataPtr,
        smsize_t&                   dataPtrLen,
        bool&                       eof);

    // returns true if tree contains no data
    bool                        is_empty();

    // checks structural and data integrity of the tree;
    // outputs violations to stderr
    rc_t                        check();

    // dumps the content of a page to stdout;
    // if 0 is specified as the pgno, dumps out the tree
    rc_t                        dump(
        shpid_t                     pgno = 0);


Appendix B: gist_btree.h

class bt_query_t {

    enum bt_oper {
        bt_betw, // SQL BETWEEN operator

    bt_query_t(bt_oper oper, void *val1, void *val2) : oper(oper), val1(val1), val2(val2) {}
        if (val1 != NULL) delete val1;
        if (val2 != NULL) delete val2;

    bt_oper oper;
    void *val1;
    void *val2; // only used as upper interval bound for bt_betw

class bt_ext_t : public gist_ext_t {

    // generic comparison function
    typedef int (*CmpFct)(const void *a, const void *b);
    CmpFct keyCmp; // for keys
    CmpFct dataCmp; // for data pointers

    // returns the size in bytes of a key stored on a disk page;
    // the goal is to get to the data pointer portion of an internal entry,
    // given the pointer to the key portion
    typedef int (*SizeFct)(const void *e);
    SizeFct size;

    // for printPred()
    typedef void (*PrintFct)(const void *pred);
    PrintFct prPred, prData;

    // for pickSplit() (creating new BPs)
    typedef void (*NegInftyFct)(void *x);
    NegInftyFct negInftyKey, negInftyData;

        CmpFct keyCmp,
        CmpFct dataCmp,
        SizeFct size,
        PrintFct prPred,
        PrintFct prData,
        NegInftyFct negInftyKey,
        NegInftyFct negInftyData);

    rc_t insert(
        gist_p& page,
        const vec_t& key,
        const vec_t& data,
        shpid_t child);

    rc_t remove(
        gist_p& page,
        int slots[],
        int numSlots);

    rc_t updateKey(
        gist_p& page,
        int& slot,
        const vec_t& newKey);

    void findMinPen(
        const gist_p& page,
        const vec_t& key,
        const vec_t& data,
        int& slot);

    void search(
        gist_p& page,
        const void* query,
        int matches[],
        int& numMatches);

    void getKey(
        gist_p& page,
        int slot,
        vec_t& key);

    rc_t pickSplit(
        gist_p& page,
        int rightEntries[],
        int& numRight,
        const vec_t& oldBp,
        vec_t& leftBp,
        vec_t& rightBp,
        const vec_t& entry1,
        bool& oneGoesRight,
        const vec_t& entry2,
        bool& twoGoesRight);

    void unionPage(
        gist_p& page,
        vec_t& bp,
        bool& bpChanged); 

    void unionKey(
        vec_t& bp,
        int level, 
        const vec_t& pred,
        bool &bpChanged);

    bool check(
        const vec_t& bp,
        const vec_t& pred,
        int level);

    void printPred(
        const vec_t& pred,
        int level);

    void printData(
        const vec_t& data);

extern bt_ext_t bt_int_ext;
extern bt_ext_t bt_str_ext;

Appendix C: gist_rtree.h

class rt_point;

class rt_rect {

    rt_rect(int d); // preset with 0 coords
    rt_rect(int d, const double *coord); // coords left as-is
    rt_rect(int d, double *coord); // coords left as-is, can be changed
    rt_rect(rt_rect &r);

    void dealloc();
    bool isEqual(rt_rect & k);
    int dim() { return dimension; }
    double span();
    double margin();
    double dist(rt_rect *s);
    double & lo(int i);
    double & hi(int i);

    // expanding this rect to include another point or rect
    // return TRUE if successful, FALSE otherwise
    bool expand(rt_point & pt);
    bool expand(rt_rect & rect);

    // restricting this rect to its overlap with another rect
    // return TRUE if successful, FALSE otherwise
    bool intersect(rt_rect & rect);

    // test relationship with another rect or point:
    bool overlaps(rt_rect & rect);
    bool contains(rt_rect & rect);
    bool contains(rt_point & pt);

    int dimension; 
    double *coord;

class rt_point {

    rt_point(int d);
    rt_point(int d, double *coord) : dimension(d), coord(coord) {}
    rt_point(int d, const double *coord) : dimension(d), coord((double *) coord) {}

    // accessors:
    double & operator [] (int i);

    void dealloc();
    bool isEqual(rt_point & k);
    int dim();
    double span();
    double margin();
    double dist(rt_point *s);
    double & crd(int i);

    // test relationship with another point or rect
    bool contains(rt_rect & rect);
    bool contains(rt_point & pt);

    int dimension;
    double *coord;  // keep coordinates in all dimensions

class rt_query_t {
    enum rt_oper {

    enum rt_arg {

    // assumes that points and rects were created with new(dim), not new(dim, coord)!
    rt_query_t(rt_oper oper, rt_arg arg, void *val) : oper(oper), argType(arg), value(val) {}
    ~rt_query_t() ;

    rt_oper oper;
    rt_arg argType;
    void *value;

class rt_ext_t : public gist_unordered_ext_t {

    const int numLvl = 2; // only 2 levels to distinguish: leaf/non-leaf

    typedef bool (*CmpFct)(const void *, int, const void *);
        // any point/rect comparison function

    // function table: one for each operator, possible argument type
    // and leaf/internal level
    typedef CmpFct CmpFctTbl[numLvl][rt_query_t::rt_numarg][rt_query_t::rt_numoper];

    CmpFctTbl cmpFcts; // for consistent()

    // for penalty()/pickSplit()/union_key():
    // expand rect with rect/pt
    typedef void (*ExpandFct)(rt_rect &r, const void *item, int len); 
    ExpandFct expand;

    // for pickSplit()
    typedef double (*DistanceFct)(const void *item1, const void *item2, int len);
    DistanceFct dist; // compute the distance between two rects/pts
    typedef int (*DimFct)(int dataSz); // returns dimension of points/rects
    DimFct dim;
    typedef void (*RectFct)(rt_rect &r, const void *item, int len);
    RectFct copyRect; // creates rect from pt/rect

        CmpFctTbl tbl,
        ExpandFct exp,
        DistanceFct distance,
        DimFct dim,
        RectFct rf);

    bool consistent(
        const void  *query,
        const void *pred,
        int predLen,
        int level);

    void penalty(
        const void *subtreePred,
        int predLen,
        int level,
        const void *newKey,
        int keyLen,
        gist_penalty_t &p);

    void union_key(
        void *bp,
        int &len,
        int level,
        const void *newPred,
        int newLen,
        bool &changed);

    void union_page(
        void *bp,
        int &len,
        int level, 
        gist_predcursor_t &pcursor,
        bool &changed);

    void pickSplit(
        gist_predcursor_t &pcursor,
        int level,
        int rightEntries[] /*out*/,
        int &numRightEntries /*out*/,
        const void *oldBp,
        int oldLen,
        void *leftBp /*out*/,
        int &leftLen /*in/out*/,
        void *rightBp /*out*/,
        int &rightLen /*int/out*/);
    bool check(
        const void *bp, 
        int bplen, 
        const void *pred, 
        int predlen, 
        int level); 

    void printPred(
        const void *pred,
        int plen,
        int level); 

    void printData(
        const void *data,
        int dlen);


extern gist_unorderedn_ext_t rt_point_ext;
extern gist_unorderedn_ext_t rt_rect_ext;

Comments, questions and suggestions may be directed to

Last modified: $Date: 2001/03/21 22:01:41 $ by $Author: mct $.