A Guided Tour of Sample Code

The libGiST package is a C++ library that provides the basic GiST functionality. By itself it does nothing, since some of the important methods are pure virtual. However, with a little coding on your part, you can make it index data in a variety of ways, and support just about any query.

We have provided some examples to get you started. The RTree directory contains code which uses libGiST as the basis for a simple R-tree-like storage/retrieval system that indexes 2-dimensional box data, and supports natural geographic queries. The RSTree directory provides the same functionality as the RTree directory, but uses the more efficient (and complex) R*-tree techniques invented by Beckman, et al. [BKSS90]. The BTree directory contains code which uses libGiST to implement a simple B+-tree storage/retrieval system that indexes textual data, to support alphabetic equality and range queries.

As an introduction to libGiST, we will step through the header files for each of these small projects, which will introduce you to increasingly powerful features of libGiST. We will go through the simplest project, RTree, in some detail, and then just describe the features of RSTree and BTree that are different from RTree. Probably the easiest way to use libGiST is to cannibalize some of this sample code, so it's worth looking at carefully.

RTree

R-trees were first proposed by Guttman [Gut84] to index geographic data (points, lines, and polygons), in order to efficiently find data that satisfy geographic predicates (i.e., key {contains,contained-in,overlaps,equals} constant). The keys in a (2D) R-tree are "bounding boxes", i.e. rectangles which are aligned with the x and y axes. A key signifies that all objects stored below it fall within that geographic area. If you want more detail on R-trees, Guttman's paper makes for fairly accessible reading.

In order to extend libGiST to support R-tree behavior, we need four of the basic GiST methods mentioned in the original GiST paper:

Consistent: compares a query rectangle and a key rectangle to see if they correspond in a way appropriate to the query predicate. One tricky bit in R-trees is that entries on internal nodes are treated differently than entries at the leaves. (RTpredicate::Consistent(const RTpredicate &)in RTpredicate.h).
Union: given a set of geographic objects, finds the minimal bounding box containing them all. (GiSTentry *RTnode::Union() in RTnode.h)
Penalty: returns a GiSTPenalty (just a double, really) representing the "badness" of inserting a new entry below this entry -- for R-trees, this is measured by the increase in area of this entry's bounding box required to accomodate the new entry. (GiSTpenalty *RTentry::Penalty(const GiSTentry) in RTentry.h)
PickSplit: Guttman proposes 3 different algorithms in his paper; we use his polynomial-time algorithm in our implementation. ( GiSTnode *RTnode::PickSplit() in RTnode.h)

We will also need various utility methods like constructors and destructors for our RTree classes, which inherit most (but not all!) of their behavior from libGiST. We won't talk about the constructors and destructors, which are pretty straightforward, but the rest of the simple utility methods are as follows (organized by file):

RT.h: contains the RT class for R-trees, which inherits from the GiST class of libGiST

GiSTobjid RT::IsA(): Not strictly necessary, but useful for debugging. All classes which inherit from the GiSTobject class can define an IsA method to identify themselves. The choice of object tags is in libGiST/GiSTdefs.h if you wish to extend it.
GiSTnode *RT::CreateNode(): This constructor method is critical, as it ensures that our trees are actually instances of the RT class, and hence will use the appropriate RT methods. N.B.: Your programs will compile but probably not work if you don't define constructors and other basic allocation utilties for all your classes!! This is because libGiST makes heavy use of the virtual function mechanism of C++, which requires that objects be allocated within their appropriate subclasses.
GiSTstore *RT::CreateStore(): this method identifies and passes control to the storage subsystem of your choice. libGiST is distributed with one storage subsystem, GiSTfile, which uses the file system to store indices. Smarter storage systems (e.g. DBMSs of various sorts, or buffer management packages) may be plugged in via inheritance to improve the speed and/or functionality of your code.

RTnode.h: defines the RTnode class for nodes in an R-tree, which inherits from GiSTnode

GiSTobjid RTnode::IsA(): as above
GiSTobject *RTnode::Copy(): as above, crucial to the correctness of your code.
GiSTnode *RTnode::PickSplit() : One of the basic GiST methods. Note that the splitting is effected by actually deleting entries from *this, and putting them on the page which is returned by the method.
GiSTentry *RTnode::Union(): another of the basic GiST methods described above
GiSTentry *RTnode::CreateEntry(): used to generate an entry of the appropriate subclass of GiSTentry, which in this case is RTentry. Like the Copy methods, this is crucial!
int RTnode::FixedLength(): tells libGiST two things -- that the keys in our nodes have a fixed length, and what that length is. If our entries were of variable length (as will be the case in BTree below), this method would return 0.

RTentry.h: defines the RTentry clas for entries in an R-tree, which inherits from GiSTnode. Also defines the RTkey class, which represents bounding boxes. N.B.: The GiSTentry class contains a member GiSTobject *key, which is used to hold keys. So our RTkey class must inherit from GiSTobject.

RTkey::Copy(): again, critical.
A variety of bounding-box methods, which are relatively self-explanatory.
RTkey &RTentry::Key(): casts the key member, which is of class GiSTobject to class RTkey. Used to prevent compiler warnings.
GiSTobject *RTentry::Copy(): as above, you can't do without it.
void RTentry::InitKey(): this method allocates a new key for the entry -- like the Copy methods, it's crucial that it return an instance of the appropriate key subclass (in this case, RTkey).
GiSTpenalty * RTentry::Penalty(const GiSTentry &newEntry): as discussed above
int RTentry::CompressedLength(): since libGiST supports compressed keys (discussed in BTree below), this method is required to calculate the length of this key (not the entry) when compressed. In this case it's just sizeof(RTkey).
A variety of other methods for accessing/setting properties of the key.

RTpredicate.h: Home of the RTpredicate class, which contains the Consistent method. In addition, RTpredicate has two important members, oper and value. The former distinguishes between the different box-comparison operators (Contains, Contained, Overlap, Equal) and the latter holds the bounding box to which we're comparing.

A quick look at the .cpp files will show that there's little more in them than what's described above -- the only method of any complexity is the PickSplit algorithm, which is taken directly from Guttman's paper.

RSTree

The R*-tree is very much like the R-tree, having identical keys and supporting identical query predicates. But it contains some optimizations that often make t outperform simple R-trees during lookup, by paying for some additional complexity during insertion. First, it has different Penalty and PickSplit methods. Second, it uses a technique known as Forced Reinsert to keep the tree fuller and better balanced than an R-tree.

Forced reinsert works as follows -- when an entry is inserted into a full node, some percentage of the entries on that node are deleted, and reinserted into the tree. If the node is at a non-leaf level, the reinsertion places the deleted entries back into nodes at that level. If reinsertion discovers a full node, then it splits it as usual. This sporadic reinsertion of elements accomodates changes in the data distribution as the tree fills in.

Forced reinsert is supported by libGiST, and the RSTree implementation simply takes advantage of it. The PickSplit and Penalty methods for RSTree are a bit complicated, as proposed by the inventors of the R*-tree.

RT.h:

int RT::ForcedReinsert(): this method is overriden here to return 1 rather than 0, turning on libGiST's forced reinsert logic.
GiSTlist<GiSTentry*> RT::RemoveTop(GiSTnode *node): this method deletes the top k% of entries on node, where k = GiST::RemoveRatio(), and returns them in a linked list. Beckmann et al. recommend removing the k% furthest from the center of the bounding box of the node. The default behavior in GiST::RemoveTop is just to remove the first k% of entries from the left of the node, so we overload it here to use Beckmann, et al's technique.

RTentry.h:

class RTpenalty: the standard GiSTPenalty class is just a container for a double. The R*-tree paper specifies a way of resolving ties if two penalties are equal, and yet another technique for resolving the tie if the first tie-breaker doesn't help. In order to support this, the RTpenalty class contains 3 doubles, and its comparison methods break ties accordingly.

These are the only significant differences between RSTree and RTree.

BTree

B+-trees have 3 fundamental differences from R-trees:

the keys in a B+-tree are in ascending order from left to right on each page.
range scans in a B+-tree do not traverse the tree; rather, they find the leftmost item and scan across the leaves of the tree until they exceed the right boundary of the range.
in a B+-tree node, there are 1 fewer keys than there are pointers

The first of these two issues are handled in our BTree implementation by informing libGiST that the tree has the IsOrdered property. This is described below. The third of these issues is handled via key compression: two virtual methods of GiSTentry -- Compress and Decompress -- were not used in our RTree implementation, but are added for BTree. They allow keys to be compressed on disk to take up less space and improve fanout. When keys are read off disk, they are decompressed, and before they are written back, they are compressed. The BTree implementation essentially compresses the first key on a node to nothing, which effectively gives the standard B+-tree layout of n-1 keys for n pointers.

We now describe the salient parts of the BTree implementation that make it differ from the RTree and RSTree implementations. First, of course, the keys in a BTree are different -- in uncompressed format, they are alphabetic ranges of words (our demo works on textual data -- see the BTkey class in BTentry.h). The operators that can appear in BTpredicates are the standard ordering operators: <, <=, =, >=, >; otherwise, BTpredicates are very similar to RTpredicates. Further differences are described by file:

BT.h:

BT::IsOrdered(): this method, inherited from the GiST class, defaults to returning 0, signifying FALSE. In the case of BTree, it returns 1, signifying TRUE, to tell libGiST to ensure that keys are ordered on each page, and that range queries are handled without the usual depth-first traversal¹.

BTnode.h

BTnode::Insert(const GiSTentry &entry): This routine includes an assertion for debugging purposes, and then calls its ancestor method GiSTnode::Insert. You could remove this descendant method without harm, but it's better to leave it in to prevent problems after modification of the code.
BTnode::InsertBefore(const GiSTentry &entry, int index): This routine requires some explanation. Its ancestor, GiSTnode::InsertBefore,simply inserts entry entry onto this node before the entry at the index'th spot, shuffling that entry and its neighbors over to the right. However, BTnode::InsertBefore overrides this behavior with some extra detail.

In uncompressed format, BTree keys are pairs of words demarcating a range, closed on the left end and open on the right (e.g. the range ["albacore", "apple") contains albacore, all the words between albacore and apple, but not apple). The leftmost entry (at index 0) has a left end of -infinity, and the rightmost entry has a right end of infinity. All other entries have a left end equal to their left neighbor's right end, and a right end equal to their right neighbors left end. BTnode::InsertBefore ensures that these properties remain true after insertion.

(N.B.: The above is not true for leaf nodes, which have keys that are ranges containing only the value of the item, e.g. ["albacore", "albacore). This is why GiSTnode::Insert works for leaf nodes but not for internal nodes.)

BTnode::Coalesce(const GiSTnode& node, const GiSTentry& entry): This method is also overridden to ensure the properties above. We are moving entries from node, which is to our right, to this. If this is a non-empty internal page, the rightmost entry on this has an upperbound of +inf, and the leftmost entry on node has a lower bound of -inf. This code takes care of that -- both these values should be set to the right bound of entry (our parent entry).

BTentry.h

class BTkey is a simple string class, with special cases for representing negative and positive infinity
class BTintkey is a simple class for pairs of BTkeys, representing a range. It is the key type for BTentry.
GiSTcompressedEntry BTEntry::Compress(): compresses a key and copies the pointer into the result. In this case, if it's the leftmost entry on the node it compresses down to 0 bytes. Any other entry compresses down to its lower boundary.
void BTEntry::Decompress(const GiSTcompressedEntry entry): decompresses a key and copies the pointer into the result. In this case, it reverses the logic of compress by peeking at the neighbor to the right to get the upper bound. If this is the leftmost entry, the lower bound is -inf.

Otherwise there should be no great surprises in the BTree code.

Writing your own methods

If you've made it this far, you should be ready to develop your own indexing code using libGiST. If you have ordered data and mostly are supporting range predicates, you should probably start with the BTree sample code and modify it to suit. If you have unordered data, it's probably best to start with the RTree code, and when you get things working, try adding Forced Resinsert (a la RSTree) and see if it helps.

Footnotes

¹In v0.9b1 range queries are handled depth-first even if IsOrdered() returns 1. This is a bug, which we hope to fix in a subsequent release.

Bibliography

[Gut84] Antonin Guttman. "R-Trees: A Dynamic Index Structure for Spatial Searching". In Proc. ACM SIGMOD, pp. 47-57, June, 1984.

[HNP95] Joseph M. Hellerstein, Jeffrey F. Naughton and Avi Pfeffer. "Generalized Search Trees for Database Systems". In Proc. 21st Int. Conf on Very Large Data Bases (VLDB), pp. 562-573.

[BKSS90] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. "The R*-tree: An Efficient and Robust Access Method For Points and Rectangles". In Proc. ACM SIGMOD, pp. 322-331, June 1990.