A Guided Tour of Sample Code
The libGiST package is a C++ library that provides the basic GiST functionality.
By itself it does nothing, since some of the important methods are pure
virtual. However, with a little coding on your part, you can make it
index data in a variety of ways, and support just about any query.
We have provided some examples to get you started. The RTree directory
contains code which uses libGiST as the basis for a simple R-tree-like
storage/retrieval system that indexes 2-dimensional box data, and supports
natural geographic queries. The RSTree directory provides the same
functionality as the RTree directory, but uses the more efficient (and
complex) R*-tree techniques invented by Beckman, et al. [BKSS90].
The BTree directory contains code which uses libGiST to implement a simple
B+-tree storage/retrieval system that indexes textual data, to support
alphabetic equality and range queries.
As an introduction to libGiST, we will step through the header files
for each of these small projects, which will introduce you to increasingly
powerful features of libGiST. We will go through the simplest project,
RTree, in some detail, and then just describe the features of RSTree and
BTree that are different from RTree. Probably the easiest way to
use libGiST is to cannibalize some of this sample code, so it's worth looking
at carefully.
RTree
R-trees were first proposed by Guttman [Gut84] to index geographic data
(points, lines, and polygons), in order to efficiently find data that satisfy
geographic predicates (i.e., key {contains,contained-in,overlaps,equals}
constant). The keys in a (2D) R-tree are "bounding boxes",
i.e. rectangles which are aligned with the x and y axes. A key signifies
that all objects stored below it fall within that geographic area. If you
want more detail on R-trees, Guttman's paper makes for fairly accessible
reading.
In order to extend libGiST to support R-tree behavior, we need four
of the basic GiST methods mentioned in the original GiST paper:
- Consistent: compares a query rectangle and a key rectangle to see if
they correspond in a way appropriate to the query predicate. One tricky
bit in R-trees is that entries on internal nodes are treated differently
than entries at the leaves. (RTpredicate::Consistent(const RTpredicate
&)in RTpredicate.h).
- Union: given a set of geographic objects, finds the minimal bounding
box containing them all. (GiSTentry *RTnode::Union() in
RTnode.h)
- Penalty: returns a GiSTPenalty (just a double, really)
representing the "badness" of inserting a new entry below this
entry -- for R-trees, this is measured by the increase in area of this
entry's bounding box required to accomodate the new entry. (GiSTpenalty
*RTentry::Penalty(const GiSTentry) in RTentry.h)
- PickSplit: Guttman proposes 3 different algorithms in his paper; we
use his polynomial-time algorithm in our implementation. ( GiSTnode
*RTnode::PickSplit() in RTnode.h)
We will also need various utility methods like constructors and destructors
for our RTree classes, which inherit most (but not all!) of their behavior
from libGiST. We won't talk about the constructors and destructors, which
are pretty straightforward, but the rest of the simple utility methods
are as follows (organized by file):
- RT.h: contains the RT class for R-trees, which inherits from
the GiST class of libGiST
- GiSTobjid RT::IsA(): Not strictly necessary, but useful
for debugging. All classes which inherit from the GiSTobject class
can define an IsA method to identify themselves. The choice of
object tags is in libGiST/GiSTdefs.h if you wish to extend it.
- GiSTnode *RT::CreateNode(): This constructor method is
critical, as it ensures that our trees are actually instances of
the RT class, and hence will use the appropriate RT methods. N.B.: Your
programs will compile but probably not work if you don't define
constructors and other basic allocation utilties for all your classes!!
This is because libGiST makes heavy use of the virtual function mechanism
of C++, which requires that objects be allocated within their appropriate
subclasses.
- GiSTstore *RT::CreateStore(): this method identifies and passes
control to the storage subsystem of your choice. libGiST is distributed
with one storage subsystem, GiSTfile, which uses the file system to store
indices. Smarter storage systems (e.g. DBMSs of various sorts, or
buffer management packages) may be plugged in via inheritance to improve
the speed and/or functionality of your code.
- RTnode.h: defines the RTnode class for nodes in an R-tree,
which inherits from GiSTnode
- GiSTobjid RTnode::IsA(): as above
- GiSTobject *RTnode::Copy(): as above, crucial to the
correctness of your code.
- GiSTnode *RTnode::PickSplit() : One of the basic GiST methods.
Note that the splitting is effected by actually deleting entries from *this,
and putting them on the page which is returned by the method.
- GiSTentry *RTnode::Union(): another of the basic GiST methods
described above
- GiSTentry *RTnode::CreateEntry(): used to generate an entry
of the appropriate subclass of GiSTentry, which in this case is
RTentry. Like the Copy methods, this
is crucial!
- int RTnode::FixedLength(): tells libGiST two things -- that
the keys in our nodes have a fixed length, and what that length is. If
our entries were of variable length (as will be the case in BTree below),
this method would return 0.
- RTentry.h: defines the RTentry clas for entries in an R-tree,
which inherits from GiSTnode. Also defines the RTkey
class, which represents bounding boxes. N.B.: The GiSTentry
class contains a member GiSTobject *key, which is used to hold
keys. So our RTkey class must inherit from GiSTobject.
- RTkey::Copy(): again, critical.
- A variety of bounding-box methods, which are relatively self-explanatory.
- RTkey &RTentry::Key(): casts the key member, which is
of class GiSTobject to class RTkey. Used to prevent
compiler warnings.
- GiSTobject *RTentry::Copy(): as above, you can't do without
it.
- void RTentry::InitKey(): this method allocates a new key for
the entry -- like the Copy methods, it's crucial that it return
an instance of the appropriate key subclass (in this case, RTkey).
- GiSTpenalty * RTentry::Penalty(const GiSTentry &newEntry):
as discussed above
- int RTentry::CompressedLength(): since libGiST supports compressed
keys (discussed in BTree below), this method is required to calculate the
length of this key (not the entry) when compressed. In this case
it's just sizeof(RTkey).
- A variety of other methods for accessing/setting properties of the
key.
- RTpredicate.h: Home of the RTpredicate class, which contains
the Consistent method. In addition, RTpredicate has two
important members, oper and value. The former distinguishes
between the different box-comparison operators (Contains, Contained,
Overlap, Equal) and the latter holds the bounding box to which we're
comparing.
A quick look at the .cpp files will show that there's little more in
them than what's described above -- the only method of any complexity is
the PickSplit algorithm, which is taken directly from Guttman's paper.
RSTree
The R*-tree is very much like the R-tree, having identical keys and
supporting identical query predicates. But it contains some optimizations
that often make t outperform simple R-trees during lookup, by paying for
some additional complexity during insertion. First, it has different Penalty
and PickSplit methods. Second, it uses a technique known as Forced Reinsert
to keep the tree fuller and better balanced than an R-tree.
Forced reinsert works as follows -- when an entry is inserted into a
full node, some percentage of the entries on that node are deleted, and
reinserted into the tree. If the node is at a non-leaf level, the reinsertion
places the deleted entries back into nodes at that level. If reinsertion
discovers a full node, then it splits it as usual. This sporadic reinsertion
of elements accomodates changes in the data distribution as the tree fills
in.
Forced reinsert is supported by libGiST, and the RSTree implementation
simply takes advantage of it. The PickSplit and Penalty methods for
RSTree are a bit complicated, as proposed by the inventors of the R*-tree.
- RT.h:
- int RT::ForcedReinsert(): this method is overriden here to
return 1 rather than 0, turning on libGiST's forced reinsert logic.
- GiSTlist<GiSTentry*> RT::RemoveTop(GiSTnode *node):
this method deletes the top k% of entries on node, where k = GiST::RemoveRatio(),
and returns them in a linked list. Beckmann et al. recommend removing the
k% furthest from the center of the bounding box of the node. The default
behavior in GiST::RemoveTop is just to remove the first k% of
entries from the left of the node, so we overload it here to use Beckmann,
et al's technique.
- RTentry.h:
- class RTpenalty: the standard GiSTPenalty class is
just a container for a double. The R*-tree paper specifies a way of resolving
ties if two penalties are equal, and yet another technique for resolving
the tie if the first tie-breaker doesn't help. In order to support this,
the RTpenalty class contains 3 doubles, and its comparison methods
break ties accordingly.
These are the only significant differences between RSTree and RTree.
BTree
B+-trees have 3 fundamental differences from R-trees:
- the keys in a B+-tree are in ascending order from left to right on
each page.
- range scans in a B+-tree do not traverse the tree; rather, they find
the leftmost item and scan across the leaves of the tree until they exceed
the right boundary of the range.
- in a B+-tree node, there are 1 fewer keys than there are pointers
The first of these two issues are handled in our BTree implementation
by informing libGiST that the tree has the IsOrdered property. This
is described below. The third of these issues is handled via key
compression: two virtual methods of GiSTentry -- Compress and Decompress
-- were not used in our RTree implementation, but are added for BTree.
They allow keys to be compressed on disk to take up less space and improve
fanout. When keys are read off disk, they are decompressed, and before
they are written back, they are compressed. The BTree implementation essentially
compresses the first key on a node to nothing, which effectively gives
the standard B+-tree layout of n-1 keys for n pointers.
We now describe the salient parts of the BTree implementation that make
it differ from the RTree and RSTree implementations. First, of course,
the keys in a BTree are different -- in uncompressed format, they are alphabetic
ranges of words (our demo works on textual data -- see the BTkey
class in BTentry.h). The operators that can appear in BTpredicates are
the standard ordering operators: <, <=, =, >=, >; otherwise,
BTpredicates are very similar to RTpredicates. Further differences
are described by file:
- BT.h:
- BT::IsOrdered(): this method, inherited from the GiST
class, defaults to returning 0, signifying FALSE. In the case of BTree,
it returns 1, signifying TRUE, to tell libGiST to ensure that keys are
ordered on each page, and that range queries are handled without the usual
depth-first traversal1.
- BTnode.h
- BTnode::Insert(const GiSTentry &entry): This routine includes
an assertion for debugging purposes, and then calls its ancestor method
GiSTnode::Insert. You could remove this descendant method
without harm, but it's better to leave it in to prevent problems after
modification of the code.
- BTnode::InsertBefore(const GiSTentry &entry, int index):
This routine requires some explanation. Its ancestor, GiSTnode::InsertBefore,
simply inserts entry entry onto this node before the entry
at the index'th spot, shuffling that entry and its neighbors over
to the right. However, BTnode::InsertBefore overrides this behavior
with some extra detail.
In uncompressed format, BTree keys are pairs of words demarcating a
range, closed on the left end and open on the right (e.g. the range ["albacore",
"apple") contains albacore, all the words between albacore
and apple, but not apple). The leftmost entry (at index 0) has a left end
of -infinity, and the rightmost entry has a right end of infinity.
All other entries have a left end equal to their left neighbor's right
end, and a right end equal to their right neighbors left end. BTnode::InsertBefore
ensures that these properties remain true after insertion.
(N.B.: The above is not true for leaf nodes, which have keys
that are ranges containing only the value of the item, e.g. ["albacore",
"albacore). This is why GiSTnode::Insert works for leaf
nodes but not for internal nodes.)
- BTnode::Coalesce(const GiSTnode& node, const GiSTentry&
entry): This method is also overridden to ensure the properties above.
We are moving entries from node, which is to our right, to this.
If this is a non-empty internal page, the rightmost entry on this
has an upperbound of +inf, and the leftmost entry on node has
a lower bound of -inf. This code takes care of that -- both these values
should be set to the right bound of entry (our parent entry).
- BTentry.h
- class BTkey is a simple string class, with special cases for
representing negative and positive infinity
- class BTintkey is a simple class for pairs of BTkeys,
representing a range. It is the key type for BTentry.
- GiSTcompressedEntry BTEntry::Compress(): compresses a key
and copies the pointer into the result. In this case, if it's the leftmost
entry on the node it compresses down to 0 bytes. Any other entry
compresses down to its lower boundary.
- void BTEntry::Decompress(const GiSTcompressedEntry entry):
decompresses a key and copies the pointer into the result. In this case,
it reverses the logic of compress by peeking at the neighbor to the right
to get the upper bound. If this is the leftmost entry, the lower
bound is -inf.
Otherwise there should be no great surprises in the BTree code.
Writing your own methods
If you've made it this far, you should be ready to develop your own
indexing code using libGiST. If you have ordered data and mostly
are supporting range predicates, you should probably start with the BTree
sample code and modify it to suit. If you have unordered data, it's
probably best to start with the RTree code, and when you get things working,
try adding Forced Resinsert (a la RSTree) and see if it helps.
Footnotes
1In v0.9b1 range queries are handled depth-first even if
IsOrdered() returns 1. This is a bug, which we hope to fix in
a subsequent release.
Bibliography
[Gut84] Antonin Guttman. "R-Trees: A Dynamic Index Structure for Spatial
Searching". In Proc. ACM SIGMOD, pp. 47-57, June, 1984.
[HNP95] Joseph M. Hellerstein, Jeffrey F. Naughton and Avi Pfeffer.
"Generalized Search Trees for Database Systems". In Proc. 21st Int. Conf on
Very Large Data Bases (VLDB), pp. 562-573.
[BKSS90] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider,
and Bernhard Seeger. "The R*-tree: An Efficient and Robust Access Method For
Points and Rectangles". In Proc. ACM SIGMOD, pp. 322-331, June 1990.