GiST: A Generalized Search Tree for Secondary Storage

Introduction

In the beginning there was the B-tree. All database search trees since the B-tree have been variations on its theme. Recognizing this, we have developed a new kind of index called a Generalized Search Tree (GiST), which provides the functionality of all these trees in a single package. The GiST is an extensible data structure, which allows users to develop indices over any kind of data, supporting any lookup over that data. This package unifies a number of popular search trees in one data structure (the long list of potentials includes R-trees, B+-trees, hB-trees, TV-trees, Ch-Trees, partial sum trees, ranked B+-trees, and many, many others), eliminating the need to build multiple search trees for handling diverse applications. In addition to unifying all these structures, the GiST has one key feature that previous trees lacked: both data and query extensibility.

Query Extensibility

Previous search trees have been extensible in terms of the data they handle. For example, POSTGRES supports extensible B+-trees and R-trees. That means that you can use POSTGRES to build a B+-tree or R-tree over any data type you want. But B+-trees only support range predicates (<, = >), and R-trees only support n-d range queries (Contains, Contained, Equals). So if you index, say, a bunch of movies with a POSTGRES B+-tree, you can only ask queries like "Find all movies < T2". While this query might mean something (e.g. less-than could mean "less expensive", "less cool", or "less violent"), it's not really a natural thing to ask about movies. Instead you'd like to ask movie-specific queries like "find all movies with cool explosions", "find all movies with Arnold", or "find all movies with motorcycle chases". Such queries cannot be directly supported in a B+-tree, R-tree, or any other known structure except a GiST.

By contrast, you can program the GiST to support any query predicate, including the cool_explosions and other predicates suggested above. All it takes to get the GiST up and running is to implement 4 user-defined methods, which define the behavior of keys in the tree. Of course these methods have to be pretty fancy to support fancy queries, but for all the standard queries (a la B-trees, R-trees, etc.) they're trivial. In short, the GiST combines a new dimension of extensibility along with generality, code reuse, and a nice clean interface.

The Keys to the GiST

The GiST is a balanced tree structure like a B-tree, containing <key, pointer> pairs. But keys in the GiST are not integers like the keys in a B-tree. Instead, a GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. For example, keys in a B+-tree-like GiST are ranges of numbers ("all data items below this pointer are between 4 and 6"); keys in an R-tree-like GiST are bounding boxes, ("all data items below this pointer are in Calfornia"); keys in an RD-tree-like GiST are sets ("all data items below this pointer are subsets of {1,6,7,9,11,12,13,72}"); etc. To make a GiST work, you just have to figure out what to represent in the keys, and then write 4 methods for the key class that help the tree do insertion, deletion, and search.

The 4 Methods

Here are the 4 methods required of the key class to make a GiST work:

Consistent: This method lets the tree search correctly. Given a key p on a tree page, and user query q, the Consistent method should return NO if it is certain that both p and q cannot be true for a given data item. Otherwise it should return MAYBE.
Union: This method consolidates information in the tree. Given a set S of entries, this method returns a new key p which is true for all the data items below S. A simple way to implement Union is to return a predicate equivalent to the disjunction of the keys in S, i.e. "p₁ or p₂ or p₃ or...".
Penalty: Given a choice of inserting a new data item in a subtree rooted by entry <p, ptr>, return a number representing how bad it would be to do that. Items will get inserted down the path of least Penalty in the tree.
PickSplit: As in a B-tree, pages in a GiST occasionally need to be split upon insertion of a new data item. This routine is responsible for deciding which items go to the new page, and which ones stay on the old page.

That's it! There are some optional additional methods that can enhance performance. These are described in the original paper on the data structure.

Towards a Theory of Indexability

The simplest version of the Consistent method above is a routine that always returns MAYBE. Clearly such a tree would not be very efficient during search. For some datasets and queries it appears that no implementation of the methods can guarantee good search time for all desired queries.

We are working to develop a mathematical methodology which (in analogy with decidability and tractability) would evaluate rigorously the power and limitations of indexing techniques in diverse contexts. What differentiates such a theoretical approach to indexing from complexity theory and the theory of in-memory data structures is its emphasis on the secondary storage nature of indexing schemes, and on the two aspects that determine their cost and feasibility: storage utilization and disk accesses. We envision a rigorous analysis of data structures that takes into account the storage requirements and the total delay due to block accesses. Such analysis would ignore, for example, the CPU time necessary for determining the blocks to be accessed.

Access methods must be evaluated in the context of a particular workload. A workload consists of an instance (a finite subset of some domain), together with a set of queries (a given set of subsets of the instance). In one-dimensional indices such as B-trees, for example, the instance is some linearly ordered set, and the most common queries considered are range queries, that is to say, intervals of this order. Other common workloads are multi-dimensional point sets with range queries, powersets with intersection or inclusion queries, and so on. The workload plays in indexability theory the role played by the partially recursive languages in complexity or decidability theory: it is the unit whose complexity must be characterized; more accurately, the analog of a language is a family of workloads, one for each cardinality of the instance. Such growing families of workloads allow us to focus on asymptotic analysis and ignore additive constants. For each workload we have a space of possible indexing schemes (the analog of algorithms that partially decide the language). Such an indexing scheme is a collection of blocks, each of some size B that we assume fixed and very large (in the hundreds or thousands). Each block contains B elements of the instance, and the union of the blocks exhausts the instance. Each query is now answered by retrieving a set of blocks, whose union is a superset of the query.

We have results showing interesting upper and lower bounds for the storage utilization and disk accesses of range queries in n dimensions, as well as for subset/superset queries over powersets.