*In the beginning there was the B-tree. *All database search trees since
the B-tree have been variations on its theme. Recognizing this, we have
developed a new kind of index called a Generalized Search Tree (GiST),
which provides the functionality of all these trees in a single package. The
GiST is an extensible data structure, which allows users to develop indices
over any kind of data, supporting any lookup over that data. This package
unifies a number of popular search trees in one data structure (the long list
of potentials includes R-trees, B+-trees, hB-trees, TV-trees, Ch-Trees,
partial sum trees, ranked B+-trees, and many, many others), eliminating
the need to build multiple search trees for handling diverse applications. In
addition to unifying all these structures, the GiST has one key feature that
previous trees lacked: both data and *query *extensibility.

Previous search trees have been extensible in terms of the data they handle. For example, POSTGRES supports extensible B+-trees and R-trees. That means that you can use POSTGRES to build a B+-tree or R-tree over any data type you want. But B+-trees only support range predicates (<, = >), and R-trees only support n-d range queries (Contains, Contained, Equals). So if you index, say, a bunch of movies with a POSTGRES B+-tree, you can only ask queries like "Find all movies < T2". While this query might mean something (e.g. less-than could mean "less expensive", "less cool", or "less violent"), it's not really a natural thing to ask about movies. Instead you'd like to ask movie-specific queries like "find all movies with cool explosions", "find all movies with Arnold", or "find all movies with motorcycle chases". Such queries cannot be directly supported in a B+-tree, R-tree, or any other known structure except a GiST.

By contrast, you can program the GiST to support any query predicate,
including the `cool_explosions` and other predicates suggested above.
All it takes to get the GiST up and running is to implement 4 user-defined
methods, which define the behavior of keys in the tree. Of course these
methods have to be pretty fancy to support fancy queries, but for all the
standard queries (a la B-trees, R-trees, etc.) they're trivial. In short, the
GiST combines a new dimension of extensibility along with generality,
code reuse, and a nice clean interface.

The GiST is a balanced tree structure like a B-tree, containing <key, pointer> pairs. But keys in the GiST are not integers like the keys in a B-tree. Instead, a GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. For example, keys in a B+-tree-like GiST are ranges of numbers ("all data items below this pointer are between 4 and 6"); keys in an R-tree-like GiST are bounding boxes, ("all data items below this pointer are in Calfornia"); keys in an RD-tree-like GiST are sets ("all data items below this pointer are subsets of {1,6,7,9,11,12,13,72}"); etc. To make a GiST work, you just have to figure out what to represent in the keys, and then write 4 methods for the key class that help the tree do insertion, deletion, and search.

Here are the 4 methods required of the key class to make a GiST work:

**Consistent:**This method lets the tree search correctly. Given a key**p**on a tree page, and user query**q**, the Consistent method should return**NO**if it is certain that both**p**and**q**cannot be true for a given data item. Otherwise it should return**MAYBE**.**Union:**This method consolidates information in the tree. Given a set**S**of entries, this method returns a new key**p**which is true for all the data items below**S**. A simple way to implement**Union**is to return a predicate equivalent to the disjunction of the keys in**S**, i.e. "**p**_{1}or**p**_{2}or**p**_{3}or...".**Penalty:**Given a choice of inserting a new data item in a subtree rooted by entry**<p, ptr>**, return a number representing how bad it would be to do that. Items will get inserted down the path of least**Penalty**in the tree.**PickSplit:**As in a B-tree, pages in a GiST occasionally need to be split upon insertion of a new data item. This routine is responsible for deciding which items go to the new page, and which ones stay on the old page.

That's it! There are some optional additional methods that can enhance performance. These are described in the original paper on the data structure.

The simplest version of the **Consistent** method above is a routine that
*always* returns **MAYBE**. Clearly such a tree would not be very efficient
during search. For some datasets and queries it appears that *no*
implementation of the methods can guarantee good search time for all
desired queries.

We are working to develop a
mathematical methodology which (in analogy with decidability and tractability)
would evaluate rigorously the power and limitations of indexing techniques in
diverse contexts.
What differentiates such a theoretical approach to indexing from complexity
theory and the theory of in-memory data structures is its emphasis on the
*secondary storage* nature of indexing schemes, and on the two aspects that
determine their cost and feasibility: *storage utilization and disk
accesses*. We envision a rigorous analysis of data structures that takes into
account the storage requirements and the total delay due to block accesses.
Such analysis would ignore, for example, the CPU time necessary for determining
the blocks to be accessed.