- Introduction
- Related Work
- Preliminaries
- G-Trees
- Algorithms
- Conclusion
- References
- Appendix A: Code
- Appendix B: G-Tree Variants
- Appendix C: Explicit Insertion and Deletion

We describe G-trees, a family of randomized, history-independent search tree data structures. The G-trees encompass several independently-discovered data structures such as zip-trees, zip-zip-trees, and dense skip-trees. The family further contains novel trees of arity greater than two, which are significantly more efficient in the presence of cache hierarchies or block storage than zip-trees. Traditionally, such randomized trees have been significantly more complex than their binary counterparts, whereas our $k$-ary G-trees have no additional conceptual overhead at all. We generalize the zip and unzip operations of zip-trees to provide a uniform, simple, and efficient implementation technique for all members of our family of data structures.

Randomized set data structures eschew self-balancing logic for simpler, probabilistic item organization. When deriving the necessary randomness via pseudorandom functions of the stored items themselves, the resulting graphs depend only on the stored set, not on the order of insertions and deletions. This history-independence ensures that no information about previously deleted items can be reconstructed (Naor & Teague, 2001), and it enables efficient set fingerprinting (Pugh & Teitelbaum, 1989) when using the data structure as a merkle-tree (Merkle, 1989).

For these reasons, randomized search trees and related data structures have been studied for decades. The most prominent such data structures are treaps (Seidel & Aragon, 1996), skip-lists (Pugh, 1990), and, more recently, zip-trees (Tarjan et al., 2021). All three are different takes on approximating the distribution of items in perfectly balanced binary search trees.

While *binary* trees are highly efficient in theory, they are less efficient on actual hardware than trees that store more than one item per vertex. Unfortunately, generalizing binary randomized data structures to higher-arity counterparts has proven more difficult than in the case of deterministically self-balancing trees. Providing a simple such generalization is the impetus for our work. Figure 1 plots lookup performance of a zip-tree versus a 32-ary tree of ours; our tree is roughly twice as fast (note the logarithmic y-axis). On secondary storage, we can expect the performance difference to be even more pronounced.

We present the geometric search trees (G-trees), a family of randomized search trees that provides a unified perspective on several independently researched data structures, including zip-trees, zip-zip-trees (Gila et al., 2023), skip-trees (Messeguer, 1997), dense skip-trees (Spiegel & Reynolds Jr, 2009), merkle-search-trees (Auvolat & Taïani, 2019), and prolly-trees (Boodman et al., 2016). Our framework allows us to trivially define efficient trees that store a bounded number of items per vertex. These trees are arguably the first such randomized search trees whose conceptual complexity is just as low as that of their binary counterparts.

Our key insight is to take a byproduct of the usual definition of zip-trees, turn it into a defining property of its own, and to then generalize it. Zip-trees assign geometrically chosen ranks to their items, and use these ranks for probabilistic balancing. A consequence of their balancing mechanism is that certain sequences of items with colliding ranks form sorted linked lists. It turns out we can view and even *define* zip-trees as collections of such linked lists.

From this definition, which is based on arranging *sorted linked lists* in a certain fashion, it is only a small step to arranging *arbitrary set data structures* in the same fashion. This yields our G-trees, a family of trees that is parameterized over a secondary search data structure. Using simple linked lists as the underlying data structure yields the zip-trees. Recursively instantiating the G-trees with other G-trees yields a natural (and efficient) generalization of the zip-zip-trees. And finally, using linked lists of $k$ items per vertex yields a generalization of the zip-trees where each node can store up to $k$ items.

The G-trees not only define a unique tree shape for any set of items with associated ranks, they also offer a unified means of implementation. We provide generalizations of the zipping and unzipping algorithms of the original zip-trees, and use them to implement insertion and deletion. These general algorithms can be applied to all G-trees whose underlying set data structure supports splitting at arbitrary keys and joining two non-overlapping sets. The algorithms perform a number of splits or joins proportional to the number of distinct ranks in the tree, which is logarithmic in the total number of items with high probability.

We give an overview of related work in Section 2, before introducing preliminary definitions and notation in Section 3. We present the geometric trees in Section 4, including a thorough analysis and an overview of old and novel G-trees. Section 5 describes efficient algorithms for mutating G-trees.

The practically-minded reader can safely skip ahead to Section 3, whereas the more academically inclined may wish to stick around for Section 2, where we outline the current state of the art and contextualize our contributions.

Data structures whose exact shape is determined solely by their contents and not by the order of insertion and deletion operations have been studied for decades. This property has been given several names, such as unique representation (Snyder, 1977), structural unicity (Auvolat & Taïani, 2019), confluent persistence (Driscoll et al., 1994), and anti-persistence or history-independence (Naor & Teague, 2001). Deterministically self-balancing history-independent set data structures necessarily take super-logarithmic time to update under arbitrary insertions and deletions (Snyder, 1977). Hence, several *probabilistic* history-independent data structures have been devised which support membership queries and update operations in logarithmic time with high probability.

Well-known such (pseudo-) randomized set data structures include hash tries (as presented, for example, by Pugh and Teitelbaum (Pugh & Teitelbaum, 1989)), treaps (Seidel & Aragon, 1996), and skip-lists (Pugh, 1990). More recently, zip-trees (Tarjan et al., 2021) and zip-zip-trees (Gila et al., 2023) have been proposed as more efficient variants of skip-lists.

All these data structures approximate the vertex distribution of *binary* balanced search trees. While binary search trees are theoretically efficient, CPU caches or block-sized reads from secondary storage make it so that trees that store more than a single item per vertex outperform binary trees in practice. Theoretical models to capture this behavior in the analysis of algorithms and data structures include external memory models (Aggarwal & Vitter, 1988) and cache-oblivious models (Frigo et al., 1999).

Several attempts have been made to find randomized data structures that perform well in an external memory model. Golovin (Golovin, 2009) has proposed bushy treaps (B-treaps) to approximate the behavior of B-trees via treaps. Unfortunately, B-treaps are complicated enough that even their author recommends using simpler alternatives such as the B-skip-list (Golovin, 2010). The B-skip-list still involves a tuning parameter beyond the probability distribution for assigning node levels, a nontrivial invariant, and virtual memory management via hash tables rather than simple usage of pointers. In short, the conceptual complexity goes far beyond that of binary skip-lists or treaps.

Safavi and Seybold (Safavi & Seybold, 2023) propose randomized-block-search-trees (RBSTs) as another generalization of binary treaps that performs well in an external-memory model. The construction involves multiple tuning parameters, two distinct layers of data structure internals, and a highly nontrivial complexity analysis.

Bender et al. (Bender et al., 2016) first specify a history-independent packed-memory array (PMA), and then build a history-independent B-tree analogue and an external-memory skip-list on top of the PMA. So there is again a two-layered approach; the PMA introduces a significant chunk of conceptual complexity that is not part of regular treaps or skip-lists.

Some proponents of randomized data structures claim that a significant advantage over self-balancing data structures is their greater simplicity (Pugh, 1990; Seidel & Aragon, 1996). It seems safe to say that none of the external-memory constructions we have just listed retain this advantage.

All prior work that *does* achieve sufficient simplicity buys it at the price of vertices that must store a dynamic, unbounded number of items. The skip-trees (Messeguer, 1997) are the first such data structure, effectively converting a skip-list into a tree: each vertex stores sequences of items that are being skipped-over together in the corresponding skip-list. The relatively unknown dense skip-trees (Spiegel & Reynolds Jr, 2009) provide two optimizations: expected node size can be increased by flipping coins of success probability $\frac{1}{k}$, and empty vertices are eliminated from the tree. The merkle-search-trees (MSTs) (Auvolat & Taïani, 2019) independently reinvent skip-trees with flexible probabilities similar to dense skip-trees, but without the compression of empty vertices. Additionally, prolly-trees (Boodman et al., 2016) provide a slightly different take, where a rolling hash function over the keys/values is used to determine the set of items in each node, rather than via explicit split and join operations.

None of these data structures can provide a non-probabilistic upper bound on the number of items per vertex. This hampers efficient implementation, and adversarial data suppliers can trivially produce, in $O(n)$ expected time, $n$ items that must all be stored in the same vertex.

Here, we define fundamental terminology, and provide proper definitions and background for zip-trees.

A tree data structure for items from some universe $U$ is either the empty tree, or a vertex consisting of a sequence of $k−1$ items from $U$ and a sequence of $k$ trees called its children.

We write $t$.items for the items of $t$, and $t$.items[$i$] for the $i$-th item of $t$. We write $t$.children for the children of $t$, and $t$.children[$i$] for the $i$-th child of $t$. Indexing always starts at zero.

Let $t$ be a tree. The set of $t$, its children, *their* children, and so on, is called the set of subtrees of $t$. We say $t$ is of arity $k$ if all subtrees of $t$ have at most $k$ children.

If $t$ is a binary tree, we refer to its first child as its left child, and to its second child as its right child.

Let $⪯$ be a total order on $U$, and let $c$ be the number of children of $t$. Intuitively speaking, a search tree keeps its items sorted from left to right, so that an in-order traversal yields a sorted sequence. Formally, $t$ is a search tree (with respect to $⪯$) if it is the empty tree, or if

- $t$.items is sorted with respect to $⪯$,
- all items in $t$.children[$0$] are less than $t$.items[$0$],
- for all $0 < i < c-1$, all items in $t$.children[$i$] are greater than $t$.items[$i-1$] and less than $t$.items[$i$], and
- all items in $t$.children[$c-1$] are greater than $t$.items[$c-2$], and
- all children of $t$ are themselves search trees.

We call $t$ a heap (in a heap, intuitively speaking, no item is a descendant of a lesser item) if it is the empty tree, or if no item in any child of $t$ is greater than any item of $t$, and all children of $t$ are themselves heaps.

Let $p$ be a probability. A geometric distribution $G(p)$ is a discrete probability distribution where the random variable $X$ takes on value $k ∈ \mathbb{N}_{+}$ with probability $P(X = k) = p \cdot (1-p)^{k-1}$. In plain language: take a coin that shows heads with probability $p$, flip it until it shows heads, and count the total number of flips. That is, we can interpret $k$ as the outcome of a series of Bernoulli trials with success probability $p$, where the rank $k$ represents the total number of trials until (and including) the first success. As is customary, we often write $q := 1-p$ for the failure probability of the Bernoulli trial.

The expected value of $X$ is $E[X] = \frac{1}{p} = \frac{1}{1-q}$ (Forbes et al., 2011), though most people would actually look this up on Wikipedia rather than in a textbook.
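
As a quick numeric sanity check (ours, not from the paper), the following Python snippet verifies that the pmf $P(X=k) = p(1-p)^{k-1}$ sums to one and that the mean matches $\frac{1}{p}$:

```python
def geometric_pmf(p: float, k: int) -> float:
    """P(X = k): probability that the first success of a Bernoulli(p)
    sequence occurs on trial k."""
    return p * (1 - p) ** (k - 1)

# Truncating the infinite sums at 10,000 terms leaves negligible error.
for p in (0.5, 0.25):
    total = sum(geometric_pmf(p, k) for k in range(1, 10_000))
    mean = sum(k * geometric_pmf(p, k) for k in range(1, 10_000))
    assert abs(total - 1.0) < 1e-9
    assert abs(mean - 1 / p) < 1e-6
```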

We often wish to pseudorandomly map items from some universe $U$ to geometrically distributed ranks. To this end, we require a rank function $rank_{p} : U → \mathbb{N}$, such that $rank_{p}(u)$ for any $u ∈ U$ is drawn independently from the geometric distribution $G(p)$. We simply write $rank(u)$ when $p$ is unimportant or clear from context.

In practice, $rank_{1/2}(u)$ can be implemented by hashing $u$ with a secure hash function and counting the number of trailing zeros in the binary representation of the digest. This can also be interpreted as the largest power of two that divides the digest of $u$, as used by Pugh and Teitelbaum (Pugh & Teitelbaum, 1989). Auvolat and Taïani (Auvolat & Taïani, 2019) generalize this construction to distributions $G(1/2^{k})$ by counting trailing or leading zeroes in the base-$2^{k}$ representation of uniformly distributed pseudorandom integers. This is equivalent to counting all-zero groups of $k$ successive digits in a binary representation.
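
For instance, here is a minimal Python sketch of such a rank function (ours; the name `rank_half` is our own), hashing with SHA-256 and counting trailing zero bits:

```python
import hashlib

def rank_half(u: bytes) -> int:
    """Pseudorandom rank distributed as G(1/2): one plus the number of
    trailing zero bits in the SHA-256 digest of the item."""
    digest = int.from_bytes(hashlib.sha256(u).digest(), "big")
    if digest == 0:  # astronomically unlikely; avoids an unbounded count
        return 257
    trailing_zeros = (digest & -digest).bit_length() - 1
    return 1 + trailing_zeros
```

Since each trailing bit of a uniform digest is zero with probability $\frac{1}{2}$, roughly half of all items receive rank 1, a quarter rank 2, and so on, and the rank of an item never changes between runs.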

We can now give a formal definition of zip trees (Tarjan et al., 2021). Let $rank_{1/2}$ be a rank function. The zip tree of some set $S$ is the unique search tree whose set of items is $S$, and which is a heap when ordering by rank first, item second. Equivalently, it must be a heap with respect to ranks such that left children always have strictly lesser rank than their parents. Figure 2 gives an example. (The example tree is taken from a Stack Overflow answer, where the interested reader can also find a detailed description of the isomorphism to skip-lists for that same example tree.)

We derive our G-trees by generalizing zip trees. To do so, we start with a non-standard definition of zip trees:

This recursive definition reveals an interesting property of zip trees: colliding ranks. From this definition, it is easy to see that the items of maximal rank form a linked list of right children at the root of the zip-tree. The same holds recursively in all subtrees. We can characterize these linked lists of items of equal rank as follows: a sequence $Q = q_0 ≺ q_1 ≺ \dots ≺ q_k$ of items collides in a superset $S ⊇ Q$ if all items in $Q$ have equal rank, and there is no $s ∈ S$ of greater rank such that $q_0 ≺ s ≺ q_k$. Figure 3 visualizes the linked lists formed by maximal sets of colliding items in a zip tree.

We can even *define* zip trees in terms of maximal colliding item sequences, by using all items of maximal rank as partition items in a single construction step:

Figure 4 visualizes this perspective on zip trees. Note how the linked-list pointers are exactly the pointers between items of equal rank, whereas both left and right child pointers always involve a strict decrease in rank. In that sense, the list-based definition fixes an asymmetry of the original definition, which only requires strictly decreasing ranks for left children.

From this non-standard characterization of zip-trees, there is a natural generalization: rather than representing maximal sequences of colliding items (and the items’ left subtrees) with sorted linked lists, we can represent them with arbitrary set data structures:

This generalization turns out to be remarkably powerful. We can express several well-known data structures as G-trees (Section 4.2), we find novel, cache-efficient data structures amongst the G-trees (Section 4.3), and we can derive efficient implementation techniques for all of them (Section 5).

We can regard any G-tree from two perspectives. First, as a high-level tree of maximal colliding sequences. We refer to the nodes of this conceptual tree as G-nodes. The height of this tree is equal to the number of distinct ranks of the underlying set, which is in $O(\log n)$ with high probability (we give an analysis in Section 4.1). Figure 5 visualizes our running example from this perspective.

And second, given a specific set data structure $S$, we can study the resulting in-memory graphs with pointers as edges. If $S$ is the type of sorted linked lists, for example, these concrete graphs are isomorphic to the zip trees (compare Figure 4, which we can interpret as depicting a G-tree using sorted linked lists).

When we select the rank of each item as a pseudorandom function of the item itself, a tree of G-nodes is uniquely determined by the set of its items. Hence, if $S$ is history-independent, so is the G-tree.
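
To make the recursive construction of the tree of G-nodes concrete, here is a minimal Python sketch (ours, not the paper's code). It takes a sorted item list and a rank function; for determinism, the test below uses an explicit rank table instead of a hash-based rank function:

```python
def build_gnode_tree(items, rank):
    """Build the tree of G-nodes for a sorted list of distinct items:
    the items of maximal rank become the root G-node, and the gaps
    between them are built recursively."""
    if not items:
        return None
    top = max(rank(x) for x in items)
    children, gap = [], []
    for x in items:
        if rank(x) == top:
            # A maximal-rank item closes the current gap.
            children.append(build_gnode_tree(gap, rank))
            gap = []
        else:
            gap.append(x)
    children.append(build_gnode_tree(gap, rank))
    pivots = [x for x in items if rank(x) == top]
    return {"rank": top, "items": pivots, "children": children}

def in_order(node):
    """In-order traversal; must return the items in sorted order."""
    if node is None:
        return []
    out = []
    for child, item in zip(node["children"], node["items"] + [None]):
        out += in_order(child)
        if item is not None:
            out.append(item)
    return out
```

Because the construction looks only at the set of items and their ranks, any insertion history that produces the same set produces the same tree of G-nodes.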

When interpreted as trees of G-nodes, the G-trees are exactly the dense skip-trees of Spiegel and Reynolds Jr (Spiegel & Reynolds Jr, 2009). Spiegel and Reynolds Jr never consider the internal structure of the G-nodes, however, thus missing the instantiations that we discuss later.

The math-averse reader can safely skip ahead to Section 4.2, as long as they believe us when we say that G-nodes form trees of logarithmic height with high probability, and that each G-node contains $O(1)$ items with high probability.

We now give a formal analysis of the performance-related properties of G-trees. Roughly speaking, we show that G-trees with a geometric distribution of some probability $1 - \frac{1}{k}$ are similar to perfectly balanced $k$-ary trees with high probability: the height (in terms of G-nodes) stays within a constant factor of $\log_k(n)$, and the maximal number of items per G-node stays within a constant factor of $k$.

Throughout our analyses, we let $k$ be a natural number greater than or equal to two; $k$ is the intended average number of children per G-node (leaving us with an intended average of $k-1$ items per G-node). We let $T$ be a G-tree of $n$ items. We then consider G-trees for the geometric distribution $G(p)$ with $p := 1 - \frac{1}{k}$, and set $q := 1 - p = \frac{1}{k}$, as this will yield the intended average G-node size.

The rank of any G-node is a geometric random variable with parameter $p = 1 - \frac{1}{k} = 1 - q$. The height of $T$ is upper-bounded by the rank of its root node, since each child has a strictly lesser rank than its parent.

The root rank $M_T$ is the maximum of the ranks of all items in $T$, i.e., it is the maximum of $n$ independent samples of the geometric distribution $G(p)$. While it is known that the expected value of $M_T$ is roughly $\log_k(n) + O(1)$, there is no simple closed-form expression (Eisenberg, 2008; Szpankowski & Rego, 1990).

To explicitly bound $M_T$, fix any $r \le \lceil \log_k(n) \rceil$. Then, the probability that the rank of a G-node $g$ is at least $r$ is at most $q^{r-1}$, i.e., $P(rank(g) \ge r) \le q^{r-1}$. By the union bound, the probability that there exists *some* G-node such that $rank(g) \ge r$ is $P(\exists g: rank(g) \ge r) \le n q^{r-1}$, and so for some positive constant $c$, $P(M_T \ge (c+1)\log_k(n)) \le n^{-c}$ (Golovin, 2010).
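
To make the union bound concrete, the following Python snippet (ours, with arbitrary example values for $n$, $k$, and $c$) checks that choosing $r = \lceil 1 + (c+1)\log_k(n) \rceil$ pushes $n q^{r-1}$ below $n^{-c}$:

```python
import math

# With q = 1/k, q**((c+1) * log_k(n)) equals n**-(c+1), so
# n * q**(r-1) <= n**-c whenever r - 1 >= (c+1) * log_k(n).
n, k, c = 1_000, 2, 2
q = 1 / k
r = math.ceil(1 + (c + 1) * math.log(n, k))
bound = n * q ** (r - 1)
assert bound <= n ** -c
```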

Even for $r > \lceil \log_k(n) \rceil$, the tail probability contribution is relatively small; we can estimate it by simplifying the geometric series:

$$\sum_{r=\lceil \log_k(n) \rceil + 1}^{\infty} n q^{r-1} = \frac{n q^{\lceil \log_k(n) \rceil}}{1-q} \le \frac{n q^{\log_k(n)}}{1-q} = \frac{1}{1-q} = \frac{1}{p} \le k.$$

Hence, $E(M_T) \approx \lceil \log_k(n) \rceil$, and $M_T$ is at most $\lceil \log_k(n) \rceil + k$ with high probability.

By construction, the height $height(T)$ of a geometric tree is less than or equal to its maximum rank $M_T$. Like zip trees, geometric trees *compress* their depth where ranks are missing or skipped (while we do not derive tight bounds on this compressed height, (Archibald et al., 2006) provide an expectation for the number of *distinct* ranks in a geometrically distributed random sample, which could yield slightly tighter bounds); as such, $E(height(T)) \le E(M_T) \le \lceil \log_k(n) \rceil + k$. Thus, the height of a geometric tree is in $O(\log n)$ with high probability. Figure 6 confirms our analysis with experimental data.

| | $n=100$ | $n=1,000$ | $n=10,000$ | $n=100,000$ |
|---|---|---|---|---|
| $k=2$ | 8.0 (3.48) | 11.3 (4.00) | 14.7 (3.88) | 17.9 (3.27) |
| $k=4$ | 4.2 (0.90) | 5.9 (0.98) | 7.5 (0.86) | 9.2 (0.92) |
| $k=16$ | 2.4 (0.30) | 3.2 (0.21) | 4.1 (0.25) | 4.9 (0.29) |
| $k=64$ | 1.8 (0.19) | 2.2 (0.19) | 3.0 (0.12) | 3.35 (0.24) |

| | $n=100$ | $n=1,000$ | $n=10,000$ | $n=100,000$ |
|---|---|---|---|---|
| $k=2$ | 6.7 (0.75) | 9.9 (0.75) | 13.2 (0.72) | 16.7 (0.73) |
| $k=4$ | 4.0 (0.51) | 5.6 (0.51) | 7.3 (0.44) | 9.0 (0.47) |
| $k=16$ | 2.3 (0.24) | 3.2 (0.18) | 4.1 (0.25) | 4.9 (0.24) |
| $k=64$ | 1.8 (0.18) | 2.2 (0.17) | 3.0 (0.11) | 3.3 (0.23) |

Figure 6: G-Tree Maximum Ranks

Average maximal ranks (first table) and heights in G-nodes (second table) of 200 randomly generated G-trees, for various combinations of $n$ and $k$. The numbers in parentheses give the variance.

Given the geometric distribution of (independent) ranks over input values $s∈S$, the expected number of values with rank $r$ is about $k$ times the expected number of values with rank $r+1$, i.e., $P(rank(s)>r∣rank(s)≥r)=q$. Recall that each G-node in a geometric tree contains a set of items with the same rank, and that the rank of the G-node is the maximum rank of its items. Thus, we can think of G-nodes within $T$ as disjoint subsets formed by splitting the (sorted) items of rank $r$ at values of rank $r+1$. In other words, the number of nodes with a given rank $r$ is a direct function of the number of G-nodes with rank $r+1$.

The above node layout implies that the size $|g|$ of a G-node is itself a random variable drawn from a geometric distribution, this time with *success* probability $\frac{1}{k}$ and support $[n] = \{1, \dots, n\}$. The expected value is roughly $k$, and the observed size is at most $c$ times the expected value with probability at least $1 - (1 - \frac{1}{k})^{ck}$. In other words, $P(|g| \ge ck) \le (1 - \frac{1}{k})^{ck}$. Setting $c = \lceil \log_k(n) \rceil$, and using $1 - \frac{1}{k} \le \exp(-\frac{1}{k})$, we have that

$$P(|g| \ge \lceil \log_k(n) \rceil k) \le \left(1 - \frac{1}{k}\right)^{\lceil \log_k(n) \rceil k} \le \exp\left(-\frac{1}{k}\right)^{\lceil \log_k(n) \rceil k} = \exp\left(-\frac{1}{k} \lceil \log_k(n) \rceil k\right),$$

which for large $n$, where $\lceil \log_k(n) \rceil \approx \log_k(n)$, gives

$$\exp\left(-\frac{1}{k} \log_k(n) k\right) = \exp(-\log_k(n)) = n^{-1/\ln(k)}.$$

This bound demonstrates that the probability that the size of any G-node deviates far from $k$ vanishes polynomially in $n$.

Figure 7 shows experimental measurements of average G-node sizes. For low $n$, nodes contain slightly too few items: items are spread out, and there are not enough items to “fill up” all nodes. The variances are small; average G-node sizes are indeed as stable as our analysis suggests.

| | $n=100$ | $n=1,000$ | $n=10,000$ | $n=100,000$ |
|---|---|---|---|---|
| $k=2$ | 1.9095 (0.0000334) | 1.9940 (0.0000162) | 1.9983 (0.0000006) | 1.9998 (0.0000000) |
| $k=4$ | 3.6374 (0.0000063) | 3.9516 (0.0000036) | 3.9942 (0.0000012) | 4.0002 (0.0000008) |
| $k=16$ | 12.1596 (0.0257947) | 15.2423 (0.0011409) | 15.9084 (0.0001009) | 15.9850 (0.0000016) |
| $k=64$ | 30.6534 (0.0817688) | 56.3768 (0.0023357) | 62.6815 (0.0219051) | 63.7859 (0.0235732) |

Figure 7: G-Node Average Sizes

Average G-node sizes of 200 randomly generated G-trees, for various combinations of $n$ and $k$. The numbers in parentheses give the variance.

The total expected number of G-nodes in a G-tree is intuitively $E(|T|) = \frac{n}{k} + 1$. Given that $q = \frac{1}{k}$, this can also be expressed as $nq + 1$. While this expectation is intuitive, it can be estimated more directly as the sum of the expected number of G-nodes at every possible rank (plus a root node). Since the number of G-nodes at rank $r$ is equal to the number of *items* having rank $r+1$, and since the rank assignments for items are independent, the number of G-nodes with rank $r$ in a G-tree can be modeled as a binomial random variable $X_r$ with parameters $n$ (the number of trials, i.e., the number of items) and $(1-q)q^{r}$ (the probability of success, i.e., the probability that an item has rank $r+1$). To see why, consider that a binomial random variable describes the number of successes in a fixed number of independent trials, where each trial has the same probability of success. Here each item is an independent trial, the “success” is the event that an item has rank $r+1$, and the probability of success is $(1-q)q^{r}$. Thus, the expected number of G-nodes at rank $r$ is $E(X_r) = n(1-q)q^{r}$, and by the linearity of expectation, the total expected number of G-nodes in a G-tree is a geometric series with sum

$$E(|T|) = \sum_{r=1}^{\infty} n(1-q)q^{r} + 1 = n(1-q)\sum_{r=1}^{\infty} q^{r} + 1,$$

and recognizing that for $|q| < 1$ the rightmost infinite sum simplifies to $\frac{q}{1-q}$, we get

$$E(|T|) = n(1-q)\frac{q}{1-q} + 1 = nq + 1.$$

To provide bounds, we can use the usual multiplicative form of a Chernoff bound for the sum of independent (but not identically distributed) Bernoulli random variables (Dubhashi & Panconesi, 2009) (again, you will probably want to look this up on Wikipedia rather than in a textbook). Then, for any $c > 0$,

$$P(|T| \ge (1+c)nq) \le \left(\frac{e^{c}}{(1+c)^{1+c}}\right)^{nq},$$

or perhaps more conveniently, for $0 < c < 1$ and recalling that $q = \frac{1}{k}$,

$$P\left(|T| > (1+c)\frac{n}{k}\right) \le \exp\left(\frac{-c^{2}n}{3k}\right).$$

This provides an exponentially small probability that the total number of G-nodes in a geometric tree is much more than $\frac{n}{k} + 1$, and thus the G-tree size is in $O(n)$. Figure 8 gives experimental measurements that confirm the analysis.
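
The series algebra for $E(|T|)$ can be checked numerically with a few lines of Python (our check, with arbitrary example values for $n$ and $k$):

```python
# Truncated version of sum_{r>=1} n(1-q) q^r + 1, which the geometric
# series argument says should equal n*q + 1.
n, k = 10_000, 4
q = 1 / k
series = n * (1 - q) * sum(q ** r for r in range(1, 200)) + 1
assert abs(series - (n * q + 1)) < 1e-6
```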

| | $n=100$ | $n=1,000$ | $n=10,000$ | $n=100,000$ |
|---|---|---|---|---|
| $k=2$ | 52.21 (25.081) | 502.5 (257.2) | 5002 (2394.5) | 50001 (26339) |
| $k=4$ | 27.51 (19.086) | 252.9 (191.70) | 2504 (1826.1) | 25004 (18939) |
| $k=16$ | 8.33 (6.716) | 65.46 (57.22) | 628.2 (572.4) | 6255.3 (6113.3) |
| $k=64$ | 3.29 (2.213) | 17.75 (15.25) | 159.16 (160.2) | 1564 (1486.8) |

Figure 8: G-Node Counts

Average number of G-nodes in 200 randomly generated G-trees, for various combinations of $n$ and $k$. The numbers in parentheses give the variance.

This analysis provides some intuition for the practical efficiency and scalability of G-trees: the expected number of G-nodes is linear in the number of items, and close to that of a balanced tree. Knowing the total number of nodes and their size distribution makes it possible to parameterize G-trees so that nodes align with disk block sizes, and to design caching and buffer-management strategies that keep tree operations I/O-efficient; it also helps in predicting performance and managing storage costs through efficient space allocation. Combined with the fixed size of the $k$-lists (see Section 4.3), this makes G-trees potentially well-suited for external memory models while retaining much of their simplicity in practice.

We now discuss how several well-known probabilistic data structures can be expressed as G-trees.

As we described in our derivation of the G-trees, instantiating G-trees with sorted linked lists yields the zip trees. Aside from mentioning zip trees here for the sake of completeness, we want to point out an interesting implementation detail: whereas zip trees store the rank of every item in its vertex, a sorted linked list G-tree only needs to store one rank per linked list that it contains, i.e., one rank per G-node.

Instantiation of G-trees requires a set data structure. G-trees *are* set data structures themselves. How about some recursion?

Instantiating G-trees with zip trees, i.e., with G-trees instantiated with sorted linked lists, yields exactly the zip-zip-trees. Zip-zip-trees (Gila et al., 2023) were initially introduced as a modification of the tie-breaking algorithm of zip trees. This modification is essentially an ad-hoc variation, whereas our description of zip-zip-trees highlights them as a (highly relevant) member of a family of G-trees obtained from recursive self-instantiation. In particular, once G-trees and sorted linked lists have been implemented, the difference between implementing zip trees and zip-zip-trees consists of a single type-level operation, whereas the algorithmic definition of zip-zip-trees requires manual adjustment of all tree manipulation algorithms.

More generally, we can instantiate G-trees with G-trees that are themselves recursively instantiated, to an arbitrary depth and choice of recursion anchor. Using sorted linked lists as recursion anchors yields a family of $zip^{k}$-trees whose first two members are the zip trees and the zip-zip-trees.

We end this section with a fun observation: if we restrict each nested tree structure in the $zip^{k}$-trees to store ranks in a single bit (i.e., we cap the geometric distribution at two), we obtain exactly the treaps with $k$-bit priorities. We doubt that this has any practical applications, but it shows that the family of recursively instantiated G-trees is interesting beyond just the zip-zip-trees.

Having shown that the G-trees encompass several useful and well-known data structures, we now give some members of the family that have *not* been independently described before. In particular, we tackle the problem of finding history-independent data structures that are efficient on secondary storage (or in terms of CPU caches) by storing up to $k$ items in a single vertex. Previous solutions (Bender et al., 2016; Golovin, 2009, 2010; Safavi & Seybold, 2023) all incur a significant overhead in terms of conceptual complexity compared to their binary counterparts. With G-trees, we merely need to change the underlying set data structure $S$ to one that stores items in blocks of $k$.

A key intuition behind classic zip trees is that of choosing ranks from a geometric distribution with $p = 1 - \frac{1}{2} = \frac{1}{2}$, because this is the distribution of the height of a randomly chosen vertex in a perfectly balanced binary tree. For $k$-ary trees, the same intuition instructs us to draw ranks from a geometric distribution with $p = 1 - \frac{1}{k}$, as this is the distribution of heights in a perfectly balanced $k$-ary tree. Our analysis confirms that, with high probability, the resulting trees are of logarithmic height and their G-nodes store $O(k)$ items.

The second key insight toward an efficient $k$-ary data structure is, paradoxically, that there is no need to be clever about it. Sorted linked lists are naïve, inefficient data structures, yet zip trees are efficient. We can be similarly naïve for our $k$-ary construction.

We use a sorted linked list in which every node stores up to $k$ items. We require all items to be stored as early in the list as possible; this is the simplest way of achieving history-independence. In other words, the only node that may store fewer than $k$ items is the final node. We call such a list a $k$-list (see Figure 9).
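
A $k$-list can be sketched in a couple of lines of Python (our illustration, representing the list simply as a list of lists rather than with pointers):

```python
def k_list(items, k):
    """Build a k-list: sorted items packed greedily into nodes of up to
    k items each, so that only the final node may be underfull."""
    items = sorted(items)
    return [items[i:i + k] for i in range(0, len(items), k)]
```

Because items are packed as early as possible, a given set has exactly one $k$-list representation, which is what makes the structure history-independent.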

Instantiating G-trees with the $k$-lists and a geometric distribution of $p = 1 - \frac{1}{k+1}$ yields a family of data structures we call the $k$-zip-trees. Figure 10 depicts the $2$-zip-tree for our running example set.

The $k$-lists are inefficient data structures — inserting or deleting the first item requires $O(n)$ time in a $k$-list of $n$ items. The $k$-zip-trees are nevertheless efficient, because the expected size of the G-nodes, and hence, the expected length of the $k$-lists, is constant for any $p$. More efficient alternatives to $k$-lists exist, such as unrolled linked lists (Shao et al., 1994). These are not history-independent, however.

What is the expected height of a $k$-zip-tree? Asymptotically, we know it to be in $O(\log n)$, since we have G-nodes of $O(\log n)$ different ranks, each a list of $O(k)$ items (which has a length in $O(k)$). Figure 11 gives experimental measures for some specific $n$ and $k$. The height amplification we report is the height of each tree divided by the height of a perfectly balanced $(k+1)$-ary tree on $n$ vertices.

| | $n=100$ | $n=1,000$ | $n=10,000$ | $n=100,000$ |
|---|---|---|---|---|
| $k=1$ | 15.29 (5.61) | 26.59 (8.89) | 38.20 (9.47) | 50.58 (11.86) |
| $k=3$ | 8.61 (2.40) | 14.92 (3.38) | 21.97 (4.41) | 29.13 (4.93) |
| $k=15$ | 4.28 (0.98) | 8.06 (2.04) | 12.36 (2.43) | 16.66 (2.45) |
| $k=63$ | 2.37 (0.24) | 5.05 (1.43) | 8.43 (1.69) | 12.08 (2.52) |

| | $n=100$ | $n=1,000$ | $n=10,000$ | $n=100,000$ |
|---|---|---|---|---|
| $k=1$ | 2.18 (0.11) | 2.66 (0.09) | 2.73 (0.05) | 2.98 (0.04) |
| $k=3$ | 2.15 (0.15) | 2.98 (0.14) | 3.14 (0.09) | 3.24 (0.06) |
| $k=15$ | 2.14 (0.24) | 2.69 (0.23) | 3.09 (0.15) | 3.33 (0.10) |
| $k=63$ | 1.18 (0.06) | 2.53 (0.36) | 2.81 (0.19) | 4.03 (0.28) |

Figure 11: $k$-Zip-Tree Height

The heights of 200 randomly generated G-trees, i.e., the maximal number of pointers to traverse from the root to a leaf, for various combinations of $n$ and $k$. The numbers in parentheses give the variance.

The second table normalizes the heights by the heights of perfectly-balanced $(k+1)$-ary trees on $n$ vertices.

Another concern of interest is space amplification: in the worst case, every item would be in its own $k$-list, resulting in a space amplification factor of $k$. Intuitively, however, this occurs only rarely: the expected size of each G-node is $k$, and the probability for a G-node to have size $s < k$ decreases exponentially in $k - s$. Any overfull G-node consists of one or more full linked-list vertices, plus exactly one underfull linked-list vertex. Hence, the overfull G-nodes contribute an amplification factor of at most two. Roughly half of the nodes will be overfull, so we can expect a space amplification around $1.5$. Figure 12 backs up this intuition with experimental data.

| | $n=100$ | $n=1,000$ | $n=10,000$ | $n=100,000$ |
|---|---|---|---|---|
| $k=1$ | 1 (0) | 1 (0) | 1 (0) | 1 (0) |
| $k=3$ | 1.338 (0.0055) | 1.302 (0.0006) | 1.298 (0.0000) | 1.297 (0.0000) |
| $k=15$ | 1.743 (0.0786) | 1.544 (0.0061) | 1.515 (0.0006) | 1.512 (0.0000) |
| $k=63$ | 2.431 (0.612) | 1.669 (0.0316) | 1.578 (0.0032) | 1.565 (0.0003) |

Figure 12: Space Amplification

Total number of item slots divided by total number of items in 200 randomly generated G-trees, for various combinations of $n$ and $k$. The numbers in parentheses give the variance.
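To make the slot accounting behind this argument concrete, here is a tiny sketch (our own illustration) of how many item slots a G-node of size $s$ occupies when stored as a $k$-list:

```rust
// Number of item slots a G-node of size s occupies when stored as a
// k-list: ceil(s / k) vertices of k slots each.
fn slots(s: usize, k: usize) -> usize {
    s.div_ceil(k) * k
}

fn main() {
    let k = 4;
    // A full G-node wastes no slots.
    assert_eq!(slots(4, k), 4);
    // An overfull G-node consists of full vertices plus one underfull
    // vertex, so its amplification stays below two.
    assert_eq!(slots(5, k), 8); // amplification 8/5 = 1.6
    assert_eq!(slots(9, k), 12); // amplification 12/9 ≈ 1.33
    // The worst case is a single item per G-node: amplification k.
    assert_eq!(slots(1, k), 4);
    println!("ok");
}
```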

Recursive instantiation of G-trees with other G-trees leads to the family of $zip^{k}$-trees when using linked lists (i.e., $1$-lists) as the recursion anchor. We can similarly use arbitrary $k$-lists as the recursion anchor to obtain the $k$-zip-trees, the $k$-zip-zip-trees, and so on.

Through some small adjustments, the definition of G-trees can be adapted to yield generalizations of skip-lists and history-independent analogues of B^{+}-trees (Comer, 1979). We sketch these extensions in Appendix B.

We now present asymptotically efficient algorithms for working with G-trees. To optimize for clarity of presentation, we describe purely functional algorithms: all functions return new values rather than mutating their inputs.

We explicitly distinguish between non-empty and possibly-empty G-trees and internal set data structures, to clarify which cases can arise in our algorithms. To reinforce those distinctions, our pseudocode comes with pseudotypes. The typing also aids in keeping straight the multiple levels of parametric polymorphism (“generics”). Finally, pseudotypes allow us to explicitly define the functions that must be available on $S$ for our algorithms to work.

```
/// Ranks are natural numbers, drawn from the geometric distribution.
type Rank = u64;

/// Interface for non-empty sets of pairs of items (of type I) and their
/// left subtrees. To be implemented by S.
trait NonemptySet<I>: Sized {
    /// Create a set containing a single item and its left subtree.
    fn singleton(item: I, subtree: GTree<I, Self>) -> Self;

    /// Insert an item and its left subtree; the item must be strictly
    /// less than any item in self.
    fn insert_min(&self, item: I, subtree: GTree<I, Self>) -> Self;

    /// Remove the least item and its left subtree from self.
    fn remove_min(&self) -> ((I, GTree<I, Self>), Set<I, Self>);

    /// Split self into the set of items and their left subtrees strictly
    /// less than key, the left subtree of key if key is an item in self,
    /// and the set of items and their left subtrees strictly greater
    /// than key.
    fn split(&self, key: &I) -> (Set<I, Self>, Option<GTree<I, Self>>, Set<I, Self>);

    /// Join self with a set whose items are all strictly greater.
    fn join(&self, greater: Self) -> Self;
}

/// A set, possibly empty. Parameterized over a type of non-empty sets.
enum Set<I, S> { Empty, NonEmpty(S) }

/// A G-node: a nonzero number of items and their left subtrees, a rank,
/// and the one right subtree.
struct GTreeNode<I, S> { rank: Rank, set: S, right: GTree<I, S> }

/// A G-tree, possibly empty.
enum GTree<I, S> { Empty, NonEmpty(GTreeNode<I, S>) }

/// An optional value.
enum Option<V> { Some(V), None }
```

The functions we require of a NonemptySet are standard functions that are easily implemented in $O(\log n)$ time with typical set data structures (not that we need an efficient implementation: since G-nodes have constant expected size, even $O(n)$ implementations do not hurt the asymptotic efficiency of our algorithms). The only choice worth commenting on is that of using a specialized `insert_min` function instead of a generic `insert` function. This choice allows for more efficient, specialized implementations, and it highlights that our algorithms interact with the internal set data structures in a surprisingly constrained manner.
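As one possible concrete realization of the set interface required of $S$ (a sketch under our own naming, not the paper's API), the operations can be implemented on a sorted `Vec`; splitting then amounts to a binary search. Subtrees are kept generic, and the unit type stands in for them below:

```rust
// Sketch of the set operations over a sorted Vec of (item, left subtree)
// pairs. The vector is never empty and is sorted by item.
#[derive(Clone, Debug, PartialEq)]
struct VecSet<T> {
    entries: Vec<(u64, T)>,
}

impl<T: Clone> VecSet<T> {
    // Create a set containing a single item and its left subtree.
    fn singleton(item: u64, subtree: T) -> Self {
        VecSet { entries: vec![(item, subtree)] }
    }

    // Insert an item strictly less than all present items.
    fn insert_min(&self, item: u64, subtree: T) -> Self {
        let mut entries = vec![(item, subtree)];
        entries.extend(self.entries.iter().cloned());
        VecSet { entries }
    }

    // Split into the entries strictly less than key, the left subtree of
    // key if key is present, and the entries strictly greater than key.
    fn split(&self, key: u64) -> (Vec<(u64, T)>, Option<T>, Vec<(u64, T)>) {
        match self.entries.binary_search_by_key(&key, |(item, _)| *item) {
            Ok(i) => (
                self.entries[..i].to_vec(),
                Some(self.entries[i].1.clone()),
                self.entries[i + 1..].to_vec(),
            ),
            Err(i) => (self.entries[..i].to_vec(), None, self.entries[i..].to_vec()),
        }
    }
}

fn main() {
    let s = VecSet::singleton(20, ()).insert_min(10, ());
    let (less, at, greater) = s.split(15);
    assert_eq!(less, vec![(10, ())]);
    assert_eq!(at, None);
    assert_eq!(greater, vec![(20, ())]);
    assert_eq!(s.split(20).1, Some(()));
    println!("ok");
}
```

A `Vec`-based implementation takes $O(n)$ time per operation, which, as noted above, is harmless here because G-nodes have constant expected size.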

Before giving our main algorithms, we define some helper functions:

```
/// Insert an item and its left subtree into a possibly empty set.
/// The new item must be strictly less than any item in set.
fn set_insert_min<I, S>(set: Set<I, S>, item: I, subtree: GTree<I, S>) -> S {
    match set {
        Set::Empty => S::singleton(item, subtree),
        Set::NonEmpty(s) => s.insert_min(item, subtree),
    }
}

/// Replace the leftmost left subtree of a GTreeNode.
fn update_leftmost<I, S>(node: GTreeNode<I, S>, subtree: GTree<I, S>) -> GTreeNode<I, S> {
    let ((item, _), rest) = node.set.remove_min();
    GTreeNode { rank: node.rank, set: set_insert_min(rest, item, subtree), right: node.right }
}

/// Replace the right subtree of a GTreeNode.
fn update_right<I, S>(node: GTreeNode<I, S>, subtree: GTree<I, S>) -> GTreeNode<I, S> {
    GTreeNode { rank: node.rank, set: node.set, right: subtree }
}
```

Recall that a (non-empty) GTree has a root GTreeNode that consists of a rank, a right subtree, and a non-empty set of pairs of items and their left subtrees.

Occasionally, we need to construct a non-empty GTree from a rank, a right subtree, and a *possibly empty* set of pairs of items and their left subtrees. If the set is empty, the lifted GTree is simply the supplied right subtree.

```
/// Construct a GTree from a rank, a possibly empty set, and a right subtree.
fn lift<I, S>(rank: Rank, set: Set<I, S>, right: GTree<I, S>) -> GTree<I, S> {
    match set {
        Set::Empty => right,
        Set::NonEmpty(s) => GTree::NonEmpty(GTreeNode { rank, set: s, right }),
    }
}
```

While it is possible to derive insertion and deletion algorithms by generalizing the original zip-tree algorithms (see Appendix C for examples), we opt for a fully self-contained presentation here. Zipping and unzipping are similar to the join and split functions of join-based tree algorithms (Blelloch et al., 2016). We consistently use zip and unzip terminology when operating on G-trees, and join and split terminology when operating on the underlying set data structure $S$.

We build our insertion and deletion algorithms from algorithms for unzipping and zipping G-trees. Unzipping takes a key and splits a G-tree into the tree of all items less than the key and the tree of all items greater than the key (the key itself occurs in neither returned tree, even if it is part of the input tree). Zipping takes two trees, the first containing only items strictly less than any item of the second, and joins them together into a single tree. We call this version of zipping `zip2`, because it takes *two* arguments. We also use a `zip3` function, which additionally incorporates a single item into the tree that is strictly greater than all items of the first tree, and strictly less than all items of the second tree. Insertion and deletion can then be implemented as compositions of unzipping and zipping (`zip3` and `zip2`, respectively) — see Figure 13.

To unzip a G-tree, split the inner set of the root GTreeNode, and then perform a case distinction (described more thoroughly in the code comments) to determine whether it is necessary to recurse:

```
fn unzip<I, S>(t: GTree<I, S>, key: &I) -> (GTree<I, S>, GTree<I, S>) {
    match t {
        // Unzipping the empty tree is trivial.
        GTree::Empty => (GTree::Empty, GTree::Empty),
        // For non-empty trees, split the inner set.
        GTree::NonEmpty(s) => match s.set.split(key) {
            // If s.set contains the split point (key), then everything up
            // until the split point becomes the left return value, with the
            // left subtree of the split point becoming the right subtree of
            // the left return. Everything after the split point becomes the
            // right return, with the right subtree of the current node
            // becoming the right subtree of the right return.
            // No further recursion.
            (left_set, Some(left_subtree_of_key), right_set) => (
                lift(s.rank, left_set, left_subtree_of_key),
                lift(s.rank, right_set, s.right),
            ),
            // If s.set does not contain the split point, and all its items
            // are less than the split point, then recursively split the
            // right subtree of s.
            (left_set, None, Set::Empty) => {
                let (right_left, right_right) = unzip(s.right, key);
                (lift(s.rank, left_set, right_left), right_right)
            }
            // If s.set does not contain the split point, but it does contain
            // items greater than the split point, then recursively split the
            // leftmost left subtree of those greater items.
            (left_set, None, Set::NonEmpty(greater)) => {
                let ((item, subtree), rest) = greater.remove_min();
                let (left, right) = unzip(subtree, key);
                (
                    lift(s.rank, left_set, left),
                    GTree::NonEmpty(GTreeNode {
                        rank: s.rank,
                        set: set_insert_min(rest, item, right),
                        right: s.right,
                    }),
                )
            }
        },
    }
}
```

Since G-nodes have constant expected size, we can treat all functions of the NonemptySet interface as running in constant time. Each recursive call of the unzip function is to a GTreeNode of strictly decreasing rank, so the recursion depth is bounded by the height of the G-tree, which is logarithmic with high probability. Hence, the overall running time is in $O(\log n)$ with high probability.

We can similarly give a recursive algorithm (again, the main description takes the form of code comments) for zipping together two G-trees, with the same complexity properties:

```
fn zip2<I, S>(left: GTree<I, S>, right: GTree<I, S>) -> GTree<I, S> {
    match (left, right) {
        // If either tree is empty, then the join is the other tree.
        (GTree::Empty, t) | (t, GTree::Empty) => t,
        // Neither tree is empty, so there is real work to do. What that
        // entails depends on how the ranks of the root nodes of the two
        // trees compare.
        (GTree::NonEmpty(l), GTree::NonEmpty(r)) => {
            if l.rank > r.rank {
                // The left root is the higher node: zip the right tree
                // into the right subtree of the left root.
                let right_subtree = zip2(l.right, GTree::NonEmpty(r));
                GTree::NonEmpty(update_right(l, right_subtree))
            } else if l.rank < r.rank {
                // The right root is the higher node: zip the left tree
                // into the leftmost left subtree of the right root.
                let ((item, subtree), rest) = r.set.remove_min();
                let leftmost = zip2(GTree::NonEmpty(l), subtree);
                GTree::NonEmpty(GTreeNode {
                    rank: r.rank,
                    set: set_insert_min(rest, item, leftmost),
                    right: r.right,
                })
            } else {
                // Equal ranks. Join the two inner sets, with the right
                // subtree of the left node being zipped into the leftmost
                // left subtree of the right node.
                let ((item, subtree), rest) = r.set.remove_min();
                let leftmost = zip2(l.right, subtree);
                GTree::NonEmpty(GTreeNode {
                    rank: l.rank,
                    set: l.set.join(set_insert_min(rest, item, leftmost)),
                    right: r.right,
                })
            }
        }
    }
}
```

To implement zipping of two G-trees with an additional item in between them, we can simply convert that item into a singleton G-tree, and then call zip2 twice:

```
/// Zip a left tree, a single item (with its rank), and a right tree.
fn zip3<I, S>(left: GTree<I, S>, item: I, rank: Rank, right: GTree<I, S>) -> GTree<I, S> {
    let singleton = GTree::NonEmpty(GTreeNode {
        rank,
        set: S::singleton(item, GTree::Empty),
        right: GTree::Empty,
    });
    zip2(zip2(left, singleton), right)
}
```

With unzip, zip2, and zip3 in place, insertion and deletion in expected logarithmic time become trivial:

```
/// Insert an item (and its rank) into a G-tree.
fn insert<I, S>(t: GTree<I, S>, item: I, rank: Rank) -> GTree<I, S> {
    let (left, right) = unzip(t, &item);
    zip3(left, item, rank, right)
}

/// Delete an item from a G-tree.
fn delete<I, S>(t: GTree<I, S>, key: &I) -> GTree<I, S> {
    let (left, right) = unzip(t, key);
    zip2(left, right)
}
```
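To see the whole pipeline run, here is a self-contained Rust sketch of the binary special case ($k=1$, i.e., a zip-tree) built from unzipping and zipping. All names (`rank_of`, `zip2`, etc.), the splitmix64-style hash, and the tie-breaking convention (lesser items win rank ties) are our own illustrative choices, not the paper's code. The final assertion checks history-independence: two different histories that yield the same set produce structurally identical trees.

```rust
#[derive(Debug, PartialEq)]
enum Tree {
    Empty,
    Node { rank: u32, item: u64, left: Box<Tree>, right: Box<Tree> },
}
use Tree::{Empty, Node};

// Pseudorandom rank of an item: trailing zeros of a mixed hash,
// i.e., geometric with p = 1/2 (the k = 1 case).
fn rank_of(item: u64) -> u32 {
    let mut x = item.wrapping_add(0x9E3779B97F4A7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D049BB133111EB);
    (x ^ (x >> 31)).trailing_zeros()
}

// Split a tree into the items < key and the items > key.
fn unzip(t: Tree, key: u64) -> (Tree, Tree) {
    match t {
        Empty => (Empty, Empty),
        Node { rank, item, left, right } => {
            if item == key {
                (*left, *right)
            } else if item < key {
                let (rl, rr) = unzip(*right, key);
                (Node { rank, item, left, right: Box::new(rl) }, rr)
            } else {
                let (ll, lr) = unzip(*left, key);
                (ll, Node { rank, item, left: Box::new(lr), right })
            }
        }
    }
}

// Join two trees; every item of `l` must be less than every item of `r`.
// Rank ties are broken so that the lesser item becomes the ancestor.
fn zip2(l: Tree, r: Tree) -> Tree {
    match (l, r) {
        (Empty, t) | (t, Empty) => t,
        (
            Node { rank: la, item: li, left: ll, right: lright },
            Node { rank: ra, item: ri, left: rleft, right: rright },
        ) => {
            if la >= ra {
                let right = zip2(*lright, Node { rank: ra, item: ri, left: rleft, right: rright });
                Node { rank: la, item: li, left: ll, right: Box::new(right) }
            } else {
                let left = zip2(Node { rank: la, item: li, left: ll, right: lright }, *rleft);
                Node { rank: ra, item: ri, left: Box::new(left), right: rright }
            }
        }
    }
}

fn insert(t: Tree, item: u64) -> Tree {
    let (l, r) = unzip(t, item);
    // zip3: zip the item, as a singleton tree, between the two halves.
    let single = Node { rank: rank_of(item), item, left: Box::new(Empty), right: Box::new(Empty) };
    zip2(zip2(l, single), r)
}

fn delete(t: Tree, item: u64) -> Tree {
    let (l, r) = unzip(t, item);
    zip2(l, r)
}

fn main() {
    // Different histories, same final set {1, 2, 3, 4, 5, 6}.
    let mut a = Empty;
    for x in [3, 1, 4, 1, 5, 9, 2, 6] {
        a = insert(a, x);
    }
    a = delete(a, 9);
    let mut b = Empty;
    for x in [6, 2, 5, 4, 1, 3] {
        b = insert(b, x);
    }
    assert_eq!(a, b);
    println!("identical trees: ok");
}
```

Note how `delete` never inspects ranks: unzipping discards the item and zipping reassembles the remainder.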

We want to emphasize that our choice of algorithms optimizes for elegance, not for (non-asymptotic) efficiency. Implementations based on in-place mutations will outperform our immutable algorithms. A direct implementation of zip3 will outperform the reduction to two applications of zip2. Iterative implementations might outperform recursive implementations. And finally, direct implementations of insertion and deletion should outperform those based on unzipping and zipping the full trees. The original zip-tree paper (Tarjan et al., 2021) contains examples of direct algorithms that zip and unzip only parts of a tree; our algorithms can be adapted to work analogously, and we provide variants for G-trees in Appendix C.

We have generalized the zip trees to the rich family of G-trees. G-trees must be instantiated with a concrete set data structure; using a linked list yields the zip trees, and direct recursive self-instantiation yields the zip-zip-trees. All G-trees are history-independent if the underlying set data structure is history-independent, and admit the same efficient algorithms for mutating them. We have proven the G-trees to be similar to perfectly balanced search trees with high probability, a property that makes them asymptotically and practically efficient.

Beyond merely generalizing existing data structures, we have defined the $k$-zip-trees, a conceptually simple family of history-independent data structures that store up to $k$ items in a single vertex. Such data structures make more efficient use of hardware caches than binary trees, allowing the $k$-zip-trees to outperform the binary zip trees.

We have published the code powering our experiments on GitHub (under the MIT license). The implementation is written in Rust and can be built with the cargo package manager. Run `cargo run --bin stats` to replicate the experiments for Figure 6, Figure 7, Figure 8, Figure 11, and Figure 12. Run `cargo bench` to recreate the benchmark for Figure 1.

We now sketch several variants of G-trees that might be useful in practice. Highlights include a G-tree-analogon of the B^{+}-tree (Comer, 1979), and a cache-efficient generalization of the skip-list.

When search trees store not only keys but key-value pairs, some use cases benefit from storing all key-value pairs in leaves of the tree. To support efficient lookup of the leaves, the inner vertices of the tree store duplicates of certain keys. We can easily adapt G-trees to this behavior by placing items not only in the tree layer corresponding to their rank but also on all lower layers. Figure 14 depicts such a naïve leafy G-tree, and Figure 15 and Figure 16 show concrete instantiations with a 1-list (a naïve leafy zip-tree) and a 2-list (a naïve leafy 2-zip-tree) respectively. We arbitrarily choose to sort items of lower rank to the right of their copies of higher rank (spoiler: the ordering is arbitrary *in principle*, but our choice will surface a deep connection to skip-lists in a few paragraphs).

The figures clearly show a deficiency of this naïve approach: almost all items have a single pointer to the layer below, except for the first item of each rank, which needs two pointers. We can restore uniformity by adding a dummy element $⊥$ at the start of each layer. Figure 17 gives an example of the resulting (proper) leafy G-tree; Figure 18 shows a leafy zip-tree, and Figure 19 shows a leafy 2-zip-tree.

The reader familiar with skip-lists should immediately see that Figure 18 — the rendering of a leafy zip-tree — essentially shows a skip-list, except that some of the edges within the same layer are missing. Figure 21 shows a proper skip-list with the same items and ranks for comparison. Characterizing the “missing” edges from the perspective of leafy G-trees is trivial: there are no edges between vertices of equal rank that belong to different G-nodes.

Characterizing the missing edges from the perspective of skip-lists is quite instructive. Consider the algorithm for searching in a skip-list: start in the topmost layer, follow pointers within a layer until overshooting the search target, drop down a layer when you would overshoot. The “missing” edges are exactly the edges that are always guaranteed to overshoot — if they did not overshoot, the search would never have descended into this part of the skip-list in the first place. Search in a leafy zip-tree encounters a null pointer at the end of each G-node, whereas search in a skip-list blindly dereferences a pointer and performs a comparison whose result is already predetermined.

As far as we are aware, this observation of two classes of next-pointers in skip-lists — those that always overshoot in a search, and those which might not overshoot in some searches — is novel. The fact that skip-lists get to conflate the two classes and algorithmically handle them in a uniform way might well be the underlying reason why skip-lists are so appealing compared to binary search trees.

We can easily augment the leafy G-trees to generalize the skip-lists by adding pointers between successive G-nodes of equal rank. Figure 20 shows the resulting linked leafy G-tree for our running example. Figure 21 instantiates with a linked list, obtaining a linked leafy zip-tree, i.e., a skip-list. Figure 22 shows the instantiation with a 2-list, yielding a linked leafy 2-zip-tree. Instantiating the linked leafy zip-trees with $k$-lists gives a family of generalizations of skip-lists that store $k$ items per vertex. Finding such a family could easily be reason for a dedicated publication, were it not for the fact that the powerful framework of geometric trees gives us this family essentially for free.

To conclude, we point out that linking only the *leaves* of a leafy G-tree yields a family analogous to the B^{+}-trees (Comer, 1979). Figure 23 shows such a G^{+}-tree; Figure 24 shows an instantiation with a linked list (a zip^{+}-tree), and Figure 25 shows an instantiation with a 2-list (a 2-zip^{+}-tree).

As alluded to in Section 5, it is possible to derive direct insertion and deletion functions for G-trees that are analogous to Algorithm 1 from the original zip-tree paper. For completeness, we provide pseudocode for purely functional variants of these functions for G-trees. The functions leverage the previously defined zip2 and various helper functions.

We also define an additional helper function, which joins a possibly empty Set with a greater NonemptySet:

```
/// Join a (possibly empty) Set with a greater NonemptySet.
fn join_sets<I, S>(left: Set<I, S>, right: S) -> S {
    match left {
        Set::Empty => right,
        Set::NonEmpty(l) => l.join(right),
    }
}
```

The following explicit delete pseudocode follows a structure very similar to that of Algorithm 1 from Tarjan et al. (2021), though we use pattern matching here for consistency.

```
fn delete_explicit<I, S>(t: GTree<I, S>, key: &I) -> GTree<I, S> {
    match t {
        // Deletion in the empty tree is trivial.
        GTree::Empty => GTree::Empty,
        // For non-empty trees, split the inner set.
        GTree::NonEmpty(s) => match s.set.split(key) {
            // The target item was found. Simply exclude it and zip its left
            // subtree together with the subtree immediately to its right.
            (left_set, Some(left_subtree), right_set) => match right_set {
                // No greater items in this G-node: the neighboring subtree
                // is the right subtree of the node itself.
                Set::Empty => lift(s.rank, left_set, zip2(left_subtree, s.right)),
                // Otherwise, the neighboring subtree is the leftmost left
                // subtree of the greater items.
                Set::NonEmpty(greater) => {
                    let ((item, subtree), rest) = greater.remove_min();
                    let zipped = zip2(left_subtree, subtree);
                    lift(
                        s.rank,
                        Set::NonEmpty(join_sets(left_set, set_insert_min(rest, item, zipped))),
                        s.right,
                    )
                }
            },
            // The target is greater than all items of the G-node: recurse
            // into the right subtree.
            (left_set, None, Set::Empty) => {
                lift(s.rank, left_set, delete_explicit(s.right, key))
            }
            // The target is strictly less than the leftmost of the greater
            // items: recurse into that item's left subtree.
            (left_set, None, Set::NonEmpty(greater)) => {
                let ((item, subtree), rest) = greater.remove_min();
                let new_set = set_insert_min(rest, item, delete_explicit(subtree, key));
                lift(s.rank, Set::NonEmpty(join_sets(left_set, new_set)), s.right)
            }
        },
    }
}
```

While the explicit insert function requires a bit more code, it also follows the same *general structure* as its original zip-tree counterpart. The bulk of the additional code is devoted to operations on the inner set, whereas the matched patterns remain mostly equivalent.

```
fn insert_explicit<I, S>(t: GTree<I, S>, item: I, rank: Rank) -> GTree<I, S> {
    match t {
        // Insertion into the empty tree is trivial.
        GTree::Empty => GTree::NonEmpty(GTreeNode {
            rank,
            set: S::singleton(item, GTree::Empty),
            right: GTree::Empty,
        }),
        GTree::NonEmpty(s) => {
            if rank > s.rank {
                // The new item outranks the root: it becomes the new root,
                // and the old tree is unzipped around it.
                let (left, right) = unzip(GTree::NonEmpty(s), &item);
                GTree::NonEmpty(GTreeNode { rank, set: S::singleton(item, left), right })
            } else if rank == s.rank {
                // The new item belongs into this G-node.
                // For non-empty trees, split the inner set.
                match s.set.split(&item) {
                    // Found the item, nothing more to do.
                    (_, Some(_), _) => GTree::NonEmpty(s),
                    // The item is greater than all items of the G-node:
                    // unzip the right subtree around it.
                    (left_set, None, Set::Empty) => {
                        let (l, r) = unzip(s.right, &item);
                        GTree::NonEmpty(GTreeNode {
                            rank: s.rank,
                            set: join_sets(left_set, S::singleton(item, l)),
                            right: r,
                        })
                    }
                    // Otherwise, unzip the left subtree of the least
                    // greater item around the new item.
                    (left_set, None, Set::NonEmpty(greater)) => {
                        let ((g_item, g_subtree), rest) = greater.remove_min();
                        let (l, r) = unzip(g_subtree, &item);
                        let tail = set_insert_min(rest, g_item, r).insert_min(item, l);
                        GTree::NonEmpty(GTreeNode {
                            rank: s.rank,
                            set: join_sets(left_set, tail),
                            right: s.right,
                        })
                    }
                }
            } else {
                // The new item has a lower rank: recurse into the subtree
                // in which it belongs.
                match s.set.split(&item) {
                    // Found the item, nothing more to do.
                    (_, Some(_), _) => GTree::NonEmpty(s),
                    // The item is greater than all items of the G-node:
                    // recurse into the right subtree.
                    (left_set, None, Set::Empty) => {
                        lift(s.rank, left_set, insert_explicit(s.right, item, rank))
                    }
                    // Otherwise, recurse into the left subtree of the
                    // least greater item.
                    (left_set, None, Set::NonEmpty(greater)) => {
                        let ((g_item, g_subtree), rest) = greater.remove_min();
                        let new_sub = insert_explicit(g_subtree, item, rank);
                        lift(
                            s.rank,
                            Set::NonEmpty(join_sets(left_set, set_insert_min(rest, g_item, new_sub))),
                            s.right,
                        )
                    }
                }
            }
        }
    }
}
```