Open Sourced: A tale of two Priority Queues

One of my favorite data structures is the binary heap. I first learned about it in my data structures class, and remember marveling at its simplicity and elegance. How could something so simple be so powerful and useful?

For those unfamiliar with the binary heap, it is a relatively simple data structure. It is a binary tree, which means each node has at most 2 children, and on top of this, several constraints are imposed. The first is that the children of each node must have a value greater than or equal to the node's own. This gives us a loose ordering, where nodes closer to the root have lower values. The second is that the tree must be complete: every level of the tree except the last must be full, and the last level is filled left-to-right. As described, this is a Min-Heap, since the minimum value is always at the root. If we reverse the comparison, so that children must be less than or equal to their parent, we have a Max-Heap with the maximum value at the root.

One of the primary uses of a Min/Max Heap is as a Priority Queue. If nodes are inserted keyed on their "priority", then the highest-priority node is always at the root.

Below is an example of a Min-Heap, where each node represents a person's age and name. We can see that there is a rough ordering, with the youngest person at the root.

What makes binary heaps very interesting are their performance characteristics. In the case of a Min-Heap, we can find the minimum element in O(1), which is useful for implementing a Priority Queue. We can also insert and remove nodes in O(log n). Insertion is done by adding the new node at the bottom of the tree (remember, it must be filled left-to-right), and then repeatedly swapping it with its parent while it has the lower value. Since this is a complete binary tree, the height is at most log n, so we do at most log n swaps per insert. Deletion is done by moving the last node to the root, and repeatedly swapping it with whichever child has the lower value. Again, we do at most log n swaps down any given path of the tree.
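
To make those two operations concrete, here is roughly what the "sift up" and "sift down" loops look like in C. This is a minimal sketch: it assumes the array layout described later in the post (the parent of index i lives at (i - 1) / 2), and the heap_entry struct is an illustrative name, not the exact type from my library.

    #include <stddef.h>

    typedef struct {
        int   key;   /* the priority */
        void *value; /* associated payload */
    } heap_entry;

    static void swap(heap_entry *a, heap_entry *b) {
        heap_entry tmp = *a; *a = *b; *b = tmp;
    }

    /* After appending a new entry at index i, restore the heap
     * property by swapping with the parent while it is smaller. */
    static void sift_up(heap_entry *h, size_t i) {
        while (i > 0) {
            size_t parent = (i - 1) / 2;
            if (h[i].key >= h[parent].key) break;
            swap(&h[i], &h[parent]);
            i = parent;
        }
    }

    /* After moving the last entry to the root, restore the heap
     * property by swapping with the smaller child as needed. */
    static void sift_down(heap_entry *h, size_t n, size_t i) {
        for (;;) {
            size_t left = 2 * i + 1, right = left + 1, min = i;
            if (left  < n && h[left].key  < h[min].key) min = left;
            if (right < n && h[right].key < h[min].key) min = right;
            if (min == i) break;
            swap(&h[i], &h[min]);
            i = min;
        }
    }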

This property means that a binary heap can be used to perform sorting in O(n log n), which is competitive with the best comparison sorts (a sketch follows the list below). So in summary, binary heaps give us:

  • O(1) time for the min/max value
  • O(log n) insertion / deletion time
  • O(n log n) as a sorting implementation
  • Dead simple implementation (Ever tried an AVL tree?)
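
Tying those properties together, sorting with a heap is essentially "build the heap, then repeatedly extract the root". Here is a rough in-place sketch reusing the sift helpers from above; with a Min-Heap the array comes out in descending order (reverse the comparisons for ascending):

    /* In-place heapsort: n sift_up calls to build the heap,
     * then n extractions, each O(log n), for O(n log n) total. */
    void heap_sort(heap_entry *h, size_t n) {
        /* Build the heap: sift each element up as if freshly inserted. */
        for (size_t i = 1; i < n; i++)
            sift_up(h, i);
        /* Repeatedly move the minimum to the end and shrink the heap. */
        for (size_t end = n; end > 1; end--) {
            swap(&h[0], &h[end - 1]);
            sift_down(h, end - 1, 0);
        }
    }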

Okay, so at this point we know that binary heaps are pretty cool. But it gets better. If we implemented a binary heap as an actual tree, each node would need pointers to its children and parent, wasting several words per node. However, because binary heaps must be complete, they can be implemented as an array. The array is laid out as a breadth-first pass of the tree: the root node first, then all the nodes at depth 1, then all the nodes at depth 2, and so on. This means new nodes always go at the end of the array, since they fill the final depth left-to-right.
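
Concretely, with the root at index 0, the index math means no pointers are needed at all. Something like:

    /* Array layout of a complete binary tree, root at index 0.
     * These formulas are why the heap needs no parent/child pointers. */
    #define PARENT(i) (((i) - 1) / 2)
    #define LEFT(i)   (2 * (i) + 1)
    #define RIGHT(i)  (2 * (i) + 2)

    /* Example: index 4 is the second node at depth 2.
     * PARENT(4) == 1, LEFT(4) == 9, RIGHT(4) == 10. */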

I needed a quick refresher on C, so I decided to implement a Min-Heap in C. The results can be found on Github here. It is a simple C library that implements the Min-Heap using an array, and supports resizing (growing and shrinking) by doubling or halving the array as more or less space is needed. What bothered me about this implementation was the wasted memory: because the array grows simply by doubling, I may only need a few more nodes but now have allocated space for millions. In addition, some operations require an expensive copy that runs in time proportional to the length of the array. I wanted to see if there were any clever tricks that could fix this.
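
The growth policy is the usual one for dynamic arrays. A simplified sketch of what a doubling resize looks like (not the exact code from the library, and assuming a non-zero starting capacity):

    #include <stdlib.h>

    /* Grow the backing array by doubling. realloc() may have to
     * allocate a fresh block and copy every existing entry, which
     * is the O(n) cost (and the memory spike) discussed above. */
    static int heap_grow(heap_entry **entries, size_t *capacity) {
        size_t new_cap = *capacity * 2;
        heap_entry *tmp = realloc(*entries, new_cap * sizeof(heap_entry));
        if (tmp == NULL) return -1; /* out of memory */
        *entries  = tmp;
        *capacity = new_cap;
        return 0;
    }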

What I decided to do was make use of a little indirection. Instead of a plain array, I use a small table which points to other arrays that hold the actual entries. It looks something like this:

The idea is to make use of low-level virtual memory APIs to manage the memory used for the heap. Modern operating systems all operate on virtual memory, which is basically an abstraction over physical memory. Every process is given a memory address space that starts at 0 and runs to 2^32 or 2^64 depending on the CPU architecture, even though most systems do not have that much memory available. The address space is "virtual" because it does not directly correspond to physical memory; in fact, most of the address space is not mapped to RAM at all, and addressing it will cause an exception (the infamous segmentation fault). Instead, memory is divided into "pages", smaller chunks usually 4K in size, and pages are mapped to physical memory as needed. This is done through low-level system calls such as brk() and mmap() (library functions like malloc() are built on top of these). These calls instruct the kernel to back additional pages of virtual memory with physical memory, so that the program can safely make use of them without crashing.
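
As a concrete illustration, here is roughly how a program asks the kernel for fresh pages on a POSIX system. A minimal sketch; map_pages() is my own illustrative name:

    #include <sys/mman.h>
    #include <unistd.h>

    /* Ask the kernel to map `pages` fresh, zeroed pages of
     * anonymous memory into our address space. */
    static void *map_pages(size_t pages) {
        size_t len = pages * (size_t)sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* And munmap(p, len) returns the pages to the kernel. */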

We can make use of the same mechanisms to implement our heap more efficiently. So, I created a separate GitHub project to do just that. Now, instead of allocating a large array, we initially map in a single page to use as our mapping table, and a single page to hold our heap entries. Each node in the heap requires 2 words (8 bytes on a 32-bit system, or 16 on a 64-bit system). Since we have 4K pages, we can store 512 or 256 nodes per page, depending on 32/64 bit. Our mapping table needs a single word to reference each page, so it can hold 1024 or 512 references per page.
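
With that layout, finding heap entry i is just a divide and a modulo through the mapping table. A minimal sketch, assuming the illustrative types below (my actual code differs in the details):

    #define PAGE_SIZE      4096
    #define NODES_PER_PAGE (PAGE_SIZE / sizeof(heap_entry))

    typedef struct {
        heap_entry **pages;   /* the mapping table: one pointer per page */
        size_t num_pages;     /* pages currently mapped */
        size_t count;         /* entries in the heap */
    } page_heap;

    /* Translate a logical heap index into a page + offset.
     * The extra dereference here is the indirection discussed below. */
    static heap_entry *entry_at(page_heap *h, size_t i) {
        return &h->pages[i / NODES_PER_PAGE][i % NODES_PER_PAGE];
    }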

The major benefit of this new architecture is that resizing is now extremely cheap, and memory utilization is much better. Previously, when we needed more space we would allocate a whole new array and copy over all the previous entries, which takes O(n) time. Now, we just map in a single page of memory and add a reference to it in our mapping table. We have to be careful to resize the mapping table if it ever runs out of space, but that is cheap due to its small size. Because we only add new pages on an as-needed basis, the most space we waste is about a single page, or 4K, as opposed to potentially wasting hundreds of megabytes or gigabytes. Similarly, when we downsize, we just unmap pages and avoid the copying that the array implementation had to do.
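
Growing and shrinking then reduce to mapping or unmapping one page at a time. A sketch using the map_pages() helper and page_heap type from earlier (again illustrative, with PAGE_SIZE assumed to match the 4K system page size):

    /* Add capacity for NODES_PER_PAGE more entries by mapping a
     * single new page: O(1), no copying of existing entries. */
    static int page_heap_grow(page_heap *h) {
        heap_entry *page = map_pages(1);
        if (page == NULL) return -1;
        /* Assumes the mapping table still has a free slot; a real
         * implementation grows the (small) table when it fills up. */
        h->pages[h->num_pages++] = page;
        return 0;
    }

    /* Shrink by returning the last page to the kernel. */
    static void page_heap_shrink(page_heap *h) {
        if (h->num_pages == 0) return;
        munmap(h->pages[--h->num_pages], PAGE_SIZE);
    }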

One thing that is unclear is the performance impact of our change. It is simple to see that by adding the indirection table, every access of a heap entry now requires following an extra pointer. This is disconcerting because the speed of most applications is bounded by memory access rather than CPU speed: modern CPUs have far outstripped memory speed, and memory latency has remained roughly constant. As an example, most instructions (add, shift, multiply) take only a few cycles to execute, while a memory access that misses the cache might cost over 300 CPU cycles. In this context, we might worry that the indirection table interferes with cache locality and introduces a dramatic slowdown. To test this, I decided to perform a simple benchmark. I took both implementations and timed them using the provided test program with input sizes of 1, 5, 10, 50, and 100 million nodes using this script. The test generates a pseudo-random set of keys, inserts them into the Min-Heap, and then extracts them, checking that the order is correct. I averaged the results of 3 runs per input size, and graphed the averages:

Okay, so what do these numbers tell us? It is important to note that we are graphing on a logarithmic scale, so don't let that fool you. It is clear that the indirection-table implementation is slower in every case, averaging roughly a 30% slowdown. The array implementation is absent from the final test of 100 million nodes because it was unable to allocate an array large enough during a resize to hold all the nodes. The indirection version completes, since it never uses more space than it actually needs.

The complexity of modern processors makes it hard to reason about these performance numbers. On one hand, we expect the O(n) copy time of the array implementation to be problematic, but that cost is amortized over the majority of operations. For the indirection table, we might expect a large penalty for the extra pointer dereference, but with large amounts of cache the table may be largely cached, so we avoid paying the penalty on most accesses, especially near the root. With numbers in hand, it is hard to recommend one implementation over the other. In situations where memory is limited, the indirection version is more suitable. Additionally, if a system needs soft real-time characteristics, the indirection table is a better choice, as it will never perform a large copy on any single operation. However, if raw speed is all that matters, the array implementation is the winner. If the maximum number of elements is known in advance, the array can be pre-allocated to that size, avoiding the copy operations altogether and making it even faster (the indirection table can do the same and pre-allocate its pages, saving the allocation time).

I suppose the moral is never assume anything about the performance implications of a particular implementation. Always test and decide based on what criteria are most important to your application, as it’s not one size fits all.

Feel free to ask me any questions!
