2012-01-15

Finding the Highest M of N


How do I find the top M of N? The hundred largest of a million?

One day my boss came out of his office and said "I've got to get the highest hundred values of a bunch of data. I tried sorting it, but it's just taking too long."

He had many Megabytes of floats, and was trying to sort the whole thing in order to find the highest 100 values. There are two immediate approaches to this problem:

  • Sort N and Pick M (QSort) - Sort the whole list (of N items), then pull out the first M (100) items. (A sketch of this appears just after the list.)
  • Scan N and Sort M (Insertion Sort) - Create a list of M (100) items, initially filled with the first M items of the input list of N (1 million) items, sorted. Then walk through the rest of the "N" list and for each item put it in the "M" list if it is larger than the smallest value already in the "M" list. This is like an insertion sort, but one where the list is never allowed to grow beyond M items.
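In rough outline, the first approach looks something like this (an illustrative sketch assuming a plain vector of floats, with my own function name, not the code that was timed):

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Sort N and Pick M: sort the whole list descending, then the highest M
    // values are simply the first M elements.
    std::vector<float> top_m_by_full_sort(std::vector<float> data, std::size_t m)
    {
        std::sort(data.begin(), data.end(), std::greater<float>());  // O(N*log2(N))
        data.resize(std::min(m, data.size()));                       // keep the highest M
        return data;
    }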

Both of these are obviously unacceptable. In Big O notation the first option is (with a typical QSort, MergeSort, etc.) an O(N*log(N)) operation, and my boss was already trying it. With the example numbers, we're looking at roughly N*log2(N) = 1x10^6 * 20 = 2x10^7 operations.

The second option can be optimized a bit, but you need to be prepared to do an insertion sort for every one of your one million items. In practice, you would compare each item from the N list against the smallest value in your M list, and if it was smaller and the M list was already full, you'd skip the insertion. In Big O notation, this is O(N * M) = roughly 1x10^8 operations.
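A sketch of that second approach with the skip test (again illustrative, assuming floats; the function name is mine). Most items fail the single comparison against the smallest of the current best M and cost nothing more:

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Scan N and Sort M: keep a descending-sorted buffer of the best M seen so
    // far; only values larger than the buffer's smallest entry get inserted.
    std::vector<float> top_m_by_bounded_insertion(const std::vector<float>& data,
                                                  std::size_t m)
    {
        std::vector<float> best(data.begin(),
                                data.begin() + std::min(m, data.size()));
        std::sort(best.begin(), best.end(), std::greater<float>());

        for (std::size_t i = best.size(); i < data.size() && !best.empty(); ++i) {
            if (data[i] <= best.back())
                continue;                             // not in the top M: skip
            // Insert into sorted position, then drop the old smallest value.
            best.insert(std::upper_bound(best.begin(), best.end(), data[i],
                                         std::greater<float>()),
                        data[i]);
            best.pop_back();
        }
        return best;
    }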

Which one is worse depends on the values of N and M.

As an interesting corner case, what if the list is already sorted, or nearly so? The first option (Sort N and Pick M) can possibly fall apart, since the simplest Quicksort implementations have trouble with this case. The second option (Scan N and Sort M) will also fall apart, since it will have to insert every new entry into the M list (unless insertions land cheaply at the end of the M list, in which case it falls prey to a reverse-sorted list instead).

The Third Way: nth_element

Another way to do this is to do something like a quicksort, but without recursing into the side of the pivot that doesn't contain the element of interest. This is called Quickselect, and it generally runs in O(N) time, with a worst case of O(N^2). For sufficiently evenly distributed elements it would first process all N elements, then half of them, then a quarter, etc. The total number of elements processed converges to 2*N.

There is an STL algorithm to do this: nth_element. There is also a similar function, partition, which partitions around a caller-supplied predicate rather than a position. partial_sort will sort values up to a specific position in the list.
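A sketch of how nth_element applies here (the function name is mine; note that the M values come back unordered, and partial_sort could be used instead if they are also needed in sorted order):

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // Partition so that the M largest values occupy the first M positions
    // (in no particular order), in expected O(N) time.
    std::vector<float> top_m_by_nth_element(std::vector<float> data, std::size_t m)
    {
        m = std::min(m, data.size());
        std::nth_element(data.begin(), data.begin() + m, data.end(),
                         std::greater<float>());
        data.resize(m);  // a final sort of just these M items is only O(M*log2(M))
        return data;
    }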


Performance!

I wrote a Python script to draw some pretty graphs of performance, and also a C++ program to provide the metrics. I also have the specific results file graphed here.

First, a note on how to read the graphs. Time is the y axis, so lower values are better. N is the x axis, so values to the right are operating on larger sets of data. I've color coded datapoints by algorithm (red, green, blue), and shaded them by the value of M. Lighter shades correspond to smaller values of M.

I've also produced some theoretical performance graphs. These are not calibrated to actual performance. They should not match up with the actual metrics except that they should have the same sort of shape.


First up: Performance of Sort N Pick M (qsort)

"Sort N Pick M" sorts the entire sequence of N items, then picks out the first M items. I've provided results for two implementations. One uses the STL and one uses qsort. The qsort variety must call a user-provided function to compare the items, whereas the STL version can inline that call. The STL version seems to be doing better across the board, with the STL clocking at 0.000738 milliseconds versus 0.002162 for a datapoint near 10,000. Since these are log graphs you can see that the STL seems to be consistently about thrice as fast as using qsort, regardless of N. Yay STL!

Regardless of how many items are picked out (M) it takes about the same amount of time to run this algorithm. Since the results are all contiguous after the sort, that range can be returned without requiring a copy of the M elements. Even if a copy were necessary, the time to do it, O(M), is much less than the time to sort, O(N*log2(N)), and would not be particularly noticeable in the timing.

Notice the odd jog near N equal to one million. I looked into that, but the results aren't relevant to this discussion. (I may post details in the future.)


Second chance: Performance of Scan N and Sort M (insertion sort).

The second algorithm looks much different. This is essentially an insertion sort that doesn't bother inserting if the new value is known not to fit in the M "best" items. Imagine that M is one: how fast would it be? Pretty fast, given that it is essentially doing min() on N items. That is an O(N) operation, and a fast one at that. What if M is near N? Pretty slow, since insertion sort is an O(N^2) operation. Unless M is really close to N, in which case the initial qsort on the M items has already done just about all the work. That explains the initial low values on the graphs for the higher values of M (notice how they track the "Sort All" time from the graphs combining multiple algorithms).

There is a graph above that compares all the algorithms. It shows that for small values of M this algorithm can be faster than any of the others, but for the general case it is much slower.

How Many Inserts Are Required?

How often are inserts required? For randomly ordered data, the likelihood that the item at position i needs to be inserted is M/i. The expected number of inserts is then the sum, for i = M to N, of M/i. Approximating that sum with the integral from M to N of M/x dx gives M*ln(N) - M*ln(M) = M*ln(N/M).
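For example, with N = 1,000,000 and M = 100 that works out to about 100 * ln(10,000), or roughly 920 expected inserts. Only about a thousand of the million items scanned ever trigger an insertion; every other item costs just the single comparison against the current smallest of the best M.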

Below is a graph of the expected time to do an insertion sort to find the M lowest items of N items, where the M items are initially sorted and then the remaining items are insertion sorted. This graph, however, has M as the x axis. Increasing M increases the total time, except near the end, when M is nearly N and most items are efficiently dealt with by the O(M*log2(M)) sort. I have plotted the time to do a qsort on the entire list of N items for comparison; each curve meets its qsort line when M reaches N.

However, that graph is a bit misleading, since it appears that only very large values of M (as compared to N) will yield a performance gain. This is because the x axis is logarithmic. Actually the worst case is somewhere around M being half to two-thirds of N, as can be seen in the graph below, which shows the same data with a linear x axis.


Separating the Men From the Boys: Partitioning with nth_element

The graph of the nth_element algorithm is not too surprising. It operates mostly without regard to the value of M, as expected. It is noticeably faster than the Sort N Pick M algorithm. This algorithm is O(N), so it should be considerably faster than the O(N*log2(N)) Sort N Pick M algorithm.

But why does it appear to have the same slope as the Sort N Pick M algorithm? Consider the difference implied by the additional factor of log2(N). Log2(N) is 16.6 for N = 100,000 and 19.9 for N = 1,000,000. The increase for the red lines should therefore be only about 20% more than for the green. That 20% corresponds to about the width of the fifth-largest tick marks on the left hand side, which is a fairly small tick. In other words, once the numbers get into the millions, the log factor doesn't change the result very much incrementally. If the graph went to smaller numbers, or if the y axis were linear instead of logarithmic, the curve would likely be more obvious.


Coda: Heaps Better

By far, insertion sort is the best performing algorithm. Well, at least for small values of M. It's too bad I can't get the best of both worlds... Insertion Sort time for small M, and Partition time for larger M. Of course, I could always compare M to N and decide which algorithm to use, but that seems so hacky.

Why am I doing an insertion sort at all, when I don't really care what the sorted order is, but just what the highest values are?

Heaps are neat. They are a not-quite-sorted-but-maybe-sorted-enough type of data structure. Heaps can insert in O(log2(N)) time, and a heap can be constructed in O(N) time! (Knuth, The Art of Computer Programming, volume 3, section 5.2.3, page 155.)

Heaps always know what their highest item is, because it is in position 0. Heaps work by having the property that the element at index i is always greater than the elements at indices 2*i+1 and 2*i+2 (with the first element at index 0). They can be thought of as a nearly balanced binary tree where each node is greater than its children. It is not a red/black tree or any other type of sorted list.

To insert you figure out where the next item goes (index i, where i is the current length of the list) and put the new item there. Then you percolate that point up:

  • If the current element is larger than its parent (at index floor((i-1)/2) for zero-based indexing) then swap the two.
  • If there is no swap necessary, then the insert is complete.
  • If a swap was necessary, then repeat the check at the parent's location.

To pop the highest item, you take the value at the root (the highest value), move the last item in the list into the root position, and shrink the list by one. Then you sift that item down: compare it with its children, swap it with the larger child if that child is bigger, and repeat at the new location until neither child is larger.
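A minimal sketch of both operations on a zero-indexed max-heap stored in a vector (illustrative code of my own; it assumes the heap is non-empty where it needs to be):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Insert: append at the end, then percolate up while larger than the parent.
    void heap_insert(std::vector<float>& heap, float value)
    {
        heap.push_back(value);
        std::size_t i = heap.size() - 1;
        while (i > 0) {
            std::size_t parent = (i - 1) / 2;       // parent index, zero-based
            if (!(heap[i] > heap[parent]))
                break;                              // heap property holds; done
            std::swap(heap[i], heap[parent]);
            i = parent;                             // repeat the check at the parent's location
        }
    }

    // Pop: take the root (the highest value), move the last item into its place,
    // then sift it down toward the larger child until neither child is larger.
    float heap_pop(std::vector<float>& heap)
    {
        float top = heap[0];                        // assumes a non-empty heap
        heap[0] = heap.back();
        heap.pop_back();
        std::size_t i = 0;
        for (;;) {
            std::size_t left = 2 * i + 1, right = 2 * i + 2, largest = i;
            if (left  < heap.size() && heap[left]  > heap[largest]) largest = left;
            if (right < heap.size() && heap[right] > heap[largest]) largest = right;
            if (largest == i)
                break;
            std::swap(heap[i], heap[largest]);
            i = largest;
        }
        return top;
    }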

The Heapsort algorithm takes an unordered list, turns it into a heap, then burns through the heap, popping off the highest item and placing it in the slot vacated by the last item. The result is a sorted list. Heap construction takes O(N) time, but the popping takes O(N*log2(N)) time, the same as other sorts. The advantage of Heapsort is that it neither takes additional memory (like merge sort) nor has a worst-case scenario of O(N^2) (like Quicksort). Typical run time is worse than Quicksort's, however.
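With the STL this can be spelled out directly; a brief sketch:

    #include <algorithm>
    #include <vector>

    void heapsort(std::vector<float>& data)
    {
        std::make_heap(data.begin(), data.end());   // O(N) heap construction
        std::sort_heap(data.begin(), data.end());   // repeatedly pops the max into place, O(N*log2(N))
    }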

Heaps are ideal for priority lists that will be modified while in use. They can fetch the highest item in O(1) time, and insert or remove items in O(log N) time. This makes them a good choice for Huffman encoding.

What can't heaps do? Well, you can't traverse a heap in sorted order very easily... you basically have to pop off all the items in turn. They are most useful when you just need the highest item, or when the number of inserts and deletes between full sorted traversals is significantly larger than the size of the heap.

Instead of using an insertion sort, if I just maintain the current best M items as a heap, then I can easily compare each new value against the heap's top (the worst of the current best), even after an insertion. For evenly distributed numbers, the likelihood of needing to insert is M/i, where i is the index of the current point. If an insertion is needed it will take O(log2(M)) instead of the O(M) an insertion sort needs.

Doing this cuts down on the time required for the inserting part of the insertion sort. The result will need a final sort pass if the items are needed in order rather than merely identified (this is also true of the nth_element version).
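Here is a sketch of that idea for the highest-M case (the lowest-M case is symmetric), using std::priority_queue as a min-heap of the current best M so the weakest of them is always on top for the cheap comparison. This is illustrative, not the code that was timed, and the names are mine:

    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <vector>

    // Keep the best M seen so far in a min-heap; most items lose the comparison
    // against the heap top (the smallest of the current best) and are skipped.
    std::vector<float> top_m_by_bounded_heap(const std::vector<float>& data,
                                             std::size_t m)
    {
        std::priority_queue<float, std::vector<float>, std::greater<float> > best;
        for (std::size_t i = 0; i < data.size(); ++i) {
            if (best.size() < m) {
                best.push(data[i]);                 // still filling the first M
            } else if (m > 0 && data[i] > best.top()) {
                best.pop();                         // O(log2(M)) replacement of the current minimum
                best.push(data[i]);
            }
        }
        std::vector<float> result;                  // drain the heap; ascending order
        while (!best.empty()) { result.push_back(best.top()); best.pop(); }
        return result;
    }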

Yet Another Heap Based Algorithm

Another possible use of a heap would be to heap the entire list, then pull out the top M items. The performance of this is quite simple: O(N) to build the heap, then O(M*log(N)) to pull out the items. (The heap shrinks as items are pulled out, which reduces the log(N) component slightly, but since the log(N) cost holds for at least half of the pops, the total would still be O(M*log(N)) even if the other half took zero time.)
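A sketch of that variant using the STL heap primitives (again my own naming, not the benchmark code):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Heap the entire list in O(N), then pop the top M items; each pop costs
    // O(log2(N)) as std::pop_heap sifts the displaced element back down.
    std::vector<float> top_m_by_heap_all(std::vector<float> data, std::size_t m)
    {
        std::make_heap(data.begin(), data.end());          // max-heap over all N items
        m = std::min(m, data.size());
        std::vector<float> result;
        for (std::size_t k = 0; k < m; ++k) {
            std::pop_heap(data.begin(), data.end() - k);   // current max moves to the back of the heap range
            result.push_back(*(data.end() - k - 1));
        }
        return result;                                     // the highest M, in descending order
    }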

However, as can be seen from the graph below, this new algorithm never appears to be better than using a heap for an insertion sort. It only becomes competitive well after the point where nth_element would have beaten both heap algorithms.

Heap Performance

The performance is quite reminiscent of the insertion sort, except that it does not degrade as much for larger M. Notice that the dip at the first value is still present; i.e., if M is equal to N then the algorithm is just doing a large qsort with no insertions. And yet the line for "sort all" is actually above these lower first values. I wonder why that is.

The performance does still get worse than the nth_element case, but instead of being about 100x as slow (or more), the heap implementation seems to max out at about ten times as slow. (Below is a graph of all the implementations.)

Theoretical Results

Here are the complexities of the various algorithms:
    Algorithm                               Complexity
    Sort N and Pick M (QSort)               O(N*log2(N))
    Scan N and Sort M (Insertion sort)      O(M*log2(M) + (N-M)*M)
    nth_element                             O(N)
    Heap insertion sort                     O(M + (N-M) + M*log2(M)*ln(N/M))
    Heap All and Pop                        O(N + M*log2(N))

Here is a graph of the theoretical complexity of the algorithms.

The theoretical analysis of the complexity of the algorithms indicates that the Heap insertion sort is the fastest for small M, but that its cost grows as M grows. While it may always be faster for any N at small values of M, there will be values of M large enough to justify using nth_element instead.


Actual Results

For low M, the Heap and Insertion Sort methods have the same performance. This makes sense, since the benefit of using a heap over a straight sorted list is log2(M) versus M per insertion, and M versus M*log2(M) for the initial construction. As M approaches one, these approach equivalence.

Bottom line: the Heap Insertion Sort is always at least as good as the regular insertion sort, but can still be beaten by nth_element for larger values of M.


Finding the Highest M of N by Mark Santesson is licensed under a Creative Commons Attribution 3.0 Unported License.
