2012-10-11

My Development Tools






Whether developing on my desktop or laptop, Linux or Windows, my tools of choice are pretty bare-bones. I edit my files in vim and I incrementally develop my program by using my "kr" script. "kr" watches whatever files you give it and reruns a command whenever it detects a change. For example:



> kr *.py -c 'tests.py'


Would look for a change in any .py file and rerun the "tests.py" script.



> kr *.c *.h makefile -c 'make ; application param1 param2'


Would look for a change in any .c/.h file, or the makefile, and when a change is detected it remakes and runs "application" with the two params "param1" and "param2".



Developing is easy: just save the file I'm editing and look in the other window to see what happens with the new code. Usually I use this for re-running a Python program, which I develop a bit at a time. I fix an error, save, and quickly see the next error. It works best, of course, when the program runs quickly.



I've also used it to check for a data file change, at which point it regenerates an image. In another window a separate "kr" script looks for a change in the image file, and runs a program to render it. (The two "kr" scripts had to run on different machines, or else they could have been combined.)


Find the current version here.


#Copyright 2012 Mark Santesson
#
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.

import sys
import time
import re
import stat
import os
import glob
import operator
import optparse


if __name__ == "__main__":

    parser = optparse.OptionParser (
                        usage = "%prog -c 'command' files ...",
                        description = 'Repeatedly run a command whenever any one of a list of files changes.')
    parser.add_option('-c', '--command', dest='command', type="string",
            help='Specify command to run when a change is detected')
    options, args = parser.parse_args()

#    print "Found args: " + repr(args)
#    print "Found command: " + repr(options.command)

    if not args:
        print '\n' + os.path.basename( sys.argv[0] ) + ' requires some files to monitor.'
        sys.exit(1)

    if not options.command:
        print '\nError: A command to run is required.'
        sys.exit(1)


    file_times = dict()
    globbed = sum( [ glob.glob(x) for x in args ], [] )
    for x in globbed:
        file_times[x] = 0.0

    print file_times

    while True:
        rerun = False

        for f in file_times.keys():
            attempts_left = 5
            t = None
            while attempts_left and t is None:
                try:
                    t = os.stat(f)[ stat.ST_MTIME ]
                except OSError:
                    # The file may be briefly missing or locked while the
                    # editor saves it; wait a moment and retry a few times.
                    # (OSError covers WindowsError on Windows and also
                    # works on Linux.)
                    time.sleep(0.1)
                    attempts_left -= 1

            if t is not None and t > file_times[f]:
                rerun = True
                file_times[f] = t

        if rerun:
            print time.strftime('\n\n--- Rerunning at %H:%M:%S :') + repr(options.command)
            ret = os.system( options.command )

            if ret:
                print '\n\n--- ERRORED %r' % (ret,) + time.strftime(' at %H:%M:%S.')

            else:
                print time.strftime('\n\n--- Done at %H:%M:%S.')

        else:
            time.sleep(0.5)

else:
    raise Exception( 'This module is not meant to be imported.' )








Creative Commons License
My Development Tools by Mark Santesson is licensed under a Creative Commons Attribution 3.0 Unported License.





2012-09-04

Detecting Loops in Linked Lists






I was interviewing at a company and one of the questions was how to detect loops in a linked list.

A linked list is where there are a bunch of objects... perhaps bits of memory that have been allocated... and each one points to the next one in the list. Usually, the last one points to NULL. Sometimes the last one points to the "first" one. But then, in such a list, what does "first" really mean? (Perhaps how you refer to the list as a whole.)

Sometimes you may have a non-circular list where the last element points to an item elsewhere in the list, rather than NULL. Or perhaps you think you have a circular list but the last element does point to NULL. That would be an error. You'd like to be able to detect it... and quickly. To do this, you need to be able to determine if a list has a loop, but you can't just scan through it looking for NULL or the start value, because the "last" item might not point to the "first"; the last node could point to the node just before it. Or the one before that...

This is the sort of thing I probably saw in college but didn't remember. On the spot, I came up with the following algorithm:

Jumping by Powers of Two Algorithm


 bool DoesListLoop( Node const * cur )
 {
  Node const * base = cur;
  int jump = 1;
  int remain = 1;
  while (cur) 
  {
   cur = cur->next;
   if (cur == base) return true;
   if (! --remain)
   {
    jump *= 2;
    remain = jump;
    base = cur;
   }
  }
  return false;
 }

In short, this races a pointer along the list, and each time it has traversed a number of nodes equal to the current power of two, it doubles that power and resets the base pointer to the current pointer. Each time it moves the "current" pointer ahead it compares it to the base pointer. A match indicates a loop. Finding a NULL means the list ended and there is no loop.
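The post doesn't show the Node type or a test harness, so here is a minimal, hypothetical one (the struct layout and the test list are my own assumptions, not from the original):

#include <cassert>
#include <cstddef>
#include <vector>

struct Node
{
    Node const * next;
};

bool DoesListLoop( Node const * cur );   // the "jumping" function shown above

int main()
{
    // Build a five-node list: 0 -> 1 -> 2 -> 3 -> 4 -> NULL.
    std::vector<Node> nodes(5);
    for (std::size_t i = 0; i + 1 < nodes.size(); ++i)
        nodes[i].next = &nodes[i + 1];
    nodes.back().next = NULL;

    assert( !DoesListLoop( &nodes[0] ) );   // reaches NULL: no loop

    nodes.back().next = &nodes[2];          // last node now points back into the list
    assert( DoesListLoop( &nodes[0] ) );    // loop detected

    return 0;
}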

More fundamentally, any solution to this problem must advance a pointer through the list forever. If the end of the list is found then there is no loop. If it loops then you will never reach the end, but in order to discern this from an arbitrarily long list you need to match one node with another node previously seen. In order to match loops that loop back to a point arbitrarily far into the list you need to look arbitrarily far ahead of the start, but also arbitrarily far behind the node you are checking in the front. You need to let the "cur" pointer get farther and farther from both what you are comparing against and the start of the list. I did this by resetting the "base" pointer at larger and larger intervals.

Back at the interview, they asked me "do you mean, that you have two pointers, and you advance one pointer two steps for every one step that you advance the other pointer?"

"That sounds like it would work, but, no, I don't mean that." I explained my idea in more detail. They said they thought that it would work, but they wanted to talk about their version some more. Presumably they thought it was a better algorithm (but perhaps they just thought it was simpler and made for a better interview discussion). I think they were wrong. As I've since learned, they meant Floyd's Cycle-Finding Algorithm, published in 1967, also known as "The Tortoise and the Hare Algorithm":

"Two Steps for the Forward Pointer, One Step for the Back Pointer"


 bool DoesListLoop( Node const * cur )
 {
  Node const * base = cur;
  if (!cur) return false;
  while (true)
  {
   cur = cur->next;
   if ( ! cur ) return false;
   if ( base == cur ) return true;

   cur = cur->next;
   if ( ! cur ) return false;
   if ( base == cur ) return true;

   base = base->next;
  }
 }

How do we measure performance?

As far as performance is concerned, we want to minimize the time (or number of operations) required to determine whether a list is looping or not. We might want to optimize for the non-looping case since that is probably the most likely case (if loops are errors), and we'd like to be able to identify those quickly. We also may have short or long lists, the looping part might be shorter or longer, and the size of the nodes might vary.

This graph shows the number of operations to solve the various cases. Here I take an operation to be a memory fetch to look at the next node. I don't count branching as a cost. (Branch prediction should work very well in this case.) This certainly isn't a perfect estimate, but it does capture what I think proves to be the major cost.
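The simulation code isn't included in the post; the following is a rough sketch of how such fetch counts could be produced. Both algorithms are instrumented so that every advance of a pointer along the list counts as one fetch (the names and list sizes here are mine):

#include <cstddef>
#include <cstdio>
#include <vector>

struct Node { Node const * next; };

static long g_fetches = 0;

// Every advance along the list counts as one "memory fetch".
static Node const * Advance( Node const * p ) { ++g_fetches; return p->next; }

static bool JumpingLoops( Node const * cur )            // "Jumping by Powers of Two"
{
    Node const * base = cur;
    int jump = 1, remain = 1;
    while (cur)
    {
        cur = Advance(cur);
        if (cur == base) return true;
        if (! --remain) { jump *= 2; remain = jump; base = cur; }
    }
    return false;
}

static bool HalfSpeedLoops( Node const * cur )          // "Tortoise and Hare"
{
    Node const * base = cur;
    if (!cur) return false;
    while (true)
    {
        cur = Advance(cur); if (!cur) return false; if (cur == base) return true;
        cur = Advance(cur); if (!cur) return false; if (cur == base) return true;
        base = Advance(base);
    }
}

int main()
{
    const int n = 1000;
    std::vector<Node> nodes(n);
    for (int i = 0; i + 1 < n; ++i) nodes[i].next = &nodes[i+1];

    // loop_to == -1 means no loop; otherwise the last node points back to it.
    for (int loop_to = -1; loop_to < n; loop_to += 100)
    {
        nodes[n-1].next = (loop_to < 0) ? NULL : &nodes[loop_to];

        g_fetches = 0; JumpingLoops( &nodes[0] );   long jumping = g_fetches;
        g_fetches = 0; HalfSpeedLoops( &nodes[0] ); long half    = g_fetches;

        printf( "loop_to=%4d  jumping=%6ld  half-speed=%6ld\n", loop_to, jumping, half );
    }
    return 0;
}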

Theoretical cost of the two algorithms. The "Half-Speed" algorithm is in green. My "Jumping by Powers of Two" algorithm is in red. The Y axis counts "operations" to solve the problem. An operation is a memory fetch. The X axis is the number of elements in the list. Lighter shades indicate lists where the loop goes back closer to the beginning of the list. Darker shades indicate looping more towards the end of the list. The single lightest shade indicates no loop. The blue points indicate the benefit of the jumping algorithm over the half-speed algorithm.


Theoretical cost of the two algorithms, as above, but with a log-log scale and without the blue difference points.


From this simulation we can see that the jumping algorithm is always as good as or better than the half-speed algorithm. It should be restated, though, that this is just a simulation of the memory fetching cost. For small lists that fit entirely into local cache, the real cost of the half-speed case is only about 2/3 of what the model charges it, and in that situation there would be cases where the half-speed algorithm was better.

Lighter shades (looping goes back closer to the beginning) generally appear to be harder to solve.

Also note that the performance of the two algorithms doesn't diverge... they always stay pretty close, at least on a log scale. This is consistent with them being of the same order, O(n).

It is nice to see that the theory confirms our expectations, but models are no proof of reality, so let's move on to actual recorded metrics.

Demonstrated Performance

I've implemented the algorithms in C++ and timed them on a Linux desktop using low-end hardware with a vintage around 2010.




Actual performance. Red is my "jumping" algorithm, green is the half-speed algorithm. The X axis is the number of nodes in the list. Color indicates where the list loops back to... lighter colors loop back closer to the beginning, and the lightest color does not loop.






The difference between the two algorithms. Positive numbers reflect an advantage to my jumping algorithm.



These results track well with the theoretical results (note the log-log graph). It appears to be more or less a wash until the size reaches well over a million and then (when it no longer fits in cache?) the jumping algorithm takes off.

Here's a graph of the algorithms when looking only at a non-looping list.




Performance with different sizes of nodes. Theoretically, cache effects might be exposed at earlier list sizes. The differences are not terribly obvious, although jumping still seems to perform better than half-speed.






The difference between the two algorithms. Positive numbers reflect an advantage to my jumping algorithm. There doesn't seem to be much pattern to the distribution of node size (shade).



Finally, here's the most important graph: relative performance for verifying a non-looping list, as a function of the size of the list. This would normally be the typical use case, which is what makes it the most important graph. If a list did loop, then we wouldn't be as concerned about the performance, since we're probably about to report an error and stop the program.




Performance of verifying that a list does not loop.






The difference between the two algorithms for non-looping lists. The scale is percentage of the half-speed algorithm's time to completion. The value indicates how much less time the jumping case took.



As it turns out, the jumping algorithm isn't really much better than the half-speed algorithm in my theoretically most differentiating test case. From the difference graph it appears that the benefit does continue increasing as the length of the list increases. This indicates that the benefit is not kicking in for a while, presumably because of cache.

Theoretically, not traversing the list twice should yield significant benefits once the memory that is used exceeds the cache size. (Branch prediction would work very well in this case, but the processor still needs to get the memory in before it can fetch the next memory location.) This seems to be what we are seeing.

Parting Thoughts

If there are n elements in the list, both of these algorithms will terminate before running through about 2n nodes. They are both O(n). Can we do better? While we certainly can't improve on O(n), perhaps we can run through fewer nodes.

Ideally we'd be able to quickly check if we'd ever seen a node before, and therefore terminate after n nodes when we either find the end or find a node which we've seen before. However, the naive implementation would have us checking each node against each node we've already seen, which makes the algorithm O(n^2).

Theoretically, a hash map would let us find whether we'd already seen a node in constant time, but the hash inserts and lookups would probably take longer than the 2x multiple that we're trying to remove.
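For illustration, here is a sketch of that hash-based variant using std::unordered_set (C++11); it trades O(n) extra memory for a single pass of about n node visits:

#include <unordered_set>

struct Node { Node const * next; };

bool DoesListLoopHashed( Node const * cur )
{
    std::unordered_set<Node const *> seen;
    while (cur)
    {
        if (!seen.insert(cur).second)   // already present: we have looped
            return true;
        cur = cur->next;
    }
    return false;                       // reached NULL: no loop
}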

I think this is the best that can be done, excepting, of course, if you can mark flags to indicate whether a node has been visited before. There is a page linked below that has details on this method.

A silly digression in which an O(1) algorithm is presented

Interestingly, there is an algorithm that technically qualifies as O(1) on real computers. Specifically, you can race a pointer along the list, counting how many nodes you've seen. If the count exceeds the number of nodes that could possibly fit in the available memory, then you must have a loop. This is worst-case O(1) because the number of operations is constant with respect to the list size... it only varies with the size of memory that is available to contain the nodes, which is presumably constant. But if you consider the typical size of linked lists in comparison to even a 32-bit address space, you can quickly see that this is also about the most inefficient algorithm imaginable. And so we have an O(1) algorithm which is probably the least efficient algorithm.

Of course, it all depends on what your independent variable "n" represents. If it represents the number of nodes in your list, then this is O(1) (the number of operations is bounded by a constant, the memory size, regardless of how large the list is). If it represents memory size, or the maximum possible number of nodes, then this is O(n), and probably much less efficient than the other algorithms listed. Always use an "n" which is meaningful for the analysis.

Hypothetical Case

Performance often depends upon the use case. For instance, if we are usually dealing with a non-looping list and just want to verify that quickly, then a better algorithm might be to track the expected length of the list and just race to the end. If it hasn't ended, then a more exhaustive search could be done. Or, if it is expected to loop, and always to the beginning (circular linked list), then the search can optimize for that.
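A minimal sketch of that "race to the expected end" check; the function name and the expected_length parameter are mine:

#include <cstddef>

struct Node { Node const * next; };

// Returns true if the list reaches NULL within expected_length nodes.
// If it doesn't, fall back to a full loop-detection pass.
bool TerminatesWithin( Node const * cur, std::size_t expected_length )
{
    for (std::size_t i = 0; i <= expected_length; ++i)
    {
        if (!cur) return true;          // found the end where we expected it
        cur = cur->next;
    }
    return false;                       // longer than expected: investigate further
}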

Suppose there is a prioritized queue of elements. The elements are allocated as a std::vector because changing the size is infrequent and the number only ever grows. The elements refer to each other through offsets. The order rearranges as the priority of the various items shift around. For whatever reason, we want to be sure that the list terminates and doesn't loop. (Perhaps we are implementing the priority queue for Dijkstra's algorithm.) What is the most efficient way to do this? I suspect that the answer is in the previous (silly) section.

Links:


Creative Commons License
Detecting Loops in Linked Lists by Mark Santesson is licensed under a Creative Commons Attribution 3.0 Unported License.

2012-01-15

Finding the Highest M of N


How do I find the top M of N? The hundred largest of a million?

One day my boss came out of his office and said "I've got to get the highest hundred values of a bunch of data. I tried sorting it, but it's just taking too long."

He had many Megabytes of floats, and was trying to sort the whole thing in order to find the highest 100 values. There are two immediate approaches to this problem:

  • Sort N and Pick M (QSort)- Sort the whole list (of N items), pull out the first M (100) items.
  • Scan N and Sort M (Insertion Sort) - Create a list of M (100) items, initially filled with the first M items of the input list of N (1 million) items, sorted. Then walk through the rest of the "N" list and for each item put it in the "M" list if it is larger than the smallest value already in the "M" list. This is like an insertion sort, but where you don't let it grow beyond M items.

Both of these are obviously unacceptable. In Big O notation the first option is (with a typical quicksort, merge sort, etc.) an O(N*log(N)) operation, and my boss was already trying it. With the example numbers, we're looking at roughly N*log2(N) = 1x10^6 * 20 = 2x10^7 operations.

The second option can be optimized a bit, but you need to be prepared to do an insertion sort for every one of your one million items. In practice, you would compare each item from the N list against the smallest value in your M list, and if it was smaller and the M list was already full, then you'd skip the insertion. In Big O notation, this is O(N*M), about 1x10^8 operations with the example numbers.
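Here is a sketch of that Scan N and Sort M idea, keeping the best M values seen so far in a small vector sorted in descending order. The function name is mine, and this is not the code used for the measurements below:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Keep the M largest values in "best", sorted descending; insert a new value
// only when it beats the smallest value currently kept.
// Assumes 1 <= m <= data.size().
std::vector<float> TopMByInsertion( const std::vector<float>& data, std::size_t m )
{
    std::vector<float> best( data.begin(), data.begin() + m );
    std::sort( best.begin(), best.end(), std::greater<float>() );

    for (std::size_t i = m; i < data.size(); ++i)
    {
        if (data[i] <= best.back())
            continue;                   // can't make the cut: skip the insertion
        std::vector<float>::iterator pos =
            std::upper_bound( best.begin(), best.end(), data[i], std::greater<float>() );
        best.insert( pos, data[i] );    // insertion-sort step, O(M)
        best.pop_back();                // drop the new smallest to keep M items
    }
    return best;
}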

Which one is worse depends on the values of N and M.

As an interesting corner case, what if the list is already sorted, or nearly so? The first option (Sort N and Pick M) can possibly fall apart, since the simplest Quicksort implementations have trouble with this case. The second option (Scan N and Sort M) will also fall apart, since it will have to insert every new entry into the M list (unless you seed the M list from the end of the input, in which case it falls prey to a reverse sorted list instead).

The Third Way: nth_element

Another way to do this is to do something like a quicksort, but without recursing on the side of the pivot that doesn't contain the element of interest. This is called Quickselect, and generally runs in O(N) time, with a worst case of O(N^2). For sufficiently evenly distributed elements this would first process all elements, then half the elements, then a quarter, etc. The total number of elements processed converges to 2*N.

There is an STL algorithm to do this: nth_element. There is also the related partition, which splits a sequence around a predicate (such as a comparison against a chosen pivot value), and partial_sort, which sorts values up to a specific position in the range.
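A minimal example of using nth_element for this task, with a descending comparator so that the largest M values end up at the front of the vector:

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <vector>

int main()
{
    std::vector<float> data;
    for (int i = 0; i < 1000000; ++i)
        data.push_back( (float)rand() / RAND_MAX );

    const int M = 100;

    // After this call the M largest values occupy data[0..M-1] (in no
    // particular order), and everything from data[M] onwards is <= them.
    std::nth_element( data.begin(), data.begin() + M, data.end(),
                      std::greater<float>() );

    // Sort just the first M if they are needed in order.
    std::sort( data.begin(), data.begin() + M, std::greater<float>() );

    printf( "Largest: %f, %dth largest: %f\n", data[0], M, data[M-1] );
    return 0;
}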


Performance!

I wrote a Python script to draw some pretty graphs of performance, and also a C++ program to provide the metrics. I also have the specific results file graphed here.

First a note on how to read the graphs. Time is the y axis, so lower values are better. N is the X axis, so values to the right are operating on larger sets of data. I've color coded datapoints by algorithm (red, green, blue), and shaded them by the value of "M". Lighter shades correspond to smaller areas being sorted.

I've also produced some theoretical performance graphs. These are not calibrated to actual performance. They should not match up with the actual metrics except that they should have the same sort of shape.


First up: Performance of Sort N Pick M (qsort)

"Sort N Pick M" sorts the entire sequence of N items, then picks out the first M items. I've provided results for two implementations. One uses the STL and one uses qsort. The qsort variety must call a user-provided function to compare the items, whereas the STL version can inline that call. The STL version seems to be doing better across the board, with the STL clocking at 0.000738 milliseconds versus 0.002162 for a datapoint near 10,000. Since these are log graphs you can see that the STL seems to be consistently about thrice as fast as using qsort, regardless of N. Yay STL!

Regardless of how many items are picked out (M) it takes about the same amount of time to run this algorithm. Since the results are all contiguous after the sort, that range can be returned without requiring a copy of the M elements. Even if this was necessary, the time to do this, O(M), is much less than the time to sort, O(N*log2N), and would not be particularly noticeable in the timing.

Notice the odd jog near N equal to one million. I looked into that, but the results aren't relevant to this discussion. (I may post details in the future.)


Second chance: Performance of Scan N and Sort M (insertion sort).

The second algorithm looks much different. This is essentially an insertion sort where it doesn't bother inserting if the new value is known to not fit in the M "best" items. Imagine that M is one: how fast would it be? Pretty fast, given that it is essentially doing min() on N items. That is an O(N) operation, and a fast one at that. What about if M is near N? Pretty slow, since insertion sort is an O(N^2) operation. Unless M is really near to N, in which case the initial qsort on the M items has already done just about all the work. That explains the initial low values on the graphs for the higher values of M (notice how they track the "Sort All" time from the graphs combining multiple algorithms).

There is a graph above that compares all the algorithms. It shows that for small values of M this algorithm can be faster than any of the others, but for the general case it is much slower.

How Many Inserts Are Required?

How often are inserts required? The likelihood that the item at position i requires an insert is M/i. How many inserts are required in total, then? The sum from i=M to N of M/i, which we can approximate by integrating M/x from M to N. That gives M*ln(N) - M*ln(M) = M*ln(N/M).
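Written out (approximating the sum by an integral), the estimate is:

\[
E[\mathrm{inserts}] \;=\; \sum_{i=M}^{N} \frac{M}{i}
\;\approx\; \int_{M}^{N} \frac{M}{x}\,dx
\;=\; M\ln N - M\ln M
\;=\; M\ln\frac{N}{M}
\]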

Below is a graph of the expected time to do an insertion sort to find the M lowest items of N items, where the M items are initially sorted, and then the remaining items are insertion sorted. This graph, however, has M as the x axis. Increasing M increases the total time, except near the end when M is nearly N and most items are efficiently dealt with by the O(M*log2M) sort. I have plotted the time to do a qsort on the entire list of N items for comparison, and each curve meets the qsort line for its N when M reaches N.

However, that graph is a bit misleading since it appears that only very large values of M as compared to N will yield a performance gain. This is because the x axis is presented with logarithmic progression. Actually the worst case is somewhere around half to two-thirds, as can be seen by the graph below which shows the same data but displays the x axis linearly.


Separating the Men From the Boys: Partitioning with nth_element

The graph of the nth_element algorithm is not too surprising. It operates mostly without regard to the value of M, as expected. It is noticeably faster than the Sort N Pick M algorithm. This algorithm is O(N), so it should be considerably faster than the O(N*log2N) Sort N Pick M algorithm.

But why does it appear to have the same slope as the Sort N Pick M algorithm? Consider the difference implied by the additional factor of log2N. Log2(N) equals 16.6 for N=100,000 and 19.9 for N=1,000,000. The increase of the red lines should be 20% more than for the green. That 20% should correspond to the width of the fifth largest tick marks on the left hand side. That is a fairly small tick. In other words, after the numbers get into the millions the log part doesn't affect the result very much incrementally. If the graph went to smaller numbers, or if the y axis were linear instead of logarithmic, then the curve would likely be more obvious.


Coda: Heaps Better

By far, insertion sort is the best performing algorithm. Well, at least for small values of M. It's too bad I can't get the best of both worlds... Insertion Sort time for small M, and Partition time for larger M. Of course, I could always compare M to N and decide which algorithm to use, but that seems so hacky.

Why am I doing an insertion sort at all, when I don't really care what the sorted order is, but just what the highest values are?

Heaps are neat. They are a not-quite-sorted,-but-maybe-sorted-enough type of data structure. Heaps can insert in O(log2N) time. A heap can be constructed in O(N) time! (Knuth, The Art of Computer Programming, volume 3, section 5.2.3, page 155.)

Heaps always know what their highest item is, because it is in position 0. Heaps work by having the property that the element at index i is always greater than the elements at indices 2*i+1 and 2*i+2 (with the first element at index 0). They can be thought of as a nearly balanced binary tree where each node is greater than its children. A heap is not a red/black tree or any other type of sorted list.

To insert you figure out where the next item goes (index i, where i is the current length of the list) and put the new item there. Then you percolate that point up:

  • If the current element is larger than the parent (at index floor((i-1)/2)) then swap the two.
  • If there is no swap necessary, then the insert is complete.
  • If a swap was necessary, then repeat the check at the parent's location.

To pop the highest item, you take the item at position 0 as the result, move the last item in the list into position 0, shrink the list by one, and then sift that item down: compare it with the larger of its two children and swap if that child is bigger, repeating from the child's old position until no swap is needed.
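Here is a sketch of those two operations on a max-heap stored in a std::vector, using the 0-based indexing described above (parent of i is (i-1)/2). This is illustrative code with names of my choosing, not the implementation used for the timings:

#include <algorithm>
#include <cstddef>
#include <vector>

void HeapPush( std::vector<int>& h, int value )
{
    h.push_back( value );
    std::size_t i = h.size() - 1;
    while (i > 0)                                   // percolate up
    {
        std::size_t parent = (i - 1) / 2;
        if (h[parent] >= h[i]) break;               // heap property holds: done
        std::swap( h[parent], h[i] );
        i = parent;
    }
}

int HeapPop( std::vector<int>& h )                  // assumes h is non-empty
{
    int top = h[0];
    h[0] = h.back();                                // move last item to the root
    h.pop_back();
    std::size_t i = 0;
    while (true)                                    // sift down
    {
        std::size_t left = 2*i + 1, right = 2*i + 2, largest = i;
        if (left  < h.size() && h[left]  > h[largest]) largest = left;
        if (right < h.size() && h[right] > h[largest]) largest = right;
        if (largest == i) break;
        std::swap( h[i], h[largest] );
        i = largest;
    }
    return top;
}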

The Heapsort algorithm takes an unordered list, turns it into a heap, then burns through the heap, popping off the highest item and placing it in place of the last item. The result is a sorted list. Heap construction takes O(N) time, but the popping takes O(N*log2N) time, same as other sorts. The advantage of Heapsort is that it neither takes additional memory (like merge sort) nor has a worst-case scenario of O(N^2) (like Quicksort). Typical run time is worse than Quicksort, however.

Heaps are ideal for priority lists that will be modified while in use. They can fetch the highest item in O(1) time and insert or delete items in O(log N) time. This makes them a good choice for Huffman encoding.

What can't heaps do? Well, you can't traverse a heap in sorted order very easily... you basically have to pop off all the items in turn. They are most useful when you just need the highest item, or when the number of inserts and deletes is significantly larger than the size of the heap for each time it is walked in a sorted order.

Instead of using an insertion sort, if I just maintain the lowest M items as a heap, then I can easily compare to the highest value, even after an insertion. For evenly distributed numbers, the likelihood of needing to insert into the M list is M/i, where i is the index of the current item. When an insertion is needed it will take O(log2M) instead of the O(M) an insertion sort needs.

If I do this then we can cut down on the time required for the inserting part of the insertion sort. The result will need a final sort pass if we need items in the correct order instead of just the items listed (this is also true of the nth_element version).
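A sketch of that heap-based scan, written here for finding the highest M (the mirror image of the lowest-M description above): keep the M largest values seen so far in a min-heap, so the weakest of the current best is always at the top for comparison. The function name is mine:

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

std::vector<float> TopMByHeap( const std::vector<float>& data, std::size_t m )
{
    // Min-heap of the best-so-far: top() is the smallest of the kept values.
    std::priority_queue< float, std::vector<float>, std::greater<float> > best;

    for (std::size_t i = 0; i < data.size(); ++i)
    {
        if (best.size() < m)
            best.push( data[i] );
        else if (data[i] > best.top())
        {
            best.pop();                 // O(log M), and only when a value qualifies
            best.push( data[i] );
        }
    }

    std::vector<float> result;          // drain the heap: smallest of the M first
    while (!best.empty()) { result.push_back( best.top() ); best.pop(); }
    return result;
}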

Yet Another Heap Based Algorithm

Another possible use of a heap would be to heap the entire list, then pull out the top M items. The performance of this is quite simple: O(N) to heap, then M*log(N) to pull out the items. (Note that the depth of the tree decreases, thereby reducing the log(N) component as items are pulled out. Given that the log(N) value holds for half the items, even if the other half takes zero time it would still be the same O(M*log(N)).)
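A sketch of this Heap All and Pop approach using std::make_heap and std::pop_heap, again written for the highest M (the function name is mine):

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<float> TopMByHeapAll( std::vector<float> data, std::size_t m )
{
    std::make_heap( data.begin(), data.end() );             // O(N)

    std::vector<float> result;
    for (std::size_t i = 0; i < m && i < data.size(); ++i)
    {
        std::pop_heap( data.begin(), data.end() - i );      // max moves to the back
        result.push_back( data[ data.size() - 1 - i ] );    // collect it, largest first
    }
    return result;
}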

However, as can be seen from the graph below, this new algorithm never appears to be better than using a heap for an insertion sort. It only becomes competitive well after using nth_element would have beaten both heap algorithms.

Heap Performance

The performance is quite reminiscent of the insertion sort, except that the performance does not degrade so much for larger M. Notice the dipping first value is still present. I.e., if M is equal to N then the algorithm is just doing a large qsort, no insertions. And yet, the line for "sort all" is actually above these lower first values. I wonder why that is.

The performance does still get worse than the nth_element case. But instead of being about 100x as slow (or more) the Heap implementation seems to max out at about ten times as slow. (Below is a graph of all the implementations.)

Theoretical Results

Here are the complexities of the various algorithms:
  Algorithm                              Complexity
  Sort N and Pick M (QSort)              O(N*log2N)
  Scan N and Pick M (Insertion sort)     O(M*log2M + (N-M)*M)
  nth_element                            O(N)
  Heap insertion sort                    O(M + (N-M) + log2(M)*ln(N/M))
  Heap All and Pop                       O(N + M*log2(N))

Here is a graph of the theoretical complexity of the algorithms.

The theoretical analysis of the complexity of the algorithms indicates that the Heap insertion sort is the fastest for small M, but that it does grow without bound in M. While it may always be faster for any N at small values of M, there will be values of M large enough to justify using nth_element instead.


Actual Results

For low M, the Heap and Insertion sort methods have the same performance. This makes sense, since the benefit of using a heap over a straight sorted list is log2(M) versus M when inserting and M versus M*log2(M) for the initial sort. As M approaches one these approach equivalence.

Bottom line: the Heap Insertion Sort is always at least as good as the regular insertion sort, but can still be beaten by nth_element for larger values of M.


Creative Commons License Finding the Highest M of N by Mark Santesson is licensed under a Creative Commons Attribution 3.0 Unported License.

2012-01-08

When to make a Destructor Virtual


When do I make my destructors virtual?

I've interviewed several people for C++ positions and one question we always ask is what a virtual destructor is and when to use one. For some reason, a surprising number of people say that you should make your destructor virtual whenever you have anything else in the class that is virtual. That is not the correct answer. (Although it is suggested by Bjarne Stroustrup in the ARM (page 278), and I suppose that is a pretty good basis for the claim...) There may be a correlation between the two, but there is absolutely no causative relationship.

There are three conditions that must all be fulfilled(*) before a destructor needs to be made virtual.

  1. The class might be derived from.
  2. A pointer to this type might be used to destroy instances of a derived class.
  3. Either of (*):
    • Someone might define custom delete functions (that might be different for the base and derived classes, or might be sensitive to the object's real size).
    • Some derived classes might need their destructors called.


(*) Unless you want to follow The Standard... (which, you know, maybe you should...)

Technically, you should ignore that last one. The standard says:

5.3.5 Delete
[...]
3 In the first alternative (delete object), if the static type of the operand is different from its dynamic type, the static type shall be a base class of the operand's dynamic type and the static type shall have a virtual destructor or the behavior is undefined. In the second alternative (delete array) if the dynamic type of the object to be deleted differs from its static type, the behavior is undefined.

(Emphasis mine.)

Which means: if you're going to delete an object of class B through a pointer to class A (where B is derived from A), then A had better have a destructor that is virtual. A doesn't necessarily need to declare the destructor virtual itself; it is enough that one of A's base classes does. But if none of A's base classes declares the destructor virtual, then A's destructor does need to be declared virtual. And if you intend to rely on a higher base class declaring it virtual, you should probably just declare it so as well... in case someone does something utterly foolish like making it no longer virtual.
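A minimal illustration of the rule (the class names and members here are mine, not from any of the references below):

#include <cstdio>

class A
{
public:
    virtual ~A() { printf( "~A\n" ); }      // virtual: deleting through A* is safe
};

class B : public A
{
public:
    B() : m_data( new int[16] ) {}
    ~B() { delete [] m_data; printf( "~B\n" ); }   // must run, or we leak
private:
    int * m_data;
};

int main()
{
    A * p = new B;
    delete p;       // prints "~B" then "~A"; without the virtual, undefined behavior
    return 0;
}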

If that doesn't convince, see what some learned elders have to say on it:

  • Scott Meyers, Effective C++ Second Edition (Addison Wesley, 1997) 60-62. Item 14 discusses the problem of virtual destructors in detail.
  • Bjarne Stroustrup, The C++ Programming Language Third Edition (Addison Wesley, 1997) 319. Section 12.4.2 discusses abstract classes and mentions rather offhandedly: "Many classes require some form of cleanup for an object before it goes away. Since the abstract class [the base class] cannot know if a derived class requires such cleanup, it must assume that it does require some. We ensure proper cleanup by defining a virtual destructor [base class]::~[base class]() in the base and overriding it suitably in the derived classes."
  • Margaret A. Ellis and Bjarne Stroustrup, The Annotated C++ Reference Manual, (Addison Wesley, 1990) 277-278. (Also known as "The ARM".) The commentary at the end of the page rather tepidly states "It is often a good idea to declare a destructor virtual".
  • Marshall Cline, C++ FAQ (http://www.parashift.com/c++-faq-lite/virtual-functions.html#faq-20.7). A good description of the problem and the rule.

However, if you feel like living dangerously, or even if you don't but you really need that extra performance, then you might be able to get away with it.

Now, normally I'm really a fairly honest person. Really! And I'm a big advocate of not lying to the compiler. (Generally, it takes its revenge.) But sometimes lying is just communicating something in a different way. In this case, I'm saying to the compiler: "I know you really want to dot your I's and cross your T's, but believe me, in this case you really don't need to. Just call the destructor of the type of the pointer."

That said, you do still have to be really careful. There are a few reasons why you really would need to have the derived class's destructor called:

  • The derived class has data members (such as raw pointers), or other logical ownerships such as open log files, that must be cleaned up.
  • You have data member items in the derived class and they need their destructors to be called. (Such as an auto_ptr.)
  • You have explicitly defined the derived class's destructor code and need it to be called.
  • A derived class, or one of its derived classes, might provide a custom delete operator. If so, the size passed to the delete function is determined by the dynamic type of the class if the destructor is virtual, but by the static type of the pointer if the destructor is not virtual.
  • A derived class might be a different size, and you may be deleting an array of them.

If all of those reasons are absent, then you might be able to get away with not having your class use virtual destructors.

When can I not delete through a pointer to the base class?

Related to the discussion of when a destructor needs to be virtual is the question of when an array of a derived class can be deleted through a pointer to a base class. If the sizes are the same (and assuming you don't have to worry about the destructor not being virtual), then the answer is whenever you like. If they aren't the same, then never.

To see why, imagine that you have classes A and B, where B derives from A, and both hold an integer. A always sets it to 1, and B sets it to 2. Something like (source here):


 class A {
 public:
  int nDataA;
  A() : nDataA(1) {}
 };

 class B : public A {
 public:
  int nDataB;
  B() : nDataB(2) {}
 };
If I have it create an array of three Bs and delete them as Bs, then (after instrumenting the classes) I see:

 Creating B[3] dynamically, deleting through B*
 Constructing at 0xd915c
 Constructing at 0xd9168
 Constructing at 0xd9174
 Dest of 0xd9174. I am now B :(size is 12 bytes)  B83F0400 01000000 02000000
 Dest of 0xd9174. I am now A :(size is 8 bytes)  08400400 01000000
 Dest of 0xd9168. I am now B :(size is 12 bytes)  B83F0400 01000000 02000000
 Dest of 0xd9168. I am now A :(size is 8 bytes)  08400400 01000000
 Dest of 0xd915c. I am now B :(size is 12 bytes)  B83F0400 01000000 02000000
 Dest of 0xd915c. I am now A :(size is 8 bytes)  08400400 01000000
If I do the same, but delete it through an A pointer, then I see:

 Creating B[3] dynamically, deleting through A*
 Constructing at 0xd915c
 Constructing at 0xd9168
 Constructing at 0xd9174
 Dest of 0xd916c. I am now A :(size is 8 bytes)  08400400 02000000
 Dest of 0xd9164. I am now A :(size is 8 bytes)  08400400 B83F0400
 Dest of 0xd915c. I am now A :(size is 8 bytes)  08400400 01000000

(These results were obtained with the DJGPP compiler, which uses gcc.)

Notice how it deletes backwards; it starts at 0xd916c and goes down to 0xd915c. Also notice that those aren't the addresses that were constructed, unlike the first case where we destroyed through a B pointer. Notice what happens when you call it through B's destructor: it starts with the vptr at 00043FB8 and calls ~B, then sets it to 00044008 for when it calls ~A. This is as expected, of course, since during construction and destruction virtual functions resolve to the class currently being constructed/destructed, not the class that the object will eventually be.

Now look at the sequence of destructor calls when deletion is done through the A* pointer. The first object destroyed is destroyed through a pointer that is actually pointing to the "1" of the second item. Before the ~A code is called, the vptr for A's vtable is written on top of that "1". The same goes for the second item destroyed, which starts out pointing to the "2" of the first item.

Clearly this is bad. Unfortunately, even making the destructor of A virtual will not solve this problem. Theoretically, if the class being pointed to has a vtable it would be possible to encode the sizeof in the table and use that when iterating through the array. Or, the compiler could store the sizeof in the table, as it stores the number of entries in the array. Gcc doesn't do this, for better or worse. The bottom line is that if you want to destroy an array of Bs with a pointer to As, then sizeof(B) must equal sizeof(A), in addition to the other rules stated above, regardless of whether or not the destructor is virtual.

Hacky "Fixes"

You could ensure that class A is not ever deleted in array form by overloading that operator and throwing an exception, but then how do you know that you aren't just deleting a bunch of "A"s, which is fine? (If you overload the delete array operator for "B" (in order to throw), then it will only get called in the instance where it is being deleted properly, which is actually worse than doing nothing.)

You could implement a custom destructor for arrays of A, and test in the destructor if it is seeing an object of class B in order to handle it specially. But you would still have the wrong pointer to the class... you'd have to fix that up with pointer math after you figure out what your base is. This is so terrible that I refuse to think about it anymore.

Why not always make it virtual

There is a performance penalty to making and calling a virtual destructor. Making a destructor virtual will force the class to store a pointer (called a vptr) to the class's virtual table. If there are no other virtual functions in the class, then this is a little bit of extra overhead (whatever the pointer size is: usually 32 bits, but sometimes 64 bits or even something else). This is probably where the typical erroneous answer comes from. Presumably people think that if you are already paying the space cost, then you might as well make the destructor virtual.

There is also a performance penalty to calling a virtual function, including a virtual destructor. The vptr must be dereferenced and the resulting function called, which can cause memory cache misses, etc.

Use at Your Own Risk

There you go. Be careful. Or just play it safe and follow the standard. Or did I miss something?


Creative Commons License
When to make a Destructor Virtual by Mark Santesson is licensed under a Creative Commons Attribution 3.0 Unported License.