2015-02-16

Performance of Membership Tests


Some people think that performance in Python is inscrutable. But surely there is one thing we can all agree on... membership tests should be done on something hashed, like dictionaries or sets, not on a list.

Metrics: M values between 0 and N-1 are placed in a collection. How long does it take to test membership for each of the N possibilities in turn? The reported values are the total search time divided by the number of searches, so each is an average time per single lookup.
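As a concrete illustration of the metric, the measurement might be taken something like this (a minimal sketch only; the collection sizes and the use of time.perf_counter are my choices, not necessarily what produced the graphs):

    import time

    def avg_lookup_time(collection, n):
        # Test for each of the N possibilities in turn; return the
        # total search time divided by the number of searches.
        start = time.perf_counter()
        for value in range(n):
            _ = value in collection  # the membership test being measured
        return (time.perf_counter() - start) / n

    N = 10000
    M = 1000  # M values between 0 and N-1
    values = range(0, N, N // M)
    for factory in (list, set, dict.fromkeys):
        coll = factory(values)
        print(type(coll).__name__, avg_lookup_time(coll, N))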



Analysis

The list is, unsurprisingly, a couple of orders of magnitude worse than the set and dict. Note that the amount of time it takes to look up a value in a list is linear in the number of elements. (The value of each successive dot increases exponentially, so a constant distance between dots on a log-scaled graph implies linear growth.) This makes intuitive sense.

The time it takes to look up a value in a set is suspiciously identical to the time for a dict. In fact, they use the same algorithm and share some code: you can think of a set as a dictionary with an unused value.

Digression - Python implementation of dict

In Python a dict is a hash map. (Maps don't need to be hash-based; they could be a binary tree, as in C++'s std::map.) Basically, the key is turned into an integer value by way of the hash function, __hash__. The hash value modulo the size of the table indicates into which slot the key/value pair is inserted. Sometimes, multiple keys will hash to the same slot and "collide". When this happens, there are two solutions: open or closed addressing. With open addressing (as in a Python dict) the next[1] slot is used. With closed addressing, each slot is a linked list of all entries which hash to that slot[2].
Clearly, an open-addressed table cannot contain more key/value pairs than it has slots, so these tables automatically resize when they get nearly full. There is a fairly severe performance degradation once the table is about three quarters full; hash tables typically resize at a given occupancy rate[3].
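A toy sketch of the open-addressing scheme just described (the linear probe and the two-thirds resize threshold are simplifications of mine; CPython's real dict is considerably more clever):

    class ToyDict:
        # Open addressing: hash modulo table size picks a slot; on a
        # collision, probe forward until the key or an empty slot is found.
        def __init__(self):
            self._slots = [None] * 8   # each slot is (key, value) or None
            self._used = 0

        def _probe(self, key, slots):
            i = hash(key) % len(slots)
            while slots[i] is not None and slots[i][0] != key:
                i = (i + 1) % len(slots)   # the "next" slot (see note 1)
            return i

        def __setitem__(self, key, value):
            i = self._probe(key, self._slots)
            if self._slots[i] is None:
                self._used += 1
            self._slots[i] = (key, value)
            if self._used * 3 >= len(self._slots) * 2:  # about 2/3 full
                self._resize()

        def __contains__(self, key):
            return self._slots[self._probe(key, self._slots)] is not None

        def _resize(self):
            bigger = [None] * (len(self._slots) * 2)
            for entry in filter(None, self._slots):
                bigger[self._probe(entry[0], bigger)] = entry
            self._slots = bigger

Deletion is omitted; a real open-addressed table needs tombstone markers so that probe chains passing through a removed entry keep working.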


Notes:

1: ... for certain values of "next". Sometimes the next slot is the previous slot's index plus one, progressing linearly; other times it progresses quadratically. It may instead be based upon the hash value (before the modulo), or take into account how many slots have already been tried. The exact algorithm doesn't matter, so long as it is repeatable and reaches an empty slot quickly.
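Sketches of those probe sequences as generators (the perturb variant is loosely modeled on CPython's, which folds the unused high bits of the hash into the sequence; treat the constants as illustrative):

    def linear_probe(h, size):
        i = h % size
        while True:
            yield i
            i = (i + 1) % size          # previous index plus one

    def quadratic_probe(h, size):
        i, tries = h % size, 0
        while True:
            yield i
            tries += 1                  # based on how many slots were tried
            i = (i + tries) % size

    def perturb_probe(h, size):
        perturb = h & (2**64 - 1)       # treat the hash as unsigned
        i = perturb % size
        while True:
            yield i
            perturb >>= 5
            i = (5 * i + perturb + 1) % size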

2: Using a linked list is simpler than scanning ahead, but it has problems of its own. There can be serious performance degradation if many keys hash to the same slot (not just the same hash value, which would trip up open addressing as well). This makes closed addressing particularly unsuitable where the keys may be chosen maliciously. Using something like a tree rather than a list would mitigate, but not eliminate, the issue. Open addressing can also be susceptible to malicious keys, although such implementations must support dynamic resizing, which helps a bit (both by complicating the attack and by automatically resolving one).
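For contrast, a closed-addressing table might look like this, with ordinary Python lists standing in for the linked lists:

    class ChainedDict:
        # Closed addressing: every slot holds a bucket of (key, value)
        # pairs whose keys all hashed to that slot.
        def __init__(self, size=8):
            self._buckets = [[] for _ in range(size)]

        def __setitem__(self, key, value):
            bucket = self._buckets[hash(key) % len(self._buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)  # overwrite an existing key
                    return
            bucket.append((key, value))

        def __contains__(self, key):
            bucket = self._buckets[hash(key) % len(self._buckets)]
            return any(k == key for k, _ in bucket)

If an attacker forces every key into one bucket, each operation degrades to a linear scan of all entries, which is exactly the degradation the note warns about.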

3: Like any resize operation, the complexity is O(n), but it amortizes to O(1) per insertion. Some applications may need better time guarantees, and it is possible to resize the hash incrementally. For instance, additions could go into a new, double-sized table, while queries check both tables. The hashing function is probably the most expensive operation, and its result can be reused; only the modulo is done twice. A query could also migrate the value it finds to the new table (but the old slot should still be marked as occupied, or else searching for collided values may fail). Every insert should be accompanied by the migration of one old value; that way, the old table is guaranteed to be completely migrated by the time the new (double-sized) table fills up.
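The incremental scheme might look roughly like this (a sketch of the bookkeeping only; Python dicts stand in for the two tables, which hides the detail that a migrated slot in a real open-addressed table must stay marked as occupied):

    class IncrementalDict:
        def __init__(self, threshold=8):
            self.old = {}
            self.new = None          # allocated when a resize begins
            self.threshold = threshold

        def __setitem__(self, key, value):
            if self.new is None:
                self.old[key] = value
                if len(self.old) >= self.threshold:
                    self.new = {}    # begin migrating to a bigger table
                return
            self.new[key] = value            # additions go to the new table...
            if self.old:                     # ...and each drags one old entry along
                k, v = self.old.popitem()
                self.new.setdefault(k, v)    # a fresher value in new wins
            if not self.old:                 # fully drained: the resize is done
                self.old, self.new = self.new, None
                self.threshold *= 2          # the new table is double-sized

        def __contains__(self, key):
            if self.new is not None and key in self.new:
                return True                  # queries check both tables
            return key in self.old

Because every insert retires one old entry, the old table is empty well before the new one fills, just as the note argues.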


Graph Analysis

On the first graph the X axis represents N, the maximum potential value in the collection. Darker colors represent more actual entries in the collection. The downward slope to the right indicates that, for a given collection size, a set/dict gets more efficient the more times it is queried. Another possibility is that the collection gets more efficient the sparser the data it contains, but this seems unlikely: why would a collection containing 10, 20, 30 be more efficient than one containing 100, 200, 300? The most inefficient query is the one where the collection is fully populated.

Sets and dicts use hashing. Hashes work by using "the hash function" to turn the contents into a numeric key, which is used to find a position in an array. In the case of a collision, the next position is checked to determine whether the key is located there. As the collection gets full, these probe chains grow longer, so it takes longer to see if a key is present.

The next graph shows something different: The shading and the X axis have been swapped, so that the X axis represents the number of items in the collection, while the shade represents the maximum value that is present.




The time to test membership in a list seems to be entirely dependent upon the length of the list. This makes sense: the collection is generally sparsely populated, so the search must usually scan the whole list before confirming that an item is not there. (Even when the item is present, the scan averages half the list, so the complexity is still O(n).)

Perhaps less obvious is that dictionaries and sets perform best, and identically, until they near full population, i.e., when a dict/set with X members contains exactly the numbers between 0 and X. Presumably this is the effect of hash slot collisions... the tight clustering of values may be causing more hash collisions.

Also, the time to find an object in a dictionary is fairly constant, but it does appear to be rising slowly as the number of entries in the dictionary is increased.



If the hash function is expensive relative to the comparison function, and if the most likely values to be searched for are located earliest in the list, then it may make sense to use a list instead of a dict. Frankly, however, I think the graph above really demonstrates how extreme the situation must be for this to be true. It was generated with 0.9 probability, so that the first item is the one requested 90% of the time; failing that, the second item is requested 90% of the remaining time, and so on. That is a pretty extreme case. Hooray for efficient dicts and sets!
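For reference, that skewed workload can be generated like so (a sketch; the 0.9 parameter matches the description, the rest is my guess at the setup, and note that with cheap integer hashing the set will likely still win; a costlier key type tips the balance toward the list):

    import random
    import time

    def skewed_queries(n_items, count, p=0.9):
        # Item 0 is requested with probability p; failing that, item 1
        # with probability p of the remainder, and so on.
        for _ in range(count):
            i = 0
            while random.random() > p and i < n_items - 1:
                i += 1
            yield i

    items = list(range(1000))       # item i sits at position i in the list
    as_set = set(items)
    queries = list(skewed_queries(len(items), 100000))

    for coll in (items, as_set):
        start = time.perf_counter()
        for q in queries:
            _ = q in coll
        print(type(coll).__name__, time.perf_counter() - start)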
