2015-02-16

Performance of Membership Tests


Some people think that performance in Python is inscrutable. But surely there is one thing we can all agree on... membership tests should be done on something hashed, like dictionaries or sets, not on a list.

Metrics: M values between 0 and N-1 are in a collection. How long does it take to test for each of N possibilities in turn? These values are the total search time divided by the number of searches, so they are an average time per single lookup.



Analysis

The list is, unsurprisingly, a couple of magnitudes worse than the set and dict. Note that the amount of time it takes to lookup a value in a list is linear with the number of elements. (The value of each increasing dot increases exponentially, so a constant distance between dots on a log graph implies linear growth.) This makes intuitive sense.

The time it takes to look up a value in a set or dict is suspiciously identical. In fact, they use the same algorithm and share some code. You can think of sets as dictionaries with an unused value.

Digression - Python implementation of dict

In Python a dict is a hash map. (Maps don't need to be hash mapped; they could be a binary tree like in C++.) Basically, the key is turned into an integer value by way of the hash function, __hash__. The hash value modulo the size of the hash indicates into which slot the key/value pair is inserted. Sometimes, multiple keys will hash to the same slot and "collide". When this happens, there are two solutions: open or closed addressing. With open addressing (such as for a Python dict) the next1 slot is used. With closed addressing, each slot is a linked list of all entries which hash to that slot2.
Clearly, open addressing cannot contain more key/value pairs than there are slots, so they will automatically resize when they get nearly full. There is a fairly severe performance degradation that happens when the hash begins to be about three quarters full; hashes typically resize at a given occupancy rate3.


Notes:

1: ... for certain values of "next". Sometimes the next slot is the previous slot index plus one and progresses linearly, other times it progresses quadratically. Other times it may be based upon the hash (before the modulo). It may take into account how many times it has tried to find a slot. The exact algorithm doesn't matter just so long as it is repeatable and gets to a blank slot quickly.

2: Using a linked list is simpler than scanning ahead, but it has problems of its own. There can be serious performance degradation if the keys hash to the same slot (not just the same hash value, which would trip up open addressing as well). This makes the closed addressing solution particularly unsuitable where the keys may be chosen maliciously. Using something like a tree rather than a list would help, but not eliminate, the issue. Open addressing could also be susceptible to malicious keys, although such implementations must support dynamic resizing, which helps a bit (both in complicating the attack and automatically resolving one).

3: Like any resize operation, the complexity is probably O(n), but amortized to O(1). Some applications may need better time guarantees. It is possible to incrementally resize the hash. For instance, any addition could be added to the new, double-sized hash, while queries could check both hash locations. The hashing function is probably the most expensive operation, and that can be reused. The modulo would be done twice. Any query could also migrate the value to the new hash. (But the slot should still be marked as occupied, or else searching for collided values may fail). Any insert should be accompanied with a migration of an old value. That way, it is guaranteed that the old hash has been completely migrated by the time the new (double-sized) hash has filled up.


Graph Analysis

On the first graph the X axis represents N, which is the max potential value in the collection. Darker colors represent more actual entries in the collection. The downwards slope to the right indicates that for a given collection size, a set/dict will get more efficient the more times it is queried. Another possibility is that the collection gets more efficient the sparser the data that it contains, but this seems unlikely. Why would a collection containing 10,20,30 be more efficient than one containing 100,200,300? The most inefficient query is the one where the collection is fully populated. Sets/Dicts use hashing. Hashes usually work by using a function called "the hash function" to turn the contents into a numeric key. This key is used to find a position in an array. In the case of collisions, the next position is checked to determine if the key is located there. When the collection gets full it takes longer to see if the key is present.

The next graph shows something different: The shading and the X axis have been swapped, so that the X axis represents the number of items in the collection, while the shade represents the maximum value that is present.




The time to test membership in lists seem to be entirely dependent upon the length of the list. This makes sense if the collection is generally sparsely populated, and it must generally search the whole list before confirming that the item is not there. (Even if it is not there, the complexity is O(n), so it is still linear.)

Perhaps less obvious is that dictionaries and sets perform best, and identically, until the time that they near full population in the collection. I.e., if the dict/set has X members, the numbers that are contained are between 0 and X. Presumably this is the effect of hash slot collision... the tight clustering of values may be causing more has collisions.

Also, the time to find an object in a dictionary is fairly constant, but it does appear to be rising slowly as the number of entries in the dictionary is increased.



If the hash function is complex relative to the comparison function, and if the most likely values to be searched for are located earliest in the list, then it may make sense to use a list instead of a dict. Frankly, however, I think the graph above really demonstrates how extreme the situation must be for this to be true. It was generated with 0.9 probability, so that it was 90% likely that the first item is the one requested. Failing that, it is 90% likely to ask for the second item, and so on. That is a pretty extreme case. Hooray for efficient dicts and sets!

See also:

2015-02-09

Setting up clamav antivirus on Ubuntu

Installing Anti-virus on Ubuntu

My wife's been using a Linux box in the kitchen as her primary web browsing computer. It also hosts my version control servers that back up everything that matters in the world. I figured that it was time I installed some anti-virus on it. Clamav seemed to be the simplest/best option.

The only hitch is that I don't get to sit at the computer much. Mostly I SSH in from the bus, but I do that infrequently. I can crontab the scan, but I really need the results pushed to me. For another program I've written a module that will send an email using a secondary gmail account, so I just needed to hook clamav up to it.

cp_email.py

cp_email.py takes command line parameters to indicate how to send the email, and then runs a command and sends the results. This was easier than setting up email on the Linux box so that it could send email natively. Cron can email the results, but I didn't want to hook that up. This way I can add wrapping code to do arbitrary post-processing (filters, summarizing). It is also more easily portable.

I was rather concerned with security. It would be foolish to include a plaintext password on the command line, as that can be seen by all processes running on the machine. The --ob will perform a trivial de-obfuscation on the password. Each character will be converted into the preceding ascii value. If the password is "cat", the obfuscated password (that should be given to cp_email) is "dbu". This obviously cannot stop a mildly determined attacker. The preferred method of specifying a password is by reference to a text file. A password preceded by an at sign ("@") is taken to be a filename. The file is loaded. If there are multiple lines in the file then the password is the taken from the last line in the file. This method can also be used to specify the username, where the first line of a multi-line file is taken. This allows the username and password to specified in the same file, which should, of course, be read protected from the world. Only the user should be able to read it. It should also be ignored by version control so that it is not available to all those who can access the source. @ and --ob can be used together for a small extra measure of security.


crontab -e

Here is my crontab entry:
0 4 * * * (rm /tmp/scan ; ((clamscan -i -l /tmp/scan -z --exclude-dir="^/(dev|cdrom|media/cdrom|sys)" -r /)) ; chmod a+r /tmp/scan ; ( cd /home/myusr/dir_with_password ; su -c "python /home/myusr/lib/cp_email.py --ob run-and-send @password @password recipient@email.com cat /tmp/scan --subject='Antivirus'" myusr))

I could have executed clamav directly from the cp_email script. However, clamav needs to run as root to be able to see all the files to scan and I didn't want root to be running a program which is held in version control and might change. If I did want to run it that way, then this would be the appropriate crontab entry:
0 4 * * * ( cd /home/usrofsvn/markets/Code/irrigate ; python ../lib/cp_email.py --ob run-and-send @password @password recipient@email.com clamscan -r / --subject="Antivirus run")

cp_email.py --help

usage: cp_email.py run-and-send [-h]
                                [--loglevel {CRITICAL,ERROR,WARNING,INFO,DEBUG}]
                                [--ob] [--subject SUBJECT]
                                username password recipients args [args ...]

positional arguments:
  username              The username to use to log into gmail. The username must

                        be an @gmail.com address. If preceded by @, then
                        the value indicates a filename. The first line of the
                        file contents will be used for the username.
  password              The password, or, if preceded by @, the filename where
                        the password is stored. If the file contains multiple
                        lines, it will take the password from the last line.
  recipients            Comma separated list of emails to receive email.
                        (Specify "-" for the sending username.)
  args                  The remaining arguments are the command to run.

optional arguments:
  -h, --help            show this help message and exit
  --loglevel {CRITICAL,ERROR,WARNING,INFO,DEBUG}
                        (default: INFO)
  --ob                  Enable elementary password obfuscation (ROT1)
  --subject SUBJECT     The subject for the email.