Two Weeks with Python
For the last few weeks I have been doing some work with Python. Most of the development work I do is in Java or Groovy. For this one project I was working on, I need a very lightweight language that I could use very a very small application that is called often from the command line. After some performance testing between a simple test between Perl and Python, I found that for the tasks I was doing Python seemed to run a bit quicker. I was quite pleased at this because I have wanted to learn Python for some time now but never had a reason to do so.
There are a lot of good things about Python. It is very dynamic and allowed me to say more in a few lines that I could in something like Java. The standard library in Python is very good. They pretty much had everything I needed to write my application without having to go download a bunch of third party dependencies like I would in Java (ie. Apache Commons Stuff).
The main issue I was having is that I needed a queue for my application that was persistable to disk. This is part of a guaranteed delivery system so I need to always keep the data written out on the disk. There are a few options in Python for persistence. All of them are based around Pickle which is Python’s object serialization.
For performance reasons I did not want to serialize the queue to disk. The queue will often have items added to it or removed from it so it will be constantly shrinking and growing. I did not want to have to write the entire file to disk every time a simple change was made to the queue.
This is about the time when I found Shelve, Python’s object persistence mechanism. It allows for key/value pair, dictionary or map style access to data in a ‘database’. Database here is a dbm or a Unix style database. I had dome some work with Berkeley DB2 before so I was familiar with the concept. This is what I wanted to use to store my data. The only problem was that it was essentially a dictionary and not a queue.
Dictionary base Queue
I set out to write an implementation of a simple queue in Python that stored all of the data into a dictionary. In my case I would be using a Shelf but it could really be backed by any dictionary. This would allow an entry to be added or removed from the queue easily without having to write the entire queue to the disk every time.
The data structure works like this:
- All keys in the dictionary are always numbers. The numbers typically start at 1 but can be any continuous range. It is important they are continuous or the algorithm will not owrk.
- The data strcture keeps track of the minimum key index which is the head of the queue and the maximum key number which is the tail of the queue.
- When the queue is created, it iterates of all keys in the dictionary to determine the min and max key numbers. This is the only call to get the keys from the underlying dictionary. Using Shelve, the keys() method is very expensive so we want to call this as little as possible.
- The dictionary is Thread safe. Locking is done around methods that need to modify the queue. This was a requirement for the code that was using this implementation.
- head – Gives the item at the head of the queue. Does not remove it.
- pop – Pops the head item off the queue, removing it.
- enqueue – Adds an item to the queue.
- size – Returns the size of the queue.
- sync – Syncs the shelve to disk. Called after each operation to modify the queue. This only works if the dictionary is a shelve.
class DictQueue: lock = threading.Lock() min_key = 1 max_key = 0 def __init__(self, dict): self.dict = dict keys = dict.keys() if len(keys) > 0: self.min_key = keys for k in keys: i = int(k) if i > self.max_key: self.max_key = i if i < self.min_key: self.min_key = i def head(self): with self.lock: try: return self.dict[str(self.min_key)] except: return None def pop(self): with self.lock: k = self.dict.pop(str(self.min_key)) self.min_key += 1 self.sync() return k def enqueue(self, value): with self.lock: self.max_key += 1 self.dict[str(self.max_key)] = value self.sync() def size(self): return self.max_key + 1 - self.min_key def sync(self): try: self.dict.sync() except: pass
s = DictQueue(shelve.open('db')) s.enqueue('a') s.enqueue('b') s.enqueue('c') print 'size:', s.size() while self.queue.size() > 0: print 'item:', s.pop()