Sunday, March 28, 2010

Fast path walking: taking a page from MPD

Uctui needs to maintain an internal listing of all of the music a user has so that it can upload this information to MusicBrainz and give you feedback on it; however, parsing large nested directories, like music directories, is a taxing operation. Especially since uctui needs to read the metadata from the beginning of each of these files.

It had become unacceptable for us to run this walk operation on each invocation of uctui. Following the lead of MPD (music player daemon), a program with a similar problem, we needed to serialize and output our internal state and load this at the beginning of each run. Enter pickle, python's serialization class.

This is was not the only issue. While we now can recover our state from our last run, what happens if the user added new files or updated others? Without proper strategy we are back where we started - reading the metadata from all the files in the music directory. To avoid this we have modified our internal state we keep on the music directory to be a dictionary of form:

path-->(mtime, mdata)

We use paths as keys (I know, inodes would probably be preferred, but for now we're not handling symlinks and I can't think of any other reason not to use paths) which allow us to do quick (this is a hash table) lookups to compare if we have seen this file before. If we have seen it before it would still be too costly to blindly read the metadata, so taking a page from MPD, we compare the mtime with the one we have stored in the tuple in our dictionary and if they differ than we have no choice but to read the metadata as this implies the file has been modified.

By using this strategy, we have achieved update times that compare favorably to MPD - initial scans have to parse every file in the tree, additional scans quickly scan the tree for changes and parse only changed/new files.