Keeping my Web Site Fresh with Wfreshen

Nowadays my web presence (such as it is) is split over several sites (Flickr, Delicious, LiveJournal, etc., as well as this site). I want my home page to include pointers to these other sites--such as the Flickr badge I described in the previous article. To do this I need to download feeds from these other sites in order to mix them into my home page. I wrote a Python program called wfreshen to do this.

This article describes how wfreshen works. You can download wfreshen.py; it needs Python 2.3 and a Yaml library (see Dependency on Yaml, below).

What it Does

The program wfreshen by default reads a list of web page addresses (URIs) and file names from wfreshen.cfg. For each URI, it downloads the resource and writes it to the file. So far this is not unusual. What it also does is keep a simple metadata database of when you last downloaded the resource, so that if there have been no changes, it can use the cached copy instead of downloading the whole thing.
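
Stripped of the metadata handling, the heart of the program amounts to something like this (a simplified sketch; processResources is my name for it, not necessarily the one in wfreshen.py):

import urllib2

def processResources(resources):
    """Download each resource and write it to its file.

    Each element of resources is a dict with uri and file keys,
    as read from wfreshen.cfg."""
    for resource in resources:
        data = urllib2.urlopen(resource['uri']).read()
        output = open(resource['file'], 'wb')
        output.write(data)
        output.close()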

In particular, it recognizes the ETag and Last-Modified headers in the downloaded resource, and squirrels them away in metadata.yml to use next time in If-None-Match and If-Modified-Since headers; the remote web server will then return a status code of 304 if the resource in question is unmodified.
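
In terms of Python 2.3's standard urllib2 module, the conditional GET goes roughly like this (a sketch, not the literal code from wfreshen.py; conditionalGet and the shape of the info dict are my own):

import urllib2

def conditionalGet(uri, info):
    """Fetch uri, or return None if it is unchanged since last time.

    The info dict is this URI's record from metadata.yml."""
    request = urllib2.Request(uri)
    if 'etag' in info:
        request.add_header('If-None-Match', info['etag'])
    if 'date' in info:
        request.add_header('If-Modified-Since', info['date'])
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, err:
        if err.code == 304:
            return None  # not modified: the cached copy is still good
        raise
    # Squirrel the validators away for next time.
    etag = response.info().getheader('ETag')
    if etag:
        info['etag'] = etag
    modified = response.info().getheader('Last-Modified')
    if modified:
        info['date'] = modified
    return response.read()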

The request also includes an Accept-Encoding: gzip header. This allows web sites to respond with a compressed version of the resource in question; wfreshen will uncompress it automatically. This, together with the support for 304 responses, is the bare minimum for a polite web robot.
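
Asking for compression and undoing it afterwards takes only a few lines (again a sketch rather than wfreshen.py's literal code):

import gzip, StringIO, urllib2

request = urllib2.Request('http://del.icio.us/rss/pdc')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
data = response.read()
# If the server took up the offer of compression, undo it.
if response.info().getheader('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()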

Finally, when writing the resource data to disc, wfreshen checks that the data is different from last time. If it is not, the file is not touched. This means that you can combine wfreshen with make to update files, confident that the processing specified by your makefile will only happen if the remote resource has been updated. This is useful when (as in the case of some of my web site updates) the processing takes some time.
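
The hash values in metadata.yml (shown below) are 40 hex digits, which suggests SHA-1 digests; assuming so, the check goes something like this (saveIfChanged is my name for it):

import os, sha

def saveIfChanged(fileName, data, info):
    """Write data to fileName only if it differs from the last download.

    The info dict is this URI's record from metadata.yml; its hash
    field holds the digest of the previously saved data."""
    digest = sha.new(data).hexdigest()
    if info.get('hash') == digest and os.path.exists(fileName):
        return False  # unchanged: leave the file and its timestamp alone
    output = open(fileName, 'wb')
    output.write(data)
    output.close()
    info['hash'] = digest
    return True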

Example Use

You must create a configuration file named wfreshen.cfg formatted like this:

- uri: http://del.icio.us/rss/pdc
  file: pdc.del.icio.us.xml
- uri: http://damiancugley.livejournal.com/data/atom
  file: pdc.livejournal.atom

The configuration follows the Yaml conventions: each resource is introduced with a hyphen (standing in for a bullet) and has two attributes labelled uri and file.

When I run it, it prints the following rather dull log of its activities:

Processing wfreshen.cfg ...
  - uri: http://del.icio.us/rss/pdc
    not read because: not modified
  - uri: http://damiancugley.livejournal.com/data/atom
    read bytes: 54175
    wrote to: pdc.livejournal.atom
Wrote metadata to metadata.yml

We can now look in the files specified to see the feed data. We can also peek at the stored metadata by examining metadata.yml:

--- 
http://damiancugley.livejournal.com/data/atom: 
  checked: 2006-02-04T11:11:27
  file: pdc.livejournal.atom
  hash: c7e96ae701ab65cf96614d5cd02f50ddf083947d
  date: Sat, 04 Feb 2006 11:11:27 GMT
http://del.icio.us/rss/pdc: 
  etag: "31313338393639363136"
  hash: da6c80ab7c67cc6395337354094cbd338fc425cc
  date: Sat, 04 Feb 2006 09:40:01 GMT
  checked: 2006-02-04T11:11:26
  file: pdc.del.icio.us.xml

The date and etag fields--if any--come from the remote web site. It turns out that LiveJournal do not use ETags. The hash field is generated by wfreshen, and will be used next time to check for changes to the downloaded data.

Dependency on Yaml

There is one dependency that is a little tricky: I use Yaml as the file format for the metadata storage and for the configuration file. This is tricky only because there is not really a standard Python implementation of Yaml yet. I use the old PyYaml, but that appears to be abandoned, and the new PyYaml is still being worked on. There is an implementation using Syck called PySyck. This still claims to be beta software, and there appear to be quibbles regarding its Unicode support (Trac ticket 37), but I don't need Unicode for wfreshen.py, so I installed PySyck 0.55.1--and it works. In order to allow for either library to be used, I added this code to the front of wfreshen.py:

# Use PySyck if it is installed; otherwise fall back to the old PyYaml.
try:
    import syck
except ImportError:
    syck = None
    import yaml

def yamlLoadFile(fileName):
    """Load a Yaml file containing a single document."""
    if syck:
        input = open(fileName, 'r')
        try:
            result = syck.load(input)
        finally:
            input.close()
        return result
    else:
        return yaml.loadFile(fileName).next()

def yamlDumpToFile(obj, fileName):
    """Write a Yaml file containing this object."""
    output = open(fileName, 'w')
    try:
        if syck:
            syck.dump(obj, output)
        else:
            yaml.dumpToFile(output, obj)
    finally:
        output.close()
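
With those wrappers in place, the rest of the program can read and write its Yaml files without caring which library is underneath:

config = yamlLoadFile('wfreshen.cfg')
metadata = yamlLoadFile('metadata.yml')
# ... fetch the resources, updating metadata as we go ...
yamlDumpToFile(metadata, 'metadata.yml')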

Updating my web site

I have a script called updateAlleged that goes like this:

#!/bin/sh

cd /Users/pdc/blah
wfreshen.py || exit 1

cd blahblah
make install

At the moment I run it by hand--because my laptop is not always switched on, it can't usefully be run from cron. I need to research launchd to see whether it solves the can't-run-if-switched-off problem.

Future Work

I have not yet done the work required to make wfreshen a complete Unix command: it needs --help and --version command-line options, it should be possible to specify URIs on the command line as well as in the configuration file, and so on. Also, it could do with an installer--even if it is a trivial one.
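
The option handling, at least, should be easy enough with the standard optparse module; here is a sketch of what I have in mind (none of this is in wfreshen.py yet, and the version number is made up):

from optparse import OptionParser

def main():
    parser = OptionParser(usage='%prog [options] [uri ...]',
                          version='%prog 0.1')
    parser.add_option('-c', '--config', default='wfreshen.cfg',
                      help='file listing the URIs and file names')
    # OptionParser supplies --help for free, and --version because we
    # passed a version string. URIs named on the command line would be
    # fetched as well as (or instead of) those in the configuration file.
    options, uris = parser.parse_args()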

I should also consider whether I need to add support for redirections, and read through Mark Pilgrim's Web Services chapter to see what I have missed.

Update (12 Feb 2006). Joe Gregorio has written an article, Doing HTTP Caching Right: Introducing httplib2, which rather steals my thunder by covering more of HTTP's caching conventions than I do here. At some future date I may change wfreshen to exploit httplib2 rather than doing the work itself.