Nowadays my web presence (such as it is) is split over several sites
(Flickr, Delicious, LiveJournal, etc., as well as
this site). I want my home page to include pointers to these other sites--such as
the Flickr badge I described in the previous article.
To do this I need to download feeds from these other sites in order to
mix them into my home page. I wrote a Python program called wfreshen
to do this. This article describes how wfreshen works. You can
download wfreshen.py; it needs Python 2.3 and Yaml.
What it Does
The program wfreshen by default reads a list of web page addresses
(URIs) and file names from wfreshen.cfg. For each URI, it downloads
the resource and writes it to the file. So far this is not unusual.
What it also does is keep a simple metadata database recording when
you last downloaded each resource, so that if there have been no
changes, it can use the cached copy instead of downloading the whole
thing again.
In particular, it recognizes the ETag and Last-Modified headers in the
downloaded resource, and squirrels them away in metadata.yml to use
next time in If-None-Match and If-Modified-Since headers; the remote
web server will then return a status code of 304 (Not Modified) if the
resource in question is unmodified.
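In outline, the conditional request looks like this (a simplified
sketch, not the exact code in wfreshen.py, and the function name is
invented):

    import urllib2

    def conditionalGet(uri, etag=None, lastModified=None):
        """Fetch uri, or return None if it is unchanged since last time."""
        request = urllib2.Request(uri)
        if etag:
            request.add_header('If-None-Match', etag)
        if lastModified:
            request.add_header('If-Modified-Since', lastModified)
        try:
            return urllib2.urlopen(request)
        except urllib2.HTTPError, err:
            if err.code == 304:
                return None  # not modified: the cached copy is still good
            raise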
The request also includes an Accept-Encoding: gzip header. This allows
web sites to respond with a compressed version of the resource in
question; wfreshen will uncompress it automatically. This, together
with the support for 304 responses, is the bare minimum for a polite
web robot.
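Undoing the gzip content-coding takes only a few lines with the
standard gzip and StringIO modules. Again, a simplified sketch rather
than the program's exact code:

    import gzip
    from StringIO import StringIO

    # The request side just needs:
    #     request.add_header('Accept-Encoding', 'gzip')

    def readBody(response):
        """Read a response body, undoing the gzip content-coding if present."""
        data = response.read()
        if response.info().get('Content-Encoding') == 'gzip':
            data = gzip.GzipFile(fileobj=StringIO(data)).read()
        return data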
Finally, when writing the resource data to disc, wfreshen checks that
the data is different from last time. If it is not, the file is not
touched. This means that you can combine wfreshen with make to update
files, confident that the processing specified by your makefile will
only happen if the remote resource has been updated. This is useful
when (as in the case of some of my web site updates) the processing
takes some time.
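The comparison is done by hashing the downloaded bytes and comparing
with the hash recorded in the metadata (the hash field in
metadata.yml, shown later, is a 40-hexadecimal-digit digest, which
looks like SHA-1). In outline, with invented names:

    import sha  # Python 2.3's SHA-1 module (hashlib came later)

    def writeIfChanged(fileName, data, oldHash):
        """Write data to fileName only if it differs from last time.

        Leaving an unchanged file untouched means make sees an old
        timestamp and skips any processing that depends on it.
        """
        newHash = sha.new(data).hexdigest()
        if newHash != oldHash:
            output = open(fileName, 'wb')
            try:
                output.write(data)
            finally:
                output.close()
        return newHash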
Example Use
You must create a configuration file named wfreshen.cfg formatted
like this:
- uri: http://del.icio.us/rss/pdc
  file: pdc.del.icio.us.xml
- uri: http://damiancugley.livejournal.com/data/atom
  file: pdc.livejournal.atom
The configuration follows the Yaml conventions: each resource is
introduced with a hyphen (standing in for a bullet) and has two
attributes labelled uri and file.
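Either Yaml library parses this into a plain list of dictionaries,
which is all the structure the program needs; loaded through the
yamlLoadFile wrapper described below, the example above comes out
roughly as:

    >>> yamlLoadFile('wfreshen.cfg')
    [{'uri': 'http://del.icio.us/rss/pdc', 'file': 'pdc.del.icio.us.xml'},
     {'uri': 'http://damiancugley.livejournal.com/data/atom',
      'file': 'pdc.livejournal.atom'}]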
When I run it, it prints the following rather dull log of its activities:
Processing wfreshen.cfg ...
- uri: http://del.icio.us/rss/pdc
  not read because: not modified
- uri: http://damiancugley.livejournal.com/data/atom
  read bytes: 54175
  wrote to: pdc.livejournal.atom
Wrote metadata to metadata.yml
We can now look in the files specified to see the feed data. We can
also peek at the stored metadata by examining metadata.yml:
---
http://damiancugley.livejournal.com/data/atom:
    checked: 2006-02-04T11:11:27
    file: pdc.livejournal.atom
    hash: c7e96ae701ab65cf96614d5cd02f50ddf083947d
    date: Sat, 04 Feb 2006 11:11:27 GMT
http://del.icio.us/rss/pdc:
    etag: "31313338393639363136"
    hash: da6c80ab7c67cc6395337354094cbd338fc425cc
    date: Sat, 04 Feb 2006 09:40:01 GMT
    checked: 2006-02-04T11:11:26
    file: pdc.del.icio.us.xml
The date and etag fields--if any--come from the remote web site. It
turns out that LiveJournal do not use etags. The hash field is
generated by wfreshen, and will be used next time to check for changes
to the downloaded data.
Dependency on Yaml
There is one dependency that is a little tricky: I use Yaml as the
file format for the metadata storage and for the configuration file.
This is tricky only because there is not really a standard Python
implementation of Yaml yet. I use the old PyYaml, but that
appears to be abandoned, and the new PyYaml is still being worked
on. There is an implementation using Syck called PySyck. This
still claims to be beta software, and there appear to be quibbles
regarding its Unicode support (Trac ticket 37), but I don't need
Unicode for wfreshen.py, so I installed PySyck 0.55.1--and it works.
In order to allow for either library to be used, I added this code to
the front of wfreshen.py:
try:
    import syck
except ImportError:
    syck = None
    import yaml

def yamlLoadFile(fileName):
    """Load a Yaml file containing a single document."""
    if syck:
        input = open(fileName, 'r')
        try:
            result = syck.load(input)
        finally:
            input.close()
        return result
    else:
        return yaml.loadFile(fileName).next()

def yamlDumpToFile(obj, fileName):
    """Write a Yaml file containing this object."""
    output = open(fileName, 'w')
    try:
        if syck:
            syck.dump(obj, output)
        else:
            yaml.dumpToFile(output, obj)
    finally:
        output.close()
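The rest of the program can then stay oblivious to which library is
installed. For instance (an illustrative fragment, not taken from the
real program):

    resources = yamlLoadFile('wfreshen.cfg')  # the list of {uri, file} mappings
    for resource in resources:
        print 'Will fetch', resource['uri'], 'into', resource['file']
    yamlDumpToFile({}, 'metadata.yml')  # write out an (empty) metadata database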
Updating my web site
I have a script called updateAlleged
that goes like this:
#!/bin/sh
cd /Users/pdc/blah
wfreshen.py || exit 1
cd blahblah
make install
At the moment I run it by hand--because my laptop is not always switched
on, it can't be run from cron
in any useful fashion. I need to
research launchd so I can see whether it solves the
can't-run-if-switched-off problem.
Future Work
I have not yet done the work required to make wfreshen
a complete Unix
command: it needs --help
and --version
command-line options, and it
should be possible to specify URIs on the command line as well as in the
configuration file, and so on. Also, it could do with an installer--even
if it will be a trivial one.
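If I do add those options, the parsing might look something like the
following sketch (entirely hypothetical; optparse has been in the
standard library since Python 2.3, so it costs nothing):

    from optparse import OptionParser

    def parseCommandLine(argv):
        """A hypothetical command-line interface for wfreshen."""
        parser = OptionParser(usage='usage: %prog [options] [uri ...]',
                              version='%prog 0.1')
        parser.add_option('-c', '--config', dest='config',
                          default='wfreshen.cfg',
                          help='file listing the resources to fetch')
        return parser.parse_args(argv[1:])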
I should also consider whether I need to add support for redirections,
and read through Mark Pilgrim's Web Services chapter to see what I
have missed.
Update (12 Feb 2006). Joe Gregorio has written an article Doing HTTP
Caching Right: Introducing httplib2 which rather steals my thunder by
covering more of HTTP's caching conventions than I do here. At some future date
I may change wfreshen
to exploit httplib2
rather than doing the work itself.