Ages ago Matthew Somerville emailed me to say that spliticket had fallen over. It's my hacky interface to his wiki page documenting split tickets, and ultimately it found the vagaries of even wiki-generated HTML a bit too hard to cope with.
At the time I built the HTML parser using core SAX-based HTML parsing, and it was horrible. SAX works in a basic sense, but you have to build your own internal state engine, track which elements have gone past while working out what to do with the current context, and even write rules for what to do when the underlying dumb parser encounters HTML entities: no mean feat when the document is peppered with &ndash; en dashes.
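To give a flavour of that bookkeeping, here is a minimal sketch of an event-driven parser that pulls cell text out of table rows. It is not the original spliticket code, which isn't shown here: it assumes Python's standard-library html.parser as a stand-in for the SAX-style parser, and the names CellCollector and parse_rows are made up for illustration.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, grouped into one list per <tr>."""

    def __init__(self):
        # convert_charrefs=False makes the parser report entities as
        # separate events, so we have to handle them ourselves.
        super().__init__(convert_charrefs=False)
        self.rows = []
        self.in_cell = False  # hand-rolled state: are we inside a <td>?
        self.buffer = []      # text fragments seen in the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.in_cell = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.in_cell = False
            if not self.rows:  # tolerate a <td> with no enclosing <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self.buffer).strip())

    def handle_data(self, data):
        if self.in_cell:
            self.buffer.append(data)

    def handle_entityref(self, name):
        # Named entities such as &ndash; arrive here, not in handle_data.
        if self.in_cell:
            codepoint = name2codepoint.get(name)
            if codepoint is None:
                self.buffer.append("&" + name + ";")  # unknown: keep as-is
            else:
                self.buffer.append(chr(codepoint))

    def handle_charref(self, name):
        # Numeric references such as &#163; or &#xA3; arrive here.
        if self.in_cell:
            if name.lower().startswith("x"):
                self.buffer.append(chr(int(name[1:], 16)))
            else:
                self.buffer.append(chr(int(name)))


def parse_rows(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Even for something this small there is a flag, a buffer and two entity handlers to keep in step, and every new rule means touching several callbacks at once.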
Not only was writing the rules initially a pain in the rear, but adding new rules and bugfixing the existing ones was even worse. But I lived with SAX, because I was deploying on shared hosting: I presumed that this was the best option available if I couldn't install any new shared libraries.
Not true! I've just rebuilt the entire parsing layer with Beautiful Soup, a Python HTML/XML parser library which (a) is available as a single file and (b) works out a decent HTML DOM tree from pretty much anything you throw at it.
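For comparison, a rough sketch of the same kind of extraction with Beautiful Soup. Assumptions: it uses the current bs4 package (installed as beautifulsoup4) and not the single-file release mentioned above, the function name rows_from_soup is hypothetical, and the table layout is invented for illustration, not taken from the real wiki page.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def rows_from_soup(html):
    # "html.parser" selects Python's built-in parser as the backend,
    # so no extra shared libraries are needed.
    soup = BeautifulSoup(html, "html.parser")
    return [
        [td.get_text().strip() for td in tr.find_all("td")]
        for tr in soup.find_all("tr")
    ]
```

There is no state to track and no entity handling to write: the library builds the tree and decodes the entities, and the code just asks for the rows and cells it wants.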
Try it yourself, if you have to do any HTML parsing. It's astonishing; beautiful, in fact. I will never write another SAX parser ever again, which I'm sure I've said before.
Comments
jps (not verified)
Mon, 08/09/2008 - 08:49
Permalink
(I should add, for anyone thinking: "well, duh!" that I did know about teh soup before now; I just saw the non-monolithic install options and presumed it wouldn't be installable on shared hosting.)