Spliticket running again with BeautifulSoup

Sunday, September 7, 2008 - 18:10

Ages ago Matthew Somerville emailed me to say that spliticket had fallen over. It's my hacky interface to his wiki page documenting split tickets, and ultimately it found the vagaries of even wiki-generated HTML a bit too hard to cope with.

At the time I built the HTML parser using core SAX-based HTML parsing, and it was horrible. SAX works in a basic sense, but you have to build your own internal state engine, track which elements have gone past while working out what to do with the current context, and even write rules for what to do when the underlying dumb parser encounters HTML entities: no mean feat when the document is peppered with – en dashes.

Not only was writing the rules initially a pain in the rear, but adding new rules and bugfixing the existing ones was even worse. But I lived with SAX, because I was deploying on shared hosting: I presumed that this was the best option available if I couldn't install any new shared libraries.

Not true! I've just rebuilt the entire parsing layer with Beautiful Soup, a Python HTML/XML parser library which (a) is available as a single file and (b) works out a decent HTML DOM tree from pretty much anything you throw at it.

Try it yourself, if you have to do any HTML parsing.It's astonishing; beautiful, in fact. I will never write another SAX parser ever again, which I'm sure I've said before.

Comments

jps (not verified)

Mon, 08/09/2008 - 08:49

Permalink

(I should add, for anyone thinking: "well, duh!" that I did know about teh soup before now; I just saw the non-monolithic install options and presumed it wouldn't be installable on shared hosting.)

I'm a Drupal Association member!

My individual membership of the Drupal Association

Drupal 8 API tutorials

Want to learn about the Drupal 8 APIs, with worked examples? Follow my series of tutorials, covering routing, caching, entities, config and much more!

Want to hire me?

I'm currently

fully booked

Blog category:

formats, hacking, import/export, projects

Recent blogposts

Altering the length of a Drupal 8 text field that contains data

Friday, July 21, 2017 - 11:31
A menagerie of testing: behavioural, unit, system, smoke, regression, oh my!

Friday, June 2, 2017 - 10:11
Including Javascript in Behat tests, all inside a headless, virtual machine

Tuesday, May 30, 2017 - 16:51

All blogposts

About me

I'm J-P Stacey, and I'm a freelance technical developer and software architect, working with Drupal, Javascript, Symfony, PHP and devops, with experience in project and process management and an emphasis on usability.

I live in the UK; my website is self-hosted on bigv.io; my email is hosted by Google, and that's also what I use to share files. (More info|What is this?)

Secondary menu

Spliticket running again with BeautifulSoup

Comments

(I should add, for anyone

I'm a Drupal Association member!

Drupal 8 API tutorials

Want to hire me?

Blog category:

Tags:

Recent blogposts

Altering the length of a Drupal 8 text field that contains data

A menagerie of testing: behavioural, unit, system, smoke, regression, oh my!

Including Javascript in Behat tests, all inside a headless, virtual machine

About me

Find me elsewhere