Exporting Old use.perl.org Blog Entries

Dec 12, 2010 / By Yanick Champoux

This weekend I finally got around to importing all my old use.perl.org blog entries to Fearful Symmetry. To ease the migration, I ended up writing two itsy-bitsy scripts. They’re nothing fancy, but in case they might help someone, here they are.

Harvest the entries

This was easy. For each account, use.perl.org has a journal entries listing page. So the whole operation consisted of grabbing that webpage and mirroring every link on it that looks like a journal entry. Not terribly sophisticated, but for this specific job it’s all we need.

Of the script itself, the most interesting part is LWP::Simple::getstore(). Most people know and use LWP::Simple::get(), but more than a few forget its sibling, which saves the retrieved webpage directly to a file. That makes it perfect for harvesting activities like this one.
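getstore() also returns the HTTP status code of the request, which makes it easy to notice a failed download. A minimal sketch, assuming a hypothetical helper named fetch_to_file() (the helper name is not part of LWP::Simple; getstore() and is_success() are):

```perl
use strict;
use warnings;

use LWP::Simple;    # exports getstore() and is_success() by default

# Hypothetical helper: save $url to $file and report whether it worked.
sub fetch_to_file {
    my ( $url, $file ) = @_;

    # getstore() returns the HTTP status code of the response
    my $status = getstore( $url, $file );

    warn "could not fetch $url (status $status)\n"
      unless is_success($status);

    return is_success($status);
}

# e.g. fetch_to_file( 'http://use.perl.org/~Yanick/journal/38951', '38951' );
```

If the pages might be fetched more than once, LWP::Simple::mirror() takes the same arguments and skips the download when the local copy is still current.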

#!/usr/bin/perl

use 5.10.0;

use strict;
use warnings;

use LWP::Simple;

my $uid      = '3196';
my $username = 'Yanick';

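my $main = get( 'http://use.perl.org/journal.pl?op=list&uid=' . $uid )
  or die "couldn't fetch the journal listing\n";

# scalar-context //g walks through the matches one at a time;
# capturing via a list assignment would grab them all at once and loop forever
while ( $main =~ m#//use\.perl\.org/~\Q$username\E/journal/(\d+)#g ) {
    my $entry_id = $1;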

    say "retrieving $entry_id...";

    getstore( "http://use.perl.org/~$username/journal/$entry_id", $entry_id );

    sleep 1;    # let's be nice to the server, shall we?
}
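
The link-matching step can also be pulled out into a small helper that is easy to try without touching the network. This is only a sketch: extract_entry_ids() is a hypothetical name, the sample HTML is made up for illustration (including the second entry id), and the de-duplication assumes the listing page may link to the same entry more than once.

```perl
use strict;
use warnings;

# Return the unique journal entry ids linked from a listing page,
# in the order they first appear.
sub extract_entry_ids {
    my ( $html, $username ) = @_;
    my ( @ids, %seen );

    # In scalar context, //g resumes where the previous match ended,
    # so each pass of the loop sees the next link.
    while ( $html =~ m{//use\.perl\.org/~\Q$username\E/journal/(\d+)}g ) {
        push @ids, $1 unless $seen{$1}++;
    }

    return @ids;
}

# Hypothetical listing fragment, for illustration only.
my $sample = <<'HTML';
<a href="//use.perl.org/~Yanick/journal/38951">Breaking off</a>
<a href="//use.perl.org/~Yanick/journal/38951">Read more</a>
<a href="//use.perl.org/~Yanick/journal/38900">An older entry</a>
HTML

# prints 38951, then 38900
print "$_\n" for extract_entry_ids( $sample, 'Yanick' );
```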

Extract the information off the harvested pages

As one might suspect, the harvested use.perl.org pages contain a little bit more than the raw blog entries. Getting to the information we want (the blog entry’s title, creation date, body, etc.) is not hard, but it’s a little onerous to do by hand.

There are a lot of ways to extract information from a webpage, from quick and dirty regular expressions (like I did for the script above) to full-fledged DOM parsing using, say, HTML::Tree. As I’m playing a lot with jQuery these days, I wondered if there was anything Perlish available offering the same type of interface. Guess what? There is: pQuery.

After playing with it a little bit, I’d say that pQuery is not quite as slick and ready for prime-time as its JavaScript forebear. But again, for this small task, it allowed me to do the job.

The resulting script is as straightforward as they come. I used Firebug to find out which HTML elements I wanted, tested the resulting selectors with jQuery and, once I was happy with them, adapted them to pQuery.

#!/usr/bin/perl

use 5.10.0;

use strict;
use warnings;

use pQuery;
use utf8;    # the utf8::encode() call below only matters if the output must be valid UTF-8 (for XML, say)

$/ = undef;  # it's slurping time

my $p = pQuery(<>);

say "title: ", $p->find('.title h3')->get(1)->innerHTML;

my ( $month, $day, $year ) =
  $p->find('.journaldate')->html() =~ /(\w{3})\w* 0?(\d+), (\d{4})$/;
say "date: ", "$day $month $year";

say "original url: http:"
  . $p->find('.h-inline a')->get(0)->getAttribute('href');

say "\n";

utf8::encode( my $entry = $p->find('.intro')->get(0)->innerHTML );

say $entry;
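
The date regular expression is the densest line of the script, so here it is on its own. This is a sketch with a hypothetical helper name, short_date(), and it assumes the date text looks like "Sunday May 10, 2009"; the real markup of the harvested pages may differ.

```perl
use strict;
use warnings;

# Turn a long-form date into "day Mon year".
# Returns nothing if the text doesn't end with a recognizable date.
sub short_date {
    my ($text) = @_;

    # (\w{3})\w*  keeps the first three letters of the month name
    # 0?(\d+)     captures the day, dropping a leading zero
    # (\d{4})$    captures the year at the end of the string
    my ( $month, $day, $year ) =
      $text =~ /(\w{3})\w* 0?(\d+), (\d{4})$/
      or return;

    return "$day $month $year";
}

print short_date('Sunday May 10, 2009'), "\n";          # prints "10 May 2009"
print short_date('Friday September 04, 2009'), "\n";    # prints "4 Sep 2009"
```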

It’s harvesting time

With those two scripts ready to go, the harvesting process becomes much less of a chore:

$ perl files/harvest_entries.pl
retrieving 38951...
retrieving 38951...

$ perl files/extract_entry.pl 38951
title: Breaking off from the use.perl.org mothership
date: 10 May 2009
original url: http://use.perl.org/~Yanick/journal/38951

<p>
For the last couple of months, as a concession between
visibility and control, I'd been double-posting my blog
entries both here and on my
personal blog.
But now that my blog is registered on both the
<a href="http://perlsphere.net/" rel="nofollow">Perlsphere</a> and
<a href="http://ironman.enlightenedperl.org/" rel="nofollow">IronMan</a> aggregators,
the need for the second posts here has dwindled.  So... I'm going
on a limb and tentatively turn off the echoing.
See y'all on <a href="http://babyl.dyndns.org/techblog" rel="nofollow">Hacking Thy Fearful
Symmetry</a>!</p>

Of course, there is still the grooming of the use.perl.org HTML, and the actual importing into the new blogging engine. But… surely a handful of other scripts can take care of that, right? :-)
