Exporting Old use.perl.org Blog Entries

This weekend I finally got around to importing all my old use.perl.org blog entries to Fearful Symmetry. To ease the migration, I ended up writing two itsy-bitsy scripts. They’re nothing fancy, but in case they might help someone, here they are.

Harvest the entries

This was easy. For each account, use.perl.org has a journal entries listing page. So the whole operation consisted of grabbing that webpage and mirroring everything on it that looks like a journal entry. Not terribly sophisticated, but for this specific job it’s all we need.

Of the script itself, the most interesting part is LWP::Simple::getstore(). Most people know and use LWP::Simple::get(), but more than a few forget its sibling, which saves the retrieved webpage directly to a file — which is perfect for harvesting activities like this one.

#!/usr/bin/perl 

use 5.10.0;

use strict;
use warnings;

use LWP::Simple;

my $uid      = '3196';
my $username = 'Yanick';

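my $main = get( 'http://use.perl.org/journal.pl?op=list&uid=' . $uid )
  or die "couldn't retrieve the journal listing\n";

# in scalar context, m//g resumes where the previous match ended,
# so each pass of the loop picks up the next entry id
while ( $main =~ m#//use\.perl\.org/~\Q$username\E/journal/(\d+)#g ) {
    my $entry_id = $1;

    say "retrieving $entry_id...";

    getstore( "http://use.perl.org/~$username/journal/$entry_id", $entry_id );

    sleep 1;    # let's be nice to the server, shall we?
}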
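
The entry-matching step can also be tried on its own, without touching the network. What follows is only a minimal sketch: the entry_ids() helper and the listing snippet are made up for illustration (38900 is an invented id), and neither comes from the original script. It shows the same pattern used in list context, where m//g hands back all the ids at once, plus a small hash to drop repeats in case an entry is linked more than once on the page.

```perl
#!/usr/bin/perl

use 5.10.0;

use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# journal entry ids found in a listing page, each one only once and
# in order of first appearance.
sub entry_ids {
    my ( $html, $username ) = @_;

    # in list context, m//g returns every captured id in one go
    my @ids = $html =~ m#//use\.perl\.org/~\Q$username\E/journal/(\d+)#g;

    my %seen;
    return grep { !$seen{$_}++ } @ids;
}

# made-up listing snippet, only here to exercise the helper
my $listing = <<'END_HTML';
<a href="//use.perl.org/~Yanick/journal/38951">first entry</a>
<a href="//use.perl.org/~Yanick/journal/38951">same entry, linked again</a>
<a href="//use.perl.org/~Yanick/journal/38900">second entry</a>
END_HTML

say for entry_ids( $listing, 'Yanick' );
```

The returned list could then drive the getstore() loop, one id at a time.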

Extract the information off the harvested pages

As one might suspect, the harvested use.perl.org pages contain a little bit more than the raw blog entries. Getting to the information we want (the blog entry’s title, creation date, body, etc.) is not hard, but it’s a little onerous to do by hand.

There are a lot of ways to extract information from a webpage, from quick and dirty regular expressions (like I did in the script above) to full-fledged DOM parsing using, say, HTML::Tree. As I’m playing a lot with jQuery these days, I wondered if there was anything Perlish available offering the same type of interface. Guess what? There is: pQuery.

After playing with it a little bit, I’d say that pQuery is not quite as slick and ready for prime-time as its JavaScript forebear. But again, for this small task, it allowed me to do the job.

The resulting script is as straightforward as they come. I used Firebug to find out which HTML elements I wanted, tested the resulting paths with jQuery and, once I was happy with them, adapted them to pQuery.

#!/usr/bin/perl 

use 5.10.0;

use strict;
use warnings;

use pQuery;
use utf8;    #unless you want xml, you can skip utf8'ing the output

$/ = undef;  # it's slurping time

my $p = pQuery(<>);

say "title: ", $p->find('.title h3')->get(1)->innerHTML;

my ( $month, $day, $year ) =
  $p->find('.journaldate')->html() =~ /(\w{3})\w* 0?(\d+), (\d{4})$/;
say "date: ", "$day $month $year";

say "original url: http:"
  . $p->find('.h-inline a')->get(0)->getAttribute('href');

say "\n";

utf8::encode( my $entry = $p->find('.intro')->get(0)->innerHTML );

say $entry;
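
The date regex is the fiddliest bit of the script, so it can be worth exercising it by itself. Here is a small sketch under a couple of assumptions: the journal_date() helper is hypothetical, and the sample string is my guess at what .journaldate holds (the real markup may differ). The last line goes one optional step further and uses the core Time::Piece module to turn the result into an ISO-style date, in case the new blog engine prefers that.

```perl
#!/usr/bin/perl

use 5.10.0;

use strict;
use warnings;

use Time::Piece;    # core module, used here only to reformat the date

# Hypothetical helper: applies the same regex as the script above to
# a date string and returns it as "day month year", or nothing if
# the string does not match.
sub journal_date {
    my ($text) = @_;

    my ( $month, $day, $year ) =
      $text =~ /(\w{3})\w* 0?(\d+), (\d{4})$/
      or return;

    return "$day $month $year";
}

# assumed sample; the real content of .journaldate may differ
my $date = journal_date('Sunday May 10, 2009');
say "date: $date";

# optional extra step: an ISO-style date for the new blog engine
say "iso:  ", Time::Piece->strptime( $date, '%d %b %Y' )->ymd;
```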

It’s harvesting time

With those two scripts ready to go, the harvesting process becomes much less of a chore:

$ perl files/harvest_entries.pl
retrieving 38951...
retrieving 38951...

$ perl files/extract_entry.pl 38951
title: Breaking off from the use.perl.org mothership
date: 10 May 2009
original url: http://use.perl.org/~Yanick/journal/38951

<p>
For the last couple of months, as a concession between
visibility and control, I'd been double-posting my blog
entries both here and on my
personal blog.
But now that my blog is registered on both the
<a href="http://perlsphere.net/" rel="nofollow">Perlsphere</a> and
<a href="http://ironman.enlightenedperl.org/" rel="nofollow">IronMan</a> aggregators,
the need for the second posts here has dwindled.  So... I'm going
on a limb and tentatively turn off the echoing.
See y'all on <a href="http://babyl.dyndns.org/techblog" rel="nofollow">Hacking Thy Fearful
Symmetry</a>!</p>

Of course, there is still the grooming of the use.perl.org HTML, and the actual importing to the new blogging engine. But… surely a handful of other scripts can take care of that, right? :-)

OOW10 Bloggers Meetup Agenda — T-shirts are Back and More…

Almost time for the annual Bloggers Meetup @ OOW, and we’re counting down. The details are finally organized: this year, we have not one, but TWO great prizes at the Oracle OpenWorld Bloggers Meetup.

1) T-shirt art contest on stylish Pythian designer t-shirts — one lucky blogger will receive an HP X310 Data Vault, generously sponsored again this year by HP.

2) For the best, most creative blog post about the meetup itself, Pythian is giving away an Apple TV. But, there are a few small rules:

  1. the blog post must use as many names of people in attendance as possible.
  2. the blog post must be readable. It needs to make sense to someone who wasn’t there. It must be a story and not a list.

Read the rest of this entry . . .

Welcome Chen Shapira to Pythian

I’m excited to announce that Chen Shapira has started with Pythian this week. :) Chen is no stranger here: many of my colleagues already know her and have been in touch, so she fits in naturally.

Chen is a world-class production Oracle DBA. She has been maintaining a popular Oracle blog and is a great addition to the Pythian bloggers. In fact, Chen posted on the Pythian blog before her official join date, authoring Log Buffer last week. She is also on Twitter, where you can follow her @gwenshap. Turns out that even Pythian SQL Server DBAs are frequent readers of her blog. Who knew?

Chen is also a frequent presenter at conferences such as RMOUG10, Hotsos09, OOW09 and OOW08.

Chen is an Oracle ACE and a member of the OakTable Network. She is very active in her local user group, the Northern California Oracle User Group (NoCOUG), where she carries the duties of Training Day Coordinator.

Welcome to the Pythian team, Chen! I’m sure you are already working on acceptance testing of that new Oracle 11g RAC cluster that is slated to go live… eh… this weekend? ;-)

The Joy of Finding Your Code in Unexpected Places

Lotsa penguins

picture by Geophaps
Hey, that one in the sixth row…
Doesn’t he look familiar?

So there I am, on my morning bus ride, reading my copy of The Definitive Guide to Catalyst (keep your eyes peeled for the upcoming review of the book in the Perl Review).

I’m near the end, in Chapter 11, Catalyst Cookbook. As it is with most tech books, the last chapters are the most engrossing, as the gloves finally come off and the writers throw at you all the wonderful, mind-bending stuff that the rest of the book prepares you for.

The section I’m at is about the development process. Specifically, it shows how you can put hooks in your versioning system to automatically screen commits to conform to Perl::Critic and Perl::Tidy policies. The given example script uses Git, which is just dandy with me as it is my current VCS of choice. But there’s something . . .  funny about that script. The way the utility functions are stashed at the end after a

### utility functions ##############################

line. The choice of variable names. The comments. It all feels oddly familiar. Read the rest of this entry . . .

Bloggers Meetup @ Oracle Open World 2009

Are you an Oracle blogger attending Oracle Open World 2009?

If so, you are invited to attend this Oracle Bloggers Meetup during OOW 2009, a chance to meet your online buddies face-to-face in a relaxed and informal atmosphere.

What: OOW 2009 Bloggers Meetup

When: Tue, 13-Oct, 6:00pm

Where: LJ’s Martini Club & Grill @ Metreon 2nd Floor, 101 4th Street, San Francisco Updated: 13-Oct!

It’s a big disappointment that Eddie Awad is not going to be with us at Oracle Open World this year… But the show must go on, and the Oracle Bloggers Meetup must happen again. So I’m picking up the baton from Eddie and will organize this year’s meetup with the help of Justin Kestelyn and Lillian Buziak (Oracle ACE Wrangler).

First things first: thanks to OTN for sponsoring our gathering again. Just like last year, we will have drinks served for a while. But there are some differences from previous years…
Read the rest of this entry . . .

Alex Gorbachev’s RSS Feeds Aggregated

Back in May 2006, I started my blog on the Blogger platform and one month later moved it to my own website using WordPress. A couple of months later, I joined Pythian and, since then, the vast majority of my blogging has been on the Pythian Group Blog.

The Pythian blog has grown significantly since then, and many more excellent authors have started blogging there. While the Pythian blog was mostly focused on the Oracle database just a couple of years ago, its coverage is now very broad and includes MySQL, SQL Server and Oracle databases as well as Oracle Application Server, Oracle eBusiness Suite and other enterprise software. While I think this is a great opportunity to extend your areas of interest, it might be just too much for some: a few people have already complained and unsubscribed to avoid being overwhelmed with information. That was painful to hear!

First of all, I should say that there is a way to subscribe only to a selected category or a single author — just add /feed/ at the end of pretty much any page. For example, all my blog posts can be seen using URL http://www.pythian.com/blogs/author/alex and RSS feed URL would be http://www.pythian.com/blogs/author/alex/feed/. Likewise, the Oracle category RSS feed is http://www.pythian.com/blogs/category/oracle/feed/. Read the rest of this entry . . .
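
For anyone who would rather script it, the "/feed/" rule above boils down to a one-liner. This is just an illustrative sketch: feed_url() is a hypothetical helper, and it assumes the rule really does hold for the page given, which is "pretty much any page" but not necessarily all of them.

```perl
#!/usr/bin/perl

use 5.10.0;

use strict;
use warnings;

# Hypothetical helper: builds the feed URL for a blog page by
# appending /feed/ to its address.
sub feed_url {
    my ($page_url) = @_;

    $page_url =~ s{/*$}{};    # drop any trailing slashes first

    return "$page_url/feed/";
}

say feed_url('http://www.pythian.com/blogs/author/alex');
say feed_url('http://www.pythian.com/blogs/category/oracle/');
```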
