Sunday, October 28, 2012

ASP.NET forms with doPostBack using Ruby Mechanize

Here's a simple scraping problem that trips up lots of people: how to page through .aspx search results that use doPostBack actions. I'm basing this on an older ScraperWiki post that used Python.

doPostBack is not as scary as people think. It takes two arguments, an event target and an event argument, copies them into the form's hidden __EVENTTARGET and __EVENTARGUMENT fields, and submits the form. I'll make it simple by monkey patching it into Mechanize::Form:

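require 'mechanize'

# Add a Ruby equivalent of the page's doPostBack JavaScript to every form.
class Mechanize::Form
  # Fill in the two hidden fields the server reads, then submit the form.
  # Returns the page that comes back from the submission.
  def postback target, argument
    self['__EVENTTARGET'], self['__EVENTARGUMENT'] = target, argument
    submit
  end
end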
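
To see what that submit actually sends, here's a rough sketch that builds the POST body by hand, with no network involved. The postback_body helper and the sample __VIEWSTATE value are made up for illustration; a real page carries its own hidden fields, and Mechanize picks those up from the form for you.

```ruby
require 'uri'

# Hypothetical helper, not part of Mechanize: merge the two postback
# fields into the form's existing hidden fields and encode the result
# as a form body.
def postback_body(hidden_fields, target, argument)
  fields = hidden_fields.merge(
    '__EVENTTARGET'   => target,
    '__EVENTARGUMENT' => argument
  )
  URI.encode_www_form(fields)
end

# Sample values are assumptions; a real __VIEWSTATE is a long opaque string.
body = postback_body({ '__VIEWSTATE' => 'abc123' }, 'ctl00$lnkNext', '')
puts body
```

The point is that a postback is just an ordinary form POST with two extra fields filled in, which is why the monkey patch can be so short.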

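The rest is simple. Find the 'Next' link, pull the two quoted values out of its href and send them to Form#postback. Put it in a while loop and you've got paging; the loop stops as soon as a page comes back without a 'Next' link that has an href.
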
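agent = Mechanize.new
page = agent.get 'http://data.fingal.ie/ViewDataSets/'

# Keep going while the page has a 'Next' link that still carries an href.
while next_link = page.at('a#lnkNext[href]')
  puts 'I found another page!'
  # The href looks like javascript:__doPostBack('target','argument'); grab both quoted values.
  target, argument = next_link[:href].scan(/'([^']*)'/).flatten
  page = page.form.postback target, argument
end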
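
If the scan line looks cryptic, here it is on its own against a sample href. The href below and the postback_args helper name are my assumptions for illustration; the real control names depend on the site.

```ruby
# Hypothetical helper: pull every single-quoted value out of a
# javascript:__doPostBack(...) href, in order.
def postback_args(href)
  href.scan(/'([^']*)'/).flatten
end

# Sample href; the control name is made up.
href = %q{javascript:__doPostBack('ctl00$lnkNext','')}
target, argument = postback_args(href)

puts target.inspect
puts argument.inspect
```

The regex matches each pair of single quotes and captures what's between them, so an empty second argument still comes back as an empty string instead of nil.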

The result is much cleaner than what I've seen from the Python side. Ruby's Mechanize is sophisticated enough to avoid all of the many pitfalls of its Python counterpart. No wonder I like it so much!
