Here's a simple scraping problem that trips up a lot of people: how to page through ASPX search results that use doPostBack actions. I'm basing this on an older ScraperWiki post that used Python.
doPostBack is not as scary as people think. It takes two arguments, copies them into the form's hidden __EVENTTARGET and __EVENTARGUMENT fields, and submits the form. I'll keep it simple by monkey patching it into Mechanize::Form:
require 'mechanize'

class Mechanize::Form
  # Mimic the page's doPostBack: set the two event fields, then submit.
  def postback(target, argument)
    self['__EVENTTARGET'], self['__EVENTARGUMENT'] = target, argument
    submit
  end
end
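To make it concrete what that submit ends up sending, here is a rough sketch of the POST body using only the standard library. The helper name postback_body and the field values ('abc123', 'ctl00$lnkNext') are made up for illustration; a real page will have its own hidden fields, which Mechanize carries along for you.

```ruby
require 'uri'

# Hypothetical helper: merge the two event fields into the form's
# existing hidden fields and encode the result as a POST body.
def postback_body(hidden_fields, target, argument)
  fields = hidden_fields.merge(
    '__EVENTTARGET'   => target,
    '__EVENTARGUMENT' => argument
  )
  URI.encode_www_form(fields)
end

# Made-up values standing in for whatever the page actually renders.
body = postback_body({ '__VIEWSTATE' => 'abc123' }, 'ctl00$lnkNext', '')
puts body
```

The point is only that a postback is an ordinary form POST with two extra fields filled in, which is why a two-line method is enough.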
The rest is simple. Find the 'Next' link, pull the two quoted values out of its href and send them to Form#postback. Put it in a while loop and you've got paging.
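agent = Mechanize.new
page = agent.get 'http://data.fingal.ie/ViewDataSets/'

# Keep going for as long as the page has a 'Next' link with an href.
while (next_link = page.at('a#lnkNext[href]'))
  puts 'I found another page!'
  # The href is a javascript: call; its quoted strings are target and argument.
  target, argument = next_link[:href].scan(/'([^']*)'/).flatten
  page = page.form.postback target, argument
end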
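If the scan line looks cryptic, here is a small sketch of what it does on its own, with no network involved. The method name postback_args and the sample href are invented for illustration; the real link on the page may look different.

```ruby
# Hypothetical helper: pull every single-quoted string out of a
# javascript: href. For a doPostBack link that is target, then argument.
def postback_args(href)
  href.scan(/'([^']*)'/).flatten
end

# Made-up href in the shape a 'Next' link typically has.
sample_href = "javascript:__doPostBack('ctl00$MainContent$lnkNext','')"
target, argument = postback_args(sample_href)

puts target.inspect
puts argument.inspect
```

Note that an empty second argument comes back as an empty string rather than nil, so it can be passed straight to Form#postback.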
The result is much cleaner than what I've seen from the Python side. Ruby's Mechanize is sophisticated enough to avoid many of the pitfalls of its Python counterpart. No wonder I like it so much!