ScraperBlog: October 2012

Wednesday, October 31, 2012

Making on the fly changes to running ruby jobs

Let's say you've got a long running ruby script: and you need to be able to make config changes. But you don't want to stop the program because then you'd have to start all over. Add this to the top of the script

require 'pry'
trap("INT"){binding.pry}

And now you can interrupt it with a crl-c. After making changes to the environment exit will let the script resume. Use exit-program if you really want the script to exit.

For example I might use this to change the proxy my agent is connecting through on a long scrape job when things gets a little slow.

Sunday, October 28, 2012

Asp forms with doPostBack using ruby mechanize

Here's a simple scraping problem that causes problems for lots of people. How to page aspx search results that use doPostBack actions. I'm basing this on an older ScraperWiki post that used python.

doPostBack is not as scary as people think. It basically takes 2 arguments, sets the form values and submits the form. I'll make it simple by monkey patching it into Mechanize::Form

require 'mechanize'

class Mechanize::Form
  def postback target, argument
    self['__EVENTTARGET'], self['__EVENTARGUMENT'] = target, argument
    submit
  end
end

The rest is simple. Find the 'Next' link, parse out the values and send them to Form#postback. Put it in a while loop and you've got paging.

agent = Mechanize.new
page = agent.get 'http://data.fingal.ie/ViewDataSets/'

while next_link = page.at('a#lnkNext[href]')
  puts 'I found another page!'
  target, argument = next_link[:href].scan(/'([^']*)'/).flatten
  page = page.form.postback target, argument
end

The result is much cleaner than what I've seen from the python side. Ruby's mechanize is sophisticated enough to avoid all of the many pitfalls of its python counterpart. No wonder I like it so much!