ScraperBlog: December 2012

Thursday, December 20, 2012

Easy Web Caching with VCR

Testing a scraper script sometimes means repeating a lot of HTTP requests. Did you ever wish for an easy way to cache HTTP responses to speed up your development? Here's an easy tip using Ruby's vcr and fakeweb gems.

require 'vcr'
require 'fakeweb'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'
  c.hook_into :fakeweb
  c.allow_http_connections_when_no_cassette = true
end

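# Wrap any block of HTTP calls: new requests are recorded to the cassette,
# and requests that were already recorded are replayed from disk.
def cache(cassette_name = 'my_cassette')
  VCR.use_cassette(cassette_name, :record => :new_episodes, :match_requests_on => [:method, :uri, :body]) do
    yield
  end
end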

Save this to a file called 'cache.rb', and now you've got a simple way to cache requests in your scripts:

require 'mechanize'
require './cache.rb'

cache do
  @agent = Mechanize.new
  page = @agent.get 'http://www.amazon.com/'
  puts page.title
end
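
If you're curious what record-and-replay amounts to, here is a rough standard-library sketch of the idea. It is not how VCR is implemented, and the `replay_or_record` helper is a made-up name for illustration: it stores whatever string the block returns in a file keyed by a hash of the request description, and reads that file back on later calls.

```ruby
require 'digest/sha1'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not part of VCR): return the cached body for `key`
# if one exists in `dir`, otherwise run the block and save its result.
def replay_or_record(key, dir)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, Digest::SHA1.hexdigest(key))
  return File.read(path) if File.exist?(path) # replay from disk

  body = yield                                # "record": do the real work once
  File.write(path, body)
  body
end

# Demo with a stand-in for a real HTTP call, so nothing touches the network.
calls = 0
fake_fetch = lambda do
  calls += 1
  '<title>Example</title>'
end

dir = Dir.mktmpdir('cassettes')
first  = replay_or_record('GET http://example.com/', dir) { fake_fetch.call }
second = replay_or_record('GET http://example.com/', dir) { fake_fetch.call }
puts calls # prints 1, the second call was served from the cache
```

VCR does a lot more than this (it hooks the HTTP library itself and matches on method, URI and body), which is why the gem is the better choice in a real script.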

Thursday, December 6, 2012

Convert relative URLs to absolute in PHP with Phpuri

Here's another common problem: how do I convert my relative URLs to absolute URLs in PHP? In most scripting languages there's some built-in class that can do this for you. Unfortunately PHP is really a web development language, so general-purpose libraries can be lacking.

I tested two popular solutions against my test cases and in the end decided to create a simplified 'port' of Ruby's URI class. Let's take a look at the competition:

rel2abs - The nicest thing I can say about this solution is that it's the fastest. Unfortunately it failed 30% of my tests.
Usage: rel2abs($rel, $base)
UrlToAbsolute - This one did fairly well, passing 90% of my tests. Keep in mind that many of the tests are rare edge cases, so I imagine real world success would be close to 100%. I could almost be happy with this one. Unfortunately the global namespace clutter it creates is a potential disaster, so I decided it's best to steer clear of this one as well.
Usage: url_to_absolute($base, $rel)
Phpuri - While it passed 100% of the tests, I will concede that the deck was stacked. In other words, I had the tests in mind while writing it, and the goal was specifically to pass those tests.
Usage: phpUri::parse($base)->join($rel)

The Verdict:

Obviously I'm biased but I'm scoring this one for Phpuri.
Leave a comment if you disagree, I want to hear about it.

The Code:

require 'phpuri.php';
echo phpUri::parse('https://www.google.com/')->join('foo');
//==> https://www.google.com/foo
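
Since Phpuri is modeled on Ruby's URI class, it can help to see the behavior being ported. Here is a small Ruby sketch using the standard library's `URI.join`. The `absolutize` helper name is made up for illustration, and the second base URL is borrowed from the examples in RFC 3986 rather than from my test suite.

```ruby
require 'uri'

# Hypothetical convenience wrapper around the standard library's URI.join.
def absolutize(base, rel)
  URI.join(base, rel).to_s
end

puts absolutize('https://www.google.com/', 'foo')
# => https://www.google.com/foo

# A few relative forms against a deeper base URL.
base = 'http://a/b/c/d;p?q'
%w[g ../g /g].each do |rel|
  puts "#{rel} => #{absolutize(base, rel)}"
end
# g => http://a/b/c/g
# ../g => http://a/b/g
# /g => http://a/g
```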

The Download:

View the project or download just the latest source

Wednesday, December 5, 2012

Scraping a site that requires login in Ruby/PHP

Some people have trouble scraping websites that require a login. I'm going to demonstrate how to do one that I've seen some people struggle with, namely StubHub. I will scrape it using Ruby's Mechanize, and then, just for fun, I will do it in PHP.

First instantiate your Mechanize object and turn off SSL verification:

require 'mechanize'
@agent = Mechanize.new
@agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

Rather than go straight to the 'Sign in' link, we'll go to the homepage and then 'click through' to the sign-in page. This more accurately mimics real browser behavior.

page = @agent.get 'http://www.stubhub.com/'
page = page.link_with(:text => 'Sign in').click

Find the form and fill out the login credentials. Notice that I use page.forms[1]. This is because the login form is the second form on that page. If you're not sure which form it is, you might want to throw in a binding.pry at this point and inspect page.forms.

# email and password are assumed to be defined earlier in your script
form = page.forms[1]
form['loginEmail'] = email
form['loginPassword'] = password
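
The snippet above uses email and password without defining them. One way to supply them without hard-coding secrets in the script is to read them from environment variables. This is only a sketch: the `credentials` helper and the variable names STUBHUB_EMAIL and STUBHUB_PASSWORD are made up for illustration.

```ruby
# Hypothetical helper: fetch login credentials from an ENV-like object.
# `fetch` raises KeyError if a variable is missing, so a misconfigured
# environment fails loudly instead of submitting a blank form.
def credentials(env = ENV)
  [env.fetch('STUBHUB_EMAIL'), env.fetch('STUBHUB_PASSWORD')]
end

# In a real script you would call `credentials` with no arguments.
# A plain hash stands in for ENV here so the example runs anywhere.
email, password = credentials({
  'STUBHUB_EMAIL'    => 'me@example.com',
  'STUBHUB_PASSWORD' => 'secret'
})
puts email # prints me@example.com
```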

That's it. Submit the form and let's see if it gives us the log in text we expect.

page = form.submit
puts page.at('ul#headermenu li').text

The output says "Hi P!", which means it worked and I'm logged in. Now let's cross our fingers and see if it's that easy in PHP using PGBrowser:

require 'pgbrowser/pgbrowser.php';

$b = new PGBrowser();

$page = $b->get('http://www.stubhub.com/');
$url = $page->at('//a[.="Sign in"]')->getAttribute('href');
$page = $b->get($url);

$form = $page->forms(1);
$form->set('loginEmail', $email);
$form->set('loginPassword', $password);
$page = $form->submit();

echo $page->at('//ul[@id="headermenu"]/li')->nodeValue;

It Works!