Wednesday, December 5, 2012

Scraping a site that requires login in ruby/php

Some people have trouble scraping websites that require a login. I'm going to demonstrate one that I've seen people struggle with, namely StubHub. I will scrape it using Ruby's Mechanize, and then, just for fun, I will do the same thing in PHP.

First, instantiate your Mechanize object and turn off SSL verification:
require 'mechanize'
@agent = Mechanize.new
@agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

Rather than go straight to the 'Sign in' link, we'll go to the homepage and then 'click through' to the sign-in page. This more accurately mimics real browser behavior.
page = @agent.get 'http://www.stubhub.com'
page = page.link_with(:text => 'Sign in').click

Find the form and fill out the login credentials. Notice that I use page.forms[1]; that's because the login form is the second form on that page. If you're not sure which form you need, you might want to throw in a binding.pry at this point and inspect page.forms.
form = page.forms[1]
form['loginEmail'] = email
form['loginPassword'] = password

That's it. Submit the form and let's see if it gives us the logged-in text we expect.
page = form.submit
puts page.search('ul#headermenu li').text

The output says "Hi P!", which means it worked and I'm logged in. Now let's cross our fingers and see if it's that easy in PHP using PGBrowser:
require 'pgbrowser/pgbrowser.php';

$b = new PGBrowser();

$page = $b->get('http://www.stubhub.com');
$url = $page->at('//a[.="Sign in"]')->getAttribute('href');
$page = $b->get($url);

$form = $page->forms(1);
$form->set('loginEmail', $email);
$form->set('loginPassword', $password);
$page = $form->submit();

echo $page->at('//ul[@id="headermenu"]/li')->nodeValue;

It Works!
