Some people have trouble scraping websites that require a login. I'm going to demonstrate one that I've seen trip people up: StubHub. I'll scrape it using Ruby Mechanize, and then, just for fun, do the same thing in PHP.
First, instantiate your Mechanize object and turn off SSL verification (convenient for a quick scraping script, though keep in mind it disables certificate checking):
require 'mechanize'
@agent = Mechanize.new
@agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
Rather than go straight to the sign-in URL, we'll load the homepage and then 'click through' to the sign-in page. This more accurately mimics real browser behavior.
page = @agent.get 'http://www.stubhub.com/'
page = page.link_with(:text => 'Sign in').click
Find the form and fill out the login credentials. Notice that I use page.forms[1]; that's because the login form is the second form on the page (forms are zero-indexed). If you're not sure which form it is, you might want to throw in a
binding.pry
at this point and inspect page.forms.
form = page.forms[1]
form['loginEmail'] = email
form['loginPassword'] = password
That's it. Submit the form and let's see if it gives us the logged-in text we expect.
page = form.submit
puts page.at('ul#headermenu li').text
The output says:
Hi P!
which means it worked and I'm logged in. Now let's cross our fingers and see if it's that easy in PHP using PGBrowser:
require 'pgbrowser/pgbrowser.php';
$b = new PGBrowser();
$page = $b->get('http://www.stubhub.com/');
$url = $page->at('//a[.="Sign in"]')->getAttribute('href');
$page = $b->get($url);
$form = $page->forms(1);
$form->set('loginEmail', $email);
$form->set('loginPassword', $password);
$page = $form->submit();
echo $page->at('//ul[@id="headermenu"]/li')->nodeValue;
It works!