ScraperBlog: 2012

Thursday, December 20, 2012

Easy Web Caching with VCR

Testing a scraper script sometimes means repeating a lot of http requests. Did you ever wish for an easy way to cache http responses to speed up your development? Here's an easy tip using ruby's vcr and fakeweb gems.

require 'vcr'
require 'fakeweb'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'
  c.hook_into :fakeweb
  c.allow_http_connections_when_no_cassette = true
end

def cache cassette_name = 'my_cassette'
  VCR.use_cassette(cassette_name, :record => :new_episodes, :match_requests_on => [:method, :uri, :body]) do
    yield
  end
end

Save this to a file called 'cache.rb', and now you've got a simple way to cache requests in your scripts:

require 'mechanize'
require './cache.rb'

cache do
  @agent = Mechanize.new
  page = @agent.get 'http://www.amazon.com/'
  puts page.title
end
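
If you're wondering what the cassette buys you, here's a rough stdlib-only sketch of the same idea: store a response on disk, keyed by the request, and skip the request next time. This is not how vcr works internally. The disk_cache helper, the key format and the temp directory are all names I made up for illustration.

```ruby
require 'digest/sha1'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not part of vcr): returns the body stored for `key`
# under `dir` if there is one, otherwise runs the block and stores its result.
def disk_cache(key, dir:)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, Digest::SHA1.hexdigest(key))
  return File.read(path) if File.exist?(path)

  body = yield.to_s
  File.write(path, body)
  body
end

# The block stands in for a real http request; it only runs on a cache miss.
dir = Dir.mktmpdir
requests = 0
2.times do
  disk_cache('GET http://www.example.com/', dir: dir) do
    requests += 1
    '<title>Example</title>'
  end
end
puts "requests made: #{requests}"
```

The second pass through the loop never reaches the block, which is the whole point: your script keeps its shape and only the slow part gets skipped.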

Thursday, December 6, 2012

Convert relative urls to absolute in php with Phpuri

Here's another common problem: how do I convert my relative urls to absolute urls in php? In most scripting languages there's some built-in class that can do this for you. Unfortunately php is really a web development language, so general purpose libraries can be lacking.

I tested two popular solutions against my test case and in the end decided to create a simplified 'port' of ruby's URI class. Let's take a look at the competition:

rel2abs - The nicest thing I can say about this solution is that it's the fastest. Unfortunately it failed 30% of my tests.
Usage: rel2abs($rel, $base)
UrlToAbsolute - This one did fairly well, passing 90% of my tests. Keep in mind that many of the tests are rare edge cases, so I imagine real world success would be close to 100%. I could almost be happy with this one. Unfortunately the global namespace clutter it creates is a potential disaster, so I decided it's best to steer clear of this one as well.
Usage: url_to_absolute($base, $rel)
Phpuri - While it passed 100% of the tests, I will concede that the deck was stacked. In other words I had the tests in mind while writing it and the goal was specifically to pass those tests.
Usage: phpUri::parse($base)->join($rel)

The Verdict:

Obviously I'm biased but I'm scoring this one for Phpuri.
Leave a comment if you disagree, I want to hear about it.

The Code:

require 'phpuri.php';
echo phpUri::parse('https://www.google.com/')->join('foo');
//==> https://www.google.com/foo
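
For comparison, this is the behavior of ruby's URI class that Phpuri is modeled on. The urls below are sample inputs of my own choosing, not cases taken from the test suite.

```ruby
require 'uri'

# URI.join resolves a relative reference against a base url.
base = 'https://www.google.com/a/b/c'

puts URI.join(base, 'foo')     # https://www.google.com/a/b/foo
puts URI.join(base, '../foo')  # https://www.google.com/a/foo
puts URI.join(base, '/foo')    # https://www.google.com/foo
```

A relative path replaces the last segment of the base, '..' climbs one directory, and a leading slash starts over from the host.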

The Download:

View the project or download just the latest source

Wednesday, December 5, 2012

Scraping a site that requires login in ruby/php

Some people have trouble scraping websites that require a login. I'm going to demonstrate how to do one that I've seen some people have trouble with, namely stubhub. I will scrape it using ruby mechanize, and then, just for fun, I will do it in php.

First instantiate your Mechanize object and turn off ssl verification:

require 'mechanize'
@agent = Mechanize.new
@agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

Rather than go straight to the 'Sign in' link, we'll go to the homepage, and then 'click through' to the sign in page. This will more accurately mimic real browser behavior.

page = @agent.get 'http://www.stubhub.com/'
page = page.link_with(:text => 'Sign in').click

Find the form, and fill out the login credentials. Notice that I use page.forms[1] because the login form is the second form on that page. If you're not sure which form it is, you might want to throw in a binding.pry at this point and inspect page.forms.

form = page.forms[1]
form['loginEmail'] = email
form['loginPassword'] = password

That's it. Submit the form and let's see if it gives us the log in text we expect.

page = form.submit
puts page.at('ul#headermenu li').text

The output says: Hi P! which means it worked and I'm logged in. Now let's cross our fingers and see if it's that easy in php using PGBrowser:

require 'pgbrowser/pgbrowser.php';

$b = new PGBrowser();

$page = $b->get('http://www.stubhub.com/');
$url = $page->at('//a[.="Sign in"]')->getAttribute('href');
$page = $b->get($url);

$form = $page->forms(1);
$form->set('loginEmail', $email);
$form->set('loginPassword', $password);
$page = $form->submit();

echo $page->at('//ul[@id="headermenu"]/li')->nodeValue;

It Works!

Thursday, November 29, 2012

Creating on demand proxies for your scraping script

Last week I wrote about using an ec2 instance as a proxy. If you followed all the steps, you should now have an AMI that you can use to spin up new proxies on demand. Doing that is the goal for today.

I will be using ruby's fog gem and mechanize to launch a new proxy, connect through it and scrape simple content.

First go back and discover the ami id of the proxy ami you created. It's in the AMIs section of your ec2 dashboard and the id will look like: ami-xxxxxxxx

I like to keep things like this in my ENV so they don't end up in a script floating around on the internet. In windows, add PROXY_AMI to your environment variables, or for linux put export PROXY_AMI=ami-xxxxxxxx in your .bash_profile
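
One small habit that goes well with this: read the variable with fetch so a forgotten export fails right away. A minimal sketch, where the proxy_ami helper is a name I made up and the env argument only exists so it can be tried without touching the real environment:

```ruby
# Hypothetical helper: fetch raises KeyError when PROXY_AMI is missing, so a
# forgotten export fails loudly here instead of later inside an AWS call.
def proxy_ami(env = ENV)
  env.fetch('PROXY_AMI')
end

# Trying it with a stand-in hash instead of the real environment:
puts proxy_ami({ 'PROXY_AMI' => 'ami-xxxxxxxx' })
```

In a real script you would just call proxy_ami with no argument and pass the result as the :image_id.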

The script:

require 'mechanize'
require 'fog'

compute = Fog::Compute.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

proxy = compute.servers.create :image_id => ENV['PROXY_AMI']
proxy.wait_for { ready? }

At this point we have spun up the proxy instance and waited for it to be in a ready state. Note that this doesn't yet mean that it's able to proxy our requests. This is because there is a delay between bootup and starting the proxy process.

agent = Mechanize.new
agent.set_proxy proxy.public_ip_address, 8080

Now we have instantiated the Mechanize agent and set the proxy to our new instance. The proxy does not need to be ready for us to set_proxy to it.

until page = agent.get('http://www.google.com/') rescue nil
  sleep 1
  puts 'waiting for proxy'
end

There's some debate about what's the best way to check if a proxy is working, but I think it's best to just try to connect until it works.
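
If you'd rather not loop forever when the instance never comes up, the same "try until it works" idea can be capped. This is only a sketch: wait_until is a helper name I made up, and the attempt count and delay are arbitrary defaults.

```ruby
# Hypothetical helper: retries the block until it returns a truthy value,
# giving up after `attempts` tries. An error raised by the block counts as
# a failed try. Returns nil if every try failed.
def wait_until(attempts: 60, delay: 1)
  attempts.times do
    result = begin
      yield
    rescue StandardError
      nil
    end
    return result if result

    sleep delay
  end
  nil
end

# Stand-in for a proxy that refuses the first two connections.
tries = 0
status = wait_until(attempts: 5, delay: 0) do
  tries += 1
  raise 'connection refused' if tries < 3
  'connected'
end
puts "#{status} after #{tries} tries"
```

With the real agent the block would be the agent.get call, and a nil result tells you it's time to destroy the instance and give up.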

puts page.title

proxy.destroy

Don't forget that last destroy line unless you want a big surprise on your next AWS bill.

One last gotcha: make sure that your default security group allows you to connect on 8080 (or whatever port you choose). I like to allow all TCP traffic from my development machine's IP address.

Thursday, November 22, 2012

Using an AWS ec2 micro instance as a proxy

Today we'll be setting up a proxy using an ec2 micro instance. These instances are cheap ($.50/day) but can handle tons of throughput which makes them perfect for running proxies. Also nice is you can run them in practically any geographic location you want. This blog post assumes that you already have an aws account and are familiar with security groups and key pairs.

Start by logging into AWS, go to your instances tab and click Launch Instance. I'm choosing an Amazon Linux 32 bit instance but you can use ubuntu if you prefer. Choose micro for the instance type and set the rest as you prefer. When you get to the tags section type 'Proxy' for the name.

When your instance launches make a note of its Public DNS and connect to it like so:

ssh -i mykey.pem ec2-user@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com

(replace mykey and the xxx's with your keyname and public ip)

Now to create a simple proxy, type:

cat > proxy.rb

and then paste in the following

require 'webrick'
require 'webrick/httpproxy'

s = WEBrick::HTTPProxyServer.new(:Port => 8080).start

Then type (Ctrl-D) to exit cat.

We want the proxy to start running on startup so let's add it to the crontab:

crontab -e

Now add the following line:

@reboot /usr/bin/ruby /home/ec2-user/proxy.rb

(Ctrl-C) then :wq to save and exit vim.
Now reboot the server:

sudo reboot

That's it. Now you've got an http/https proxy that you can connect to and push as much traffic through as you like (careful though, you will be charged extra for the bandwidth).
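
To check the proxy from your development machine, ruby's stdlib Net::HTTP accepts a proxy host and port as extra arguments. A minimal sketch: http_via_proxy is a helper name I made up, and the hostnames are placeholders for your own instance.

```ruby
require 'net/http'

# Hypothetical helper: builds a Net::HTTP client that sends its requests
# through the proxy. Nothing touches the network until a request is made.
def http_via_proxy(target_host, proxy_host, proxy_port = 8080)
  Net::HTTP.new(target_host, 80, proxy_host, proxy_port)
end

http = http_via_proxy('www.google.com', 'ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com')
puts http.proxy_address
puts http.proxy_port

# With a real Public DNS filled in and the proxy running, this line would
# make a request through it:
#   puts http.get('/').code
```

Port 8080 matches the :Port given to the WEBrick proxy above; change both together if you pick a different one.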

I want to do one more thing before we're finished here and make this instance into an AMI Image so we can spin up more later.

Go back to your instance tab, select the proxy instance, then go up to Instance Actions and select 'Create Image'. Name the image 'Proxy' and save it. This will allow you to spin up an identical instance in the future on demand and connect to it whenever you need a good quick proxy. Once the proxy AMI is created you can terminate your proxy instance.

Saturday, November 17, 2012

Saving generated xml to an S3 store

Here's a problem that came up recently: how to generate an xml document and save it to S3 without creating an intermediate local file. I'm using Nokogiri's XML builder and the fog gem:

require 'fog'
require 'nokogiri'

# build the xml
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root do
    xml.foo 'bar'
  end
end

# log in to S3 and save it
storage = Fog::Storage.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

dir = storage.directories.get 'my_bucket'
dir.files.create :key => 'huge.xml', :body => builder.to_xml

Notice the Amazon credentials are in environment variables rather than coded in. This will protect you from accidentally giving them away (leaving them in a public git repo, for example).

Sunday, November 4, 2012

Introducing PGBrowser

I finally got tired of complaining about how php doesn't have a mechanize-style scraping library that does forms and cookies, and decided to make one. There's not too many bells and whistles (yet) but I did include support for doPostBack asp actions.

To test it I used this classic example from the scraperwiki blog.

require 'pgbrowser/pgbrowser.php';

$b = new PGBrowser();
$page = $b->get('http://data.fingal.ie/ViewDataSets/');

while($nextLink = $page->at('//a[@id="lnkNext"][@href]')){
  echo "I found another page!\n";
  $page = $page->form()->doPostBack($nextLink->getAttribute('href'));
}

I expect this to really take the pain out of scraping forms with php from now on.
View the project or download the source.

Saturday, November 3, 2012

Choosing a Php HTML parser

Today we'll be comparing some HTML parsing libraries for php and picking a winner. The ideal candidate will support css3 selectors, be DOM based, and be easy to use. I've rounded up 3 candidates:

Simple HTML Dom - A favorite with most php users from what I've seen.
phpQuery - An interesting project, claiming to be a jQuery port.
Ganon - A new challenger.

The Setup

Names
Mitt Romney
Barack Obama

The Test

# simple html dom
require('simple_html_dom.php');
$doc = str_get_html($html);
echo $doc->find('label ~ span[id$=2]', 0)->innerText();

# phpquery
require('phpQuery.php');
$doc = phpQuery::newDocumentHTML($html);
phpQuery::selectDocument($doc);
echo pq('label ~ span[id$=2]')->text();

# ganon
require('ganon.php');
$doc = str_get_dom($html);
echo $doc('label ~ span[id$=2]', 0)->getInnerText();

The results

Simple html dom was the only one to fail the css3 test. This is clearly a deal breaker. Without support for simple css like the sibling selector (~), there's just no way I could justify looking any further at this library.

phpQuery passed the css3 test, its selection syntax feels the cleanest, and it's DOM based, unlike the other two. I also like that it's PEAR installable.

Ganon also passed the css3 test, and it actually outperformed phpQuery for 10K iterations (7 seconds to phpQuery's 8). Definitely a strong contender.

The winner

phpQuery - Full css3 support, a DOM based design, PEAR installability, and a clean syntax edge out Ganon for the top spot this time.

Friday, November 2, 2012

Do it the right way with css

Looking around at other people's scraping projects I still see a lot of people doing it the wrong way. And I'm not talking about parsing html with regex this time. I'm talking about using xpath expressions instead of css selectors. Let's take a look:


  <label for="name">Name</label>
  <span>Barack Obama</span>

There's 2 ways for me to get at the data I want:

# using xpath
doc.at('//label[@for="name"]/following-sibling::span').text
# using css
doc.at('label[for=name] + span').text

So which one is better? Unless you're a machine the answer is always css. Because css is a human-friendly way to select the data you want, your code will be easier to maintain than the hot mess created with xpath expressions. There's a good reason why web designers have been using it for so long.

Wednesday, October 31, 2012

Making on the fly changes to running ruby jobs

Let's say you've got a long running ruby script and you need to be able to make config changes. But you don't want to stop the program because then you'd have to start all over. Add this to the top of the script:

require 'pry'
trap("INT"){binding.pry}

And now you can interrupt it with a Ctrl-C. After making changes to the environment, type exit to let the script resume. Use exit-program if you really want the script to exit.

For example, I might use this to change the proxy my agent is connecting through on a long scrape job when things get a little slow.

Sunday, October 28, 2012

Asp forms with doPostBack using ruby mechanize

Here's a simple scraping problem that trips up lots of people: how to page through aspx search results that use doPostBack actions. I'm basing this on an older ScraperWiki post that used python.

doPostBack is not as scary as people think. It basically takes 2 arguments, sets the form values and submits the form. I'll make it simple by monkey patching it into Mechanize::Form

require 'mechanize'

class Mechanize::Form
  def postback target, argument
    self['__EVENTTARGET'], self['__EVENTARGUMENT'] = target, argument
    submit
  end
end

The rest is simple. Find the 'Next' link, parse out the values and send them to Form#postback. Put it in a while loop and you've got paging.

agent = Mechanize.new
page = agent.get 'http://data.fingal.ie/ViewDataSets/'

while next_link = page.at('a#lnkNext[href]')
  puts 'I found another page!'
  target, argument = next_link[:href].scan(/'([^']*)'/).flatten
  page = page.form.postback target, argument
end
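
In case the scan line looks cryptic, here it is on its own. The href values are samples I typed in to show the shape of a typical paging link, not values copied from the site, and postback_args is just a name for this sketch.

```ruby
# Hypothetical helper: pulls every single-quoted string out of a doPostBack
# href, in order. The first is the event target, the second the argument.
def postback_args(href)
  href.scan(/'([^']*)'/).flatten
end

target, argument = postback_args("javascript:__doPostBack('lnkNext','')")
p target    # "lnkNext"
p argument  # ""
```

An empty argument is normal. Mechanize still submits the empty __EVENTARGUMENT field, which is what the server expects.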

The result is much cleaner than what I've seen from the python side. Ruby's mechanize is sophisticated enough to avoid all of the many pitfalls of its python counterpart. No wonder I like it so much!