Thursday, November 29, 2012

Creating on demand proxies for your scraping script

Last week I wrote about using an EC2 instance as a proxy. If you followed all the steps, you should now have an AMI that you can use to spin up new proxies on demand. That's the goal for today.

I'll be using Ruby's fog and mechanize gems to launch a new proxy, connect through it, and scrape some simple content.

First, go back and find the AMI id of the proxy image you created. It's in the AMIs section of your EC2 dashboard, and the id will look like ami-xxxxxxxx.

I like to keep values like this in environment variables so they don't end up in a script floating around on the internet. On Windows, add PROXY_AMI to your environment variables; on Linux, put export PROXY_AMI=ami-xxxxxxxx in your .bash_profile.
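
If you're forgetful, a quick sanity check at the top of the script doesn't hurt; a small optional guard for the variables the script uses:
%w[AMAZON_ACCESS_KEY_ID AMAZON_SECRET_ACCESS_KEY PROXY_AMI].each do |var|
  abort "#{var} is not set" unless ENV[var]
end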

The script:
require 'mechanize'
require 'fog'

compute = Fog::Compute.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

proxy = compute.servers.create :image_id => ENV['PROXY_AMI']
proxy.wait_for { ready? }

At this point we've spun up the proxy instance and waited for it to reach a ready state. Note that this doesn't mean it can proxy our requests yet: there's a delay between boot and the proxy process starting.
agent = Mechanize.new
agent.set_proxy proxy.public_ip_address, 8080

Now we have instantiated the Mechanize agent and pointed it at our new instance. The proxy doesn't need to be up yet for us to call set_proxy.
until (page = agent.get('http://www.google.com/') rescue nil)
  sleep 1
  puts 'waiting for proxy'
end

There's some debate about the best way to check whether a proxy is working, but I think it's simplest to just keep trying to connect until it works.
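
If you'd rather not hammer Google while you wait, another option is to poke the proxy port directly; a rough sketch using only the standard library:
require 'socket'
require 'timeout'

# returns true once something is accepting connections on the proxy port
def proxy_up?(host, port = 8080)
  Timeout.timeout(2) { TCPSocket.new(host, port).close }
  true
rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Timeout::Error
  false
end

sleep 1 until proxy_up?(proxy.public_ip_address)
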
puts page.title

proxy.destroy

Don't forget that last destroy line unless you want a big surprise on your next AWS bill.
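
If you're worried about an exception skipping that line, you can wrap the whole run so the instance gets terminated no matter what; a minimal sketch:
begin
  proxy = compute.servers.create :image_id => ENV['PROXY_AMI']
  proxy.wait_for { ready? }
  # ... connect through the proxy and scrape, as above ...
ensure
  proxy.destroy if proxy
end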

One last gotcha: make sure your default security group allows you to connect on port 8080 (or whatever port you choose). I like to allow all TCP traffic from my development machine's IP address.
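
You can script that too; a sketch with fog, assuming the default security group and that your fog version has authorize_port_range:
# compute is the Fog::Compute object from the script above
group = compute.security_groups.get('default')
group.authorize_port_range 8080..8080,
  :ip_protocol => 'tcp',
  :cidr_ip => 'x.x.x.x/32' # your development machine's IP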

Thursday, November 22, 2012

Using an AWS ec2 micro instance as a proxy

Today we'll be setting up a proxy on an EC2 micro instance. These instances are cheap (about $0.50 a day) but can handle plenty of throughput, which makes them perfect for running proxies. It's also nice that you can run them in practically any geographic region you want. This post assumes you already have an AWS account and are familiar with security groups and key pairs.

Start by logging into AWS, go to your Instances tab, and click Launch Instance. I'm choosing an Amazon Linux 32-bit instance, but you can use Ubuntu if you prefer. Choose micro for the instance type; the rest is up to you. When you get to the tags section, type 'Proxy' for the name.

When your instance launches, make a note of its Public DNS and connect to it like so:
ssh -i mykey.pem ec2-user@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
(replace mykey and the xxx's with your key name and your instance's Public DNS)

Now, to create a simple proxy, type:
cat > proxy.rb
and then paste in the following:

require 'webrick'
require 'webrick/httpproxy'

WEBrick::HTTPProxyServer.new(:Port => 8080).start

Then press Ctrl-D to exit cat.
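
By default WEBrick logs every request; if that output gets noisy, you can quiet it down with something like this instead (same proxy, just with logging disabled):
require 'webrick'
require 'webrick/httpproxy'

WEBrick::HTTPProxyServer.new(
  :Port => 8080,
  :Logger => WEBrick::Log.new('/dev/null'), # discard server log
  :AccessLog => []                          # disable per-request access log
).start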

We want the proxy to start on boot, so let's add it to the crontab:
crontab -e
Now add the following line:
@reboot /usr/bin/ruby /home/ec2-user/proxy.rb
Press Esc, then type :wq to save and exit vim.
Now reboot the server:
sudo reboot
That's it: you now have an HTTP/HTTPS proxy that you can connect to and push as much traffic through as you like (careful though, you will be charged for the bandwidth).
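
Once the instance is back up, you can test the proxy from your development machine; for example with Mechanize (swap in your instance's Public DNS):
require 'mechanize'

agent = Mechanize.new
agent.set_proxy 'ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com', 8080
puts agent.get('http://www.google.com/').title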

I want to do one more thing before we're finished here: turn this instance into an AMI so we can spin up more copies later.

Go back to your Instances tab, select the proxy instance, then go up to Instance Actions and select 'Create Image'. Name the image 'Proxy' and save it. This lets you spin up an identical instance on demand whenever you need a quick proxy. Once the AMI is created, you can terminate the proxy instance when you're done with it.
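
If you'd rather script this step, fog can create the image too; something along these lines should work, assuming your fog version exposes the create_image request (swap in your instance's id):
require 'fog'

compute = Fog::Compute.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

# create an AMI named 'Proxy' from the running instance
compute.create_image 'i-xxxxxxxx', 'Proxy', 'on demand scraping proxy'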

Saturday, November 17, 2012

Saving generated xml to an S3 store

Here's a problem that came up recently: how to generate an XML document and save it to S3 without creating an intermediate local file. I'm using Nokogiri's XML builder and the fog gem:
require 'fog'
require 'nokogiri'

# build the xml
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root do
    xml.foo 'bar'
  end
end

# log in to S3 and save it
storage = Fog::Storage.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

dir = storage.directories.get 'my_bucket'
dir.files.create :key => 'huge.xml', :body => builder.to_xml
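
One gotcha: directories.get returns nil if the bucket doesn't exist yet, so create it first if you need to; something like:
dir = storage.directories.get('my_bucket') ||
  storage.directories.create(:key => 'my_bucket')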


Notice the Amazon credentials are in environment variables rather than hard-coded. This protects you from accidentally giving them away (by leaving them in a public git repo, for example).

Sunday, November 4, 2012

Introducing PGBrowser

I finally got tired of complaining about how PHP doesn't have a mechanize-style scraping library that handles forms and cookies, and decided to make one. There aren't many bells and whistles (yet), but I did include support for ASP doPostBack actions.

To test it I used this classic example from the scraperwiki blog.

require 'pgbrowser/pgbrowser.php';

$b = new PGBrowser();
$page = $b->get('http://data.fingal.ie/ViewDataSets/');

while($nextLink = $page->at('//a[@id="lnkNext"][@href]')){
  echo "I found another page!\n";
  $page = $page->form()->doPostBack($nextLink->getAttribute('href'));
}

I expect this to really take the pain out of scraping forms with PHP from now on.
View the project or download the source.

Saturday, November 3, 2012

Choosing a PHP HTML parser

Today we'll be comparing some HTML parsing libraries for PHP and picking a winner. The ideal candidate will support CSS3 selectors, be DOM based, and be easy to use. I've rounded up three candidates:
  • Simple HTML DOM - A favorite with most PHP users, from what I've seen.
  • phpQuery - An interesting project, claiming to be a jQuery port.
  • Ganon - A new challenger.

The Setup

The test document is a small HTML fragment with a label and a span for each candidate, where the span ids end in 1 and 2; something like:

$html = '
  <label for="candidate_1">Candidate 1:</label>
  <span id="candidate_1">Mitt Romney</span>
  <label for="candidate_2">Candidate 2:</label>
  <span id="candidate_2">Barack Obama</span>
';

The Test

# simple html dom
require('simple_html_dom.php');
$doc = str_get_html($html);
echo $doc->find('label ~ span[id$=2]', 0)->innerText();

# phpquery
require('phpQuery.php');
$doc = phpQuery::newDocumentHTML($html);
phpQuery::selectDocument($doc);
echo pq('label ~ span[id$=2]')->text();

# ganon
require('ganon.php');
$doc = str_get_dom($html);
echo $doc('label ~ span[id$=2]', 0)->getInnerText();

The results

Simple HTML DOM was the only one to fail the CSS3 test. That's a deal breaker: without support for something as basic as the sibling selector (~), there's just no way I could justify looking any further at this library.

phpQuery passed the CSS3 test, its selection syntax feels the cleanest, and it's DOM based, unlike the other two. I also like that it's PEAR installable.

Ganon also passed the CSS3 test, and it actually outperformed phpQuery over 10K iterations (7 seconds to phpQuery's 8). Definitely a strong contender.

The winner

phpQuery - full CSS3 support, DOM based, PEAR installable, and a clean syntax edge out Ganon for the top spot this time.

Friday, November 2, 2012

Do it the right way with css

Looking around at other people's scraping projects, I still see a lot of people doing it the wrong way. And I'm not talking about parsing HTML with regex this time; I'm talking about using XPath expressions instead of CSS selectors. Let's take a look at some markup like this:
<label for="name">Name:</label> <span>Barack Obama</span>

There are two ways for me to get at the data I want:
# using xpath
doc.at('//label[@for="name"]/following-sibling::span').text
# using css
doc.at('label[for=name] + span').text
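
If you want to try both selectors yourself, here's a self-contained version using Nokogiri:
require 'nokogiri'

doc = Nokogiri::HTML('<label for="name">Name:</label> <span>Barack Obama</span>')

puts doc.at('//label[@for="name"]/following-sibling::span').text # xpath
puts doc.at('label[for=name] + span').text                       # css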

So which one is better? Unless you're a machine, the answer is always CSS. Because CSS is a human-friendly way to select the data you want, your code will be easier to maintain than the hot mess created with XPath expressions. There's a good reason web designers have been using it for so long.