Thursday, November 29, 2012

Creating on-demand proxies for your scraping script

Last week I wrote about using an EC2 instance as a proxy. If you followed all the steps, you should now have an AMI that you can use to spin up new proxies on demand. Doing exactly that is the goal for today.

I will be using Ruby's fog gem and mechanize to launch a new proxy, connect through it, and scrape some simple content.

First, go back and find the AMI ID of the proxy AMI you created. It's in the AMIs section of your EC2 dashboard, and the ID will look like: ami-xxxxxxxx

I like to keep things like this in my ENV so they don't end up in a script floating around on the internet. On Windows, add PROXY_AMI to your environment variables; on Linux, put export PROXY_AMI=ami-xxxxxxxx in your .bash_profile
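
Since the script below reads three values from ENV, it can help to fail early when one of them is missing. Here is a minimal sketch of that check; missing_keys is a hypothetical helper name of my own, not part of fog or mechanize, and the key names are simply the ones the script uses.

```ruby
# Hypothetical helper: returns the names from `keys` that are unset or blank in `env`.
# The default list matches the ENV lookups in the script in this post.
def missing_keys(env, keys = %w[AMAZON_ACCESS_KEY_ID AMAZON_SECRET_ACCESS_KEY PROXY_AMI])
  keys.select { |key| env[key].to_s.strip.empty? }
end

# Possible usage at the top of the script:
#   missing = missing_keys(ENV)
#   abort "Missing environment variables: #{missing.join(', ')}" unless missing.empty?
```

Passing the environment in as an argument keeps the helper easy to try out with a plain Hash.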

The script:
require 'mechanize'
require 'fog'

compute = Fog::Compute.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

proxy = compute.servers.create :image_id => ENV['PROXY_AMI']
proxy.wait_for { ready? }

At this point we have spun up the proxy instance and waited for it to be in a ready state. Note that this doesn't yet mean that it's able to proxy our requests. This is because there is a delay between bootup and starting the proxy process.
agent = Mechanize.new
agent.set_proxy proxy.public_ip_address, 8080

Now we have instantiated the Mechanize agent and pointed it at our new instance. The proxy does not need to be up yet for set_proxy to succeed.
until page = agent.get('http://www.google.com/') rescue nil
  sleep 1
  puts 'waiting for proxy'
end

There's some debate about what's the best way to check if a proxy is working, but I think it's best to just try to connect until it works.
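
The loop above will spin forever if the proxy never comes up (a bad AMI, a blocked port), and the instance keeps billing the whole time. A bounded variant of the same "try until it works" idea might look like the sketch below. retry_until is a hypothetical helper of my own, and the attempt count and delay are arbitrary assumptions.

```ruby
# Hypothetical helper: calls the block until it returns a truthy value,
# sleeping `delay` seconds between tries, and gives up after `attempts` tries.
# Returns the block's value on success, or nil if every attempt failed.
def retry_until(attempts: 60, delay: 1)
  attempts.times do |i|
    result = begin
      yield
    rescue StandardError
      nil # treat errors (e.g. connection refused) as "not ready yet"
    end
    return result if result
    sleep delay if i < attempts - 1
  end
  nil
end

# Possible usage with the agent from the script:
#   page = retry_until(attempts: 120) { agent.get('http://www.google.com/') }
#   abort 'proxy never came up' unless page
```
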
puts page.title

proxy.destroy

Don't forget that last destroy line unless you want a big surprise on your next AWS bill.
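
If the scrape raises an exception halfway through, that destroy line never runs. One way to guard against this is Ruby's ensure, sketched below. with_server is a hypothetical helper of my own; it only assumes the object it is given responds to destroy, as the fog server in the script does.

```ruby
# Hypothetical helper: yields the server to the block and always calls
# destroy on it afterwards, even when the block raises.
def with_server(server)
  yield server
ensure
  server.destroy
end

# Possible usage with the fog objects from the script:
#   with_server(compute.servers.create(:image_id => ENV['PROXY_AMI'])) do |proxy|
#     proxy.wait_for { ready? }
#     # ... set_proxy and scrape here ...
#   end
```

The exception still propagates after cleanup, so you find out about the failure without paying for a forgotten instance.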

One last gotcha: make sure that your default security group allows you to connect on 8080 (or whatever port you choose). I like to allow all TCP traffic from my development machine's IP address.
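
If you're not sure whether the port is reachable, a plain TCP connection attempt is a quick check that takes mechanize out of the picture. This is a rough sketch using only Ruby's standard socket library; port_open? is a hypothetical helper of my own and the 3 second timeout is an arbitrary assumption.

```ruby
require 'socket'

# Hypothetical helper: true if a TCP connection to host:port succeeds
# within `timeout` seconds, false if it is refused, times out, or fails to resolve.
def port_open?(host, port, timeout = 3)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

# Possible usage with the instance from the script:
#   puts 'port 8080 is reachable' if port_open?(proxy.public_ip_address, 8080)
```

Keep in mind that a false result right after boot may only mean the proxy process hasn't started yet, so it's worth checking again after a minute before blaming the security group.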
