
Saturday, June 15, 2013

Ruby - keep track of urls you've already visited

How do I keep track of urls that I've already visited in my ruby projects? Here's what I do:

def visited? url
  @visited ||= []
  return true if @visited.include? url
  @visited << url
  false
end

Now I've got a visited? method that tells me whether I've already called it with a given url:

visited?('http://www.google.com')
#=> false
visited?('http://www.google.com')
#=> true

That makes it easy to just do something like:

scrape(url) unless visited?(url)

This was a short post, but this is something I always put in my ruby code, and I'll be referencing it later.
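
By the way, if you're going to track a lot of urls, Array#include? rescans the whole list on every call. Here's the same idea with a Set for constant-time lookups (my own tweak, not something you need for small jobs):

require 'set'

def visited? url
  @visited ||= Set.new
  !@visited.add?(url) # Set#add? returns nil if url was already present
end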

Thursday, June 13, 2013

Replace ruby's URI parser with Addressable

Today I was trying to open a url in ruby, but URI didn't like it:

require 'open-uri'
open 'http://foo_bar.baz.com/'

generic.rb:213:in `initialize': the scheme http does not accept registry part: foo_bar.baz.com (or bad hostname?) (URI::InvalidURIError)

D'oh! This is a valid url, but sometimes URI can be a little bit old-fashioned about what it accepts.

The solution? Addressable is a more RFC-conformant replacement for URI. But how to get open-uri and other libs to use it?

After poking around the internet for an hour or so and not coming up with anything I settled on this:

require 'addressable/uri'

class URI::Parser
  def split url
    a = Addressable::URI.parse url
    # URI's split returns [scheme, userinfo, host, port, registry, path, opaque, query, fragment];
    # addressable has no registry or opaque, so those two slots are nil
    [a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
  end
end

open 'http://foo_bar.baz.com/'

Yay! No parse error (obviously the url still won't open because I made it up.)

Notice that I threw away 2 parts of the url: registry and opaque. Addressable doesn't have equivalents for these, and I never see them in practice anyway, so I don't expect that to be a problem.
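
For what it's worth, Addressable on its own has no trouble with that host. A quick sanity check:

require 'addressable/uri'

a = Addressable::URI.parse 'http://foo_bar.baz.com/'
a.host   #=> "foo_bar.baz.com"
a.scheme #=> "http"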

I'll just start putting this bit of code into all my projects from now on and we'll have to wait and see if it creates any problems.

Saturday, March 9, 2013

Ruby: Rate limiting concurrent downloads

Yesterday an interesting question was posed on stackoverflow: how do you ensure your script doesn't scrape a website or API too fast when making concurrent requests? Like so many interesting questions, this one was deemed "not a real question" and closed by moderators, so today I'll share my thoughts on the subject here. Let's avoid the complication of using EventMachine for this one, which, I could argue, creates as many problems as it solves.

First we're going to set up a queue and some variables. We'll use open-uri for the downloads to make it easy:
require 'open-uri'

queue = [
  'http://www.google.com/',
  'http://www.bing.com/',
  'http://www.yahoo.com/',
  'http://www.wikipedia.com/',
  'http://www.amazon.com/'
]

num_threads = 3 # more is better, memory permitting
delay_per_request = 1 # in seconds

Next we create our threads and give them something to do. In a real script you'd have them do something interesting, but for this example they'll just print out the url and the response body size:
threads = []

num_threads.times do
  threads << Thread.new do
    Thread.exit unless url = queue.pop # queue is empty; this thread is done
    puts "#{url} is #{open(url).read.length} bytes long"
  end
end

Now that we have our threads we want to 'join' them. We also want to time them to see how long they took:
start = Time.now
threads.each{|t| t.join}
elapsed = Time.now - start

If they finished too quickly, we need to take a short nap; otherwise we're free to continue processing the queue. With 3 threads and a 1 second delay, each batch of 3 requests is stretched to take at least 3 seconds, which averages out to 1 request per second:
time_to_sleep = num_threads * delay_per_request - elapsed
if time_to_sleep > 0
  puts "sleeping for #{time_to_sleep} seconds"
  sleep time_to_sleep
end

Ok, so now it's time to put it all together and process the queue in a loop:
require 'open-uri'

queue = [
  'http://www.google.com/',
  'http://www.bing.com/',
  'http://www.yahoo.com/',
  'http://www.wikipedia.com/',
  'http://www.amazon.com/'
]

num_threads = 3 # more is better, memory permitting
delay_per_request = 1 # in seconds

until queue.empty?
  threads = []

  num_threads.times do
    threads << Thread.new do
      Thread.exit unless url = queue.pop
      puts "#{url} is #{open(url).read.length} bytes long"
    end
  end

  start = Time.now
  threads.each{|t| t.join}
  elapsed = Time.now - start

  time_to_sleep = num_threads * delay_per_request - elapsed
  if time_to_sleep > 0
    puts "sleeping for #{time_to_sleep} seconds"
    sleep time_to_sleep
  end
end

If you found this useful, let me know.
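
One refinement I didn't cover: popping from a plain Array in multiple threads happens to work under MRI, but ruby's built-in Queue makes the intent explicit and lets you drop the batch/sleep bookkeeping. Here's a minimal sketch of the same rate limit done per-thread instead of per-batch; treat it as a starting point, not a drop-in replacement:

require 'open-uri'
require 'thread'

queue = Queue.new
['http://www.google.com/', 'http://www.bing.com/', 'http://www.yahoo.com/'].each { |url| queue << url }

num_threads = 3
delay_per_request = 1 # in seconds

threads = num_threads.times.map do
  Thread.new do
    while url = (queue.pop(true) rescue nil) # non-blocking pop, nil once the queue is empty
      puts "#{url} is #{open(url).read.length} bytes long"
      sleep num_threads * delay_per_request # each thread spaces itself out, keeping the overall rate at ~1 request per second
    end
  end
end

threads.each(&:join)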

Thursday, December 20, 2012

Easy Web Caching with VCR

Testing a scraper script sometimes means repeating a lot of http requests. Did you ever wish for an easy way to cache http responses to speed up your development? Here's an easy tip using ruby's vcr and fakeweb gems.

require 'vcr'
require 'fakeweb'

VCR.configure do |c|
  c.cassette_library_dir = 'cassettes'
  c.hook_into :fakeweb
  c.allow_http_connections_when_no_cassette = true
end

def cache cassette_name = 'my_cassette'
  VCR.use_cassette(cassette_name, :record => :new_episodes, :match_requests_on => [:method, :uri, :body]) do
    yield
  end
end

Save this to a file called 'cache.rb', and now you've got a simple way to cache requests in your scripts:

require 'mechanize'
require './cache.rb'

cache do
  @agent = Mechanize.new
  page = @agent.get 'http://www.amazon.com/'
  puts page.title
end
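
Since the cache helper takes a cassette name, you can also keep one cassette per site, which makes it easy to throw away a single site's cache later:

require 'mechanize'
require './cache.rb'

cache('amazon_home') do
  page = Mechanize.new.get 'http://www.amazon.com/'
  puts page.title # replayed from the cassette on every run after the first
end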

Wednesday, December 5, 2012

Scraping a site that requires login in ruby/php

Some people have trouble scraping websites that require a login. I'm going to demonstrate how to do it on a site I've seen people struggle with, namely stubhub. I'll scrape it using ruby mechanize, and then, just for fun, I'll do the same in php.

First instantiate your Mechanize object and turn off ssl verification:
require 'mechanize'
@agent = Mechanize.new
@agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

Rather than go straight to the 'Sign in' page, we'll fetch the homepage and then 'click through' to it. This more accurately mimics real browser behavior:
page = @agent.get 'http://www.stubhub.com/'
page = page.link_with(:text => 'Sign in').click

Find the form and fill out the login credentials. Notice that I use page.forms[1]; that's because the login form is the second form on the page. If you're not sure which form it is, you might want to throw in a binding.pry at this point and inspect page.forms:
form = page.forms[1]
form['loginEmail'] = email
form['loginPassword'] = password
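
(An aside: if you can't tell which index you need, a quick way to eyeball the candidates is to print each form's action before picking one:)

page.forms.each_with_index do |f, i|
  puts "#{i}: #{f.action}" # index and action url of every form on the page
end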

That's it. Submit the form and let's see if it gives us the logged-in text we expect:
page = form.submit
puts page.at('ul#headermenu li').text

The output says: Hi P! which means it worked and I'm logged in. Now let's cross our fingers and see if it's that easy in php using PGBrowser:
require 'pgbrowser/pgbrowser.php';

$b = new PGBrowser();

$page = $b->get('http://www.stubhub.com/');
$url = $page->at('//a[.="Sign in"]')->getAttribute('href');
$page = $b->get($url);

$form = $page->forms(1);
$form->set('loginEmail', $email);
$form->set('loginPassword', $password);
$page = $form->submit();

echo $page->at('//ul[@id="headermenu"]/li')->nodeValue;

It Works!

Thursday, November 29, 2012

Creating on demand proxies for your scraping script

Last week I wrote about using an ec2 instance as a proxy. If you followed all the steps, you should now have an ami image that you can use to spin up new proxies on demand. That's the goal for today.

I will be using ruby's fog gem and mechanize to launch a new proxy, connect through it and scrape simple content.

First go back and discover the ami id of the proxy ami you created. It's in the AMIs section of your ec2 dashboard and the id will look like: ami-xxxxxxxx

I like to keep things like this in my ENV so they don't end up in a script floating around on the internet. On windows, add PROXY_AMI to your environment variables; on linux, put export PROXY_AMI=ami-xxxxxxxx in your .bash_profile.

The script:
require 'mechanize'
require 'fog'

compute = Fog::Compute.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

proxy = compute.servers.create :image_id => ENV['PROXY_AMI']
proxy.wait_for { ready? }

At this point we have spun up the proxy instance and waited for it to reach a ready state. Note that this doesn't yet mean it's able to proxy our requests; there's a delay between bootup and the proxy process starting.
agent = Mechanize.new
agent.set_proxy proxy.public_ip_address, 8080

Now we have instantiated the Mechanize agent and set the proxy to our new instance. The proxy doesn't need to be ready for us to set_proxy to it.
until page = agent.get('http://www.google.com/') rescue nil
  sleep 1
  puts 'waiting for proxy'
end

There's some debate about what's the best way to check if a proxy is working, but I think it's best to just try to connect until it works.
puts page.title

proxy.destroy

Don't forget that last destroy line unless you want a big surprise on your next AWS bill.
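
If you're worried about a crash skipping that line, wrap the work in begin/ensure so the instance gets destroyed no matter what. A minimal sketch using the same objects from above:

proxy = compute.servers.create :image_id => ENV['PROXY_AMI']
begin
  proxy.wait_for { ready? }
  # ... set_proxy, scrape, etc. ...
ensure
  proxy.destroy # runs even if the scraping raises
end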

One last gotcha: make sure that your default security group allows you to connect on 8080 (or whatever port you choose). I like to allow all TCP traffic from my development machine's IP address.

Thursday, November 22, 2012

Using an AWS ec2 micro instance as a proxy

Today we'll be setting up a proxy using an ec2 micro instance. These instances are cheap ($.50/day) but can handle tons of throughput, which makes them perfect for running proxies. It's also nice that you can run them in practically any geographic location you want. This post assumes that you already have an aws account and are familiar with security groups and key pairs.

First, log into AWS, go to your Instances tab and click Launch Instance. I'm choosing an Amazon Linux 32-bit instance but you can use ubuntu if you prefer. Choose micro for the instance type; the rest is as you prefer. When you get to the tags section, type 'Proxy' for the name.

When your instance launches, make a note of its Public DNS and connect to it like so:
ssh -i mykey.pem ec2-user@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
(replace mykey and the xxx's with your keyname and public ip)

Now, to create a simple proxy, type:
cat > proxy.rb
and then paste in the following

require 'webrick'
require 'webrick/httpproxy'

WEBrick::HTTPProxyServer.new(:Port => 8080).start

Then type ctrl-D to exit cat.

We want the proxy to start running on startup so let's add it to the crontab:
crontab -e
now add the following line:
@reboot /usr/bin/ruby /home/ec2-user/proxy.rb
(esc, then :wq to save and exit vim)
now reboot the server:
sudo reboot
That's it, now you've got an http/https proxy that you can connect to and push as much traffic through as you like (careful though, you will be charged extra for the bandwidth.)
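
To make sure it's actually proxying, you can bounce a request through it from your development machine with open-uri (the hostname below is just the placeholder from earlier; substitute your instance's public DNS):

require 'open-uri'

proxy = 'http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com:8080'
puts open('http://www.google.com/', :proxy => proxy).read.length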

I want to do one more thing before we're finished here and make this instance into an AMI Image so we can spin up more later.

Go back to your Instances tab, select the proxy instance, then go up to Instance Actions and select 'Create Image'. Name the image 'Proxy' and save it. This will allow you to spin up an identical image in the future on demand and connect to it whenever you need a good quick proxy. Once the proxy AMI image is created you can terminate your proxy instance when you're done with it.

Saturday, November 17, 2012

Saving generated xml to an S3 store

Here's a problem that came up recently: how to generate an xml document and save it to S3 without creating an intermediate local file. I'm using Nokogiri's XML builder and the fog gem:
require 'fog'
require 'nokogiri'

# build the xml
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root do
    xml.foo 'bar'
  end
end

# log in to S3 and save it
storage = Fog::Storage.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

dir = storage.directories.get 'my_bucket'
dir.files.create :key => 'huge.xml', :body => builder.to_xml

Notice that the Amazon credentials are in environment variables rather than hard-coded. This protects you from accidentally giving them away (by leaving them in a public git repo, for example).
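
To double-check that the upload landed without pulling the whole body back down, fog can fetch just the metadata. A quick sketch (files.head should work here, but verify against your fog version):

file = dir.files.head 'huge.xml'
puts "huge.xml is #{file.content_length} bytes"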

Wednesday, October 31, 2012

Making on the fly changes to running ruby jobs

Let's say you've got a long-running ruby script and you need to be able to make config changes, but you don't want to stop the program because then you'd have to start all over. Add this to the top of the script:
require 'pry'
trap("INT"){binding.pry}


And now you can interrupt it with a ctrl-c. After making changes to the environment, typing exit will let the script resume; use exit-program if you really want the script to exit.

For example, I might use this to change the proxy my agent is connecting through when things get a little slow on a long scrape job.
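
To make that concrete, here's a tiny, contrived sketch; the @delay variable is just a stand-in for whatever config you'd want to poke at:

require 'pry'
trap("INT"){ binding.pry }

@delay = 5
loop do
  puts 'scraping...'
  sleep @delay # ctrl-c, then type @delay = 60 and exit to slow things down
end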

Sunday, October 28, 2012

Asp forms with doPostBack using ruby mechanize

Here's a simple scraping problem that causes trouble for lots of people: how to page through aspx search results that use doPostBack actions. I'm basing this on an older ScraperWiki post that used python.

doPostBack is not as scary as people think. It basically takes 2 arguments, sets two hidden form values and submits the form. I'll make it simple by monkey-patching it into Mechanize::Form:

require 'mechanize'

class Mechanize::Form
  def postback target, argument
    self['__EVENTTARGET'], self['__EVENTARGUMENT'] = target, argument
    submit
  end
end

The rest is simple: find the 'Next' link, parse out the values and send them to Form#postback. Put it in a while loop and you've got paging:
agent = Mechanize.new
page = agent.get 'http://data.fingal.ie/ViewDataSets/'

while next_link = page.at('a#lnkNext[href]')
  puts 'I found another page!'
  # the href looks like javascript:__doPostBack('target','argument') - pull out the two quoted values
  target, argument = next_link[:href].scan(/'([^']*)'/).flatten
  page = page.form.postback target, argument
end

The result is much cleaner than what I've seen from the python side. Ruby's mechanize is sophisticated enough to avoid all of the many pitfalls of its python counterpart. No wonder I like it so much!