Saturday, June 15, 2013

Ruby - keep track of urls you've already visited

How do I keep track of URLs that I've already visited in my Ruby projects? Here's what I do:

def visited? url
  @visited ||= []
  return true if @visited.include? url
  @visited << url
  false
end

Now I've got a visited? method that tells me whether I've already called it on a URL:

visited?('http://www.google.com')
#=> false
visited?('http://www.google.com')
#=> true

That makes it easy to just do something like:

scrape(url) unless visited?(url)

This was a short post, but this is something I always put in my Ruby code and I'll be referencing it later.
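
For long crawls, Array#include? scans the whole list on every call, so a Set may be a better fit. Here is a minimal sketch of the same idea, assuming a small wrapper class (VisitTracker is a made-up name, not something from the code above) and made-up example URLs:

```ruby
require 'set'

# Hypothetical wrapper class; the original defines visited? at the top level.
class VisitTracker
  def initialize
    @visited = Set.new
  end

  # Returns false the first time a URL is seen, true on every later call.
  def visited?(url)
    # Set#add? returns nil when the element is already in the set.
    @visited.add?(url).nil?
  end
end

tracker = VisitTracker.new
urls = ['http://www.google.com', 'http://www.example.com', 'http://www.google.com']

# Keep only the URLs we have not seen before; the duplicate is dropped.
fresh = urls.reject { |url| tracker.visited?(url) }
```

The behaviour matches the array version: the first call records the URL and returns false, and every later call returns true.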

Thursday, June 13, 2013

Replace ruby's URI parser with Addressable

Today I was trying to open a URL in Ruby that URI refused to parse:

require 'open-uri'
open 'http://foo_bar.baz.com/'

generic.rb:213:in `initialize': the scheme http does not accept registry part: foo_bar.baz.com (or bad hostname?) (URI::InvalidURIError)

D'oh! This is a valid URL (RFC 3986 allows underscores in the host), but URI can be a little bit old-fashioned about what it accepts: it applies the stricter hostname rules of the older RFC 2396.

The solution? Addressable is a more RFC-conformant replacement for URI. But how to get open-uri and other libs to use it?

After poking around the internet for an hour or so and not coming up with anything, I settled on this:

require 'addressable/uri'

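# Monkeypatch: make URI's parser hand the splitting over to Addressable.
class URI::Parser
  # Must return the nine components in the order URI expects:
  # scheme, userinfo, host, port, registry, path, opaque, query, fragment
  def split(url)
    a = Addressable::URI.parse(url)
    [a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
  end
end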

open 'http://foo_bar.baz.com/'

Yay! No parse error (obviously the URL still won't open, because I made it up).

Notice that I threw away two parts of the URL: registry and opaque (the two nils in the array). Addressable doesn't have them and I never see them anyway, so I don't expect that to be a problem.
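
To see where those two slots sit, here is a small sketch that labels the nine values the standard library's URI.split returns. The FIELDS constant and the example URL are my own illustration, not part of URI or Addressable:

```ruby
require 'uri'

# Hypothetical labels for the nine positions, in the order URI.split returns them.
FIELDS = %i[scheme userinfo host port registry path opaque query fragment].freeze

# Made-up example URL with every hierarchical part filled in.
parts = URI.split('http://user@example.com:8080/path?q=1#frag')

# Pair each label with its value so the empty slots are easy to spot.
labeled = FIELDS.zip(parts).to_h

labeled.each { |name, value| puts "#{name}: #{value.inspect}" }
```

For an ordinary http URL like this one, registry and opaque come back as nil, which is why filling them with nil in the patched split seems harmless for the URLs I deal with.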

I'll just start putting this bit of code into all my projects from now on and we'll have to wait and see if it creates any problems.