Saturday, June 15, 2013

Ruby - keep track of urls you've already visited

How do I keep track of urls that I've already visited in my ruby projects? Here's what I do:

def visited? url
  @visited ||= []
  return true if @visited.include? url
  @visited << url
  false # first time we've seen this url
end

Now I've got a visited? method that tells me whether I've already called it on a url:

#=> false
#=> true
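
Array#include? scans the whole list on every call, so for a long crawl a Set-backed variant may be worth considering. This is only a sketch of that idea, not a replacement for the method above; the example.com urls are made up for illustration.

```ruby
require 'set'

# Variant of visited? backed by a Set instead of an Array
def visited?(url)
  @visited ||= Set.new
  # Set#add? returns nil when the element was already present,
  # and the set itself when the element was newly added
  @visited.add?(url).nil?
end

first  = visited?('http://example.com/a') # hypothetical url
second = visited?('http://example.com/a')
other  = visited?('http://example.com/b')
puts first, second, other
```

The calling code stays the same; only the storage behind the method changes.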

That makes it easy to just do something like:

scrape(url) unless visited?(url)

This was a short post, but it's something I always put in my ruby code and I'll be referencing it later.

Thursday, June 13, 2013

Replace ruby's URI parser with Addressable

Today I was trying to open a url in ruby that URI didn't like:

require 'open-uri'
open ''

generic.rb:213:in `initialize': the scheme http does not accept registry part: (or bad hostname?) (URI::InvalidURIError)

D'oh! This is a valid url, but sometimes URI can be a little bit old-fashioned about what to accept.

The solution? Addressable is a more RFC-conformant replacement for URI. But how to get open-uri and other libs to use it?

After poking around the internet for an hour or so and not coming up with anything, I settled on this:

require 'addressable/uri'

class URI::Parser
  # Delegate parsing to Addressable and return the 9-element array URI expects
  def split url
    a = Addressable::URI.parse url
    [a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
  end
end

open ''

Yay! No parse error (obviously the url still won't open because I made it up.)

Notice that I threw away 2 parts of the url, registry and opaque. These are things that Addressable doesn't have and that I never see anyway, so I don't expect it to be a problem.
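
To see where registry and opaque sit in the array, it can help to look at what the standard library's own URI.split returns for an ordinary url. A small sketch, using only the stdlib; the url and the `names`/`labelled` variables are made up for illustration.

```ruby
require 'uri'

# Hypothetical url, chosen only so that most of the fields are populated
parts = URI.split('http://user@example.com:8080/path?q=1#frag')

# The nine positions, in the order URI.split documents them
names = %i[scheme userinfo host port registry path opaque query fragment]
labelled = names.zip(parts).to_h

labelled.each { |name, value| puts "#{name}: #{value.inspect}" }
```

For a plain http url like this one, the registry and opaque slots come back as nil, which is why filling them with nil in the override looks safe for everyday urls.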

I'll just start putting this bit of code into all my projects from now on and we'll have to wait and see if it creates any problems.