Showing posts with label nokogiri.

Saturday, November 17, 2012

Saving generated xml to an S3 store

Here's a problem that came up recently: how to generate an xml document and save it to S3 without creating an intermediate local file. I'm using Nokogiri's XML builder and the fog gem:
require 'fog'
require 'nokogiri'

# build the xml
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root do
    xml.foo 'bar'
  end
end

# log in to S3 and save it
storage = Fog::Storage.new :provider => 'AWS',
  :aws_access_key_id => ENV['AMAZON_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AMAZON_SECRET_ACCESS_KEY']

dir = storage.directories.get 'my_bucket'
dir.files.create :key => 'huge.xml', :body   => builder.to_xml


Notice the Amazon credentials are read from environment variables rather than hard-coded. This protects you from accidentally giving them away (by leaving them in a public git repo, for example).
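One refinement worth considering (my own sketch, not from the original snippet): `ENV['FOO']` silently returns nil when the variable is unset, and Fog will then fail later with a harder-to-diagnose error. Ruby's `ENV.fetch` raises a `KeyError` immediately instead:

```ruby
# fail fast when a credential is missing, rather than
# passing nil through to Fog::Storage.new
def aws_credentials
  {
    :aws_access_key_id     => ENV.fetch('AMAZON_ACCESS_KEY_ID'),
    :aws_secret_access_key => ENV.fetch('AMAZON_SECRET_ACCESS_KEY')
  }
end
```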

Friday, November 2, 2012

Do it the right way with css

Looking around at other people's scraping projects, I still see a lot of people doing it the wrong way. And I'm not talking about parsing html with regex this time. I'm talking about using xpath expressions instead of css selectors. Let's take a look:
<label for="name">Name</label> <span>Barack Obama</span>

There are two ways for me to get at the data I want:
# using xpath
doc.at('//label[@for="name"]/following-sibling::span').text
# using css
doc.at('label[for=name] + span').text

So which one is better? Unless you're a machine, the answer is always css. Because css is a human-friendly way to select the data you want, your code will be easier to maintain than the hot mess created with xpath expressions. There's a good reason why web designers have been using it for so long.