ScraperBlog: April 2013

Saturday, April 20, 2013

Finding a good proxy provider

The other day I was looking for a good free or cheap UK proxy for a scraper script I was writing for a client. My go-to proxy was an EC2 micro instance in the Ireland region, but for some reason this wasn't good enough, so I went hunting for one based in England.

I found a lot of free listings (xroxy.com), but all the HTTP/HTTPS proxies on the list were either down or not in the UK like they claimed.
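If you want to weed out dead entries from a list like that before relying on one, a quick liveness check helps. The following is a minimal sketch using PHP's curl extension; the function name proxyIsAlive, the test URL, and the timeout are my own assumptions, and it only tells you whether the proxy answers, not which country it is really in.

```php
<?php
// Hypothetical helper: returns true if a request routed through $proxy
// ("host:port") comes back with a 2xx/3xx status before the timeout.
function proxyIsAlive(string $proxy, string $testUrl = 'http://example.com/', int $timeoutSeconds = 10): bool
{
    $ch = curl_init($testUrl);
    curl_setopt_array($ch, [
        CURLOPT_PROXY          => $proxy,          // route the request through the proxy
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // give up quickly on dead proxies
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return $body !== false && $status >= 200 && $status < 400;
}
```

Checking the claimed location would take an extra step, such as fetching an IP geolocation page through the proxy, which this sketch leaves out.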

Then I found these guys at $4 for a month, so I decided to take a chance.

I'm happy with the experience: a fast connection that doesn't get blocked by some of the sites that block EC2 traffic.

Watch out for the recurring charge though. They will keep charging me every month until I cancel.

Oh yeah, I am not affiliated in any way, blah blah blah.

Monday, April 15, 2013

Scrape a Website in PHP with a Network Cache

Sometimes it's helpful to use cached responses in a scraping project. For example, when you're running a long scrape job and you're afraid the script will crash halfway through, or when you're fine-tuning a CSS selector or XPath expression 5 requests into your script and are losing productivity to the 30-second delay on every run.

Now you can avoid those delays with PGBrowser's useCache property:

$b = new PGBrowser();
$b->useCache = true;

Let's try it out with a test I like to use for the forms functionality:

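require 'pgbrowser.php';

$b = new PGBrowser();
$b->useCache = true; // reuse saved responses on later runs

// Load the Google home page and grab its search form
$page = $b->get('http://www.google.com/');
$form = $page->form();

// Fill in the query field and submit the search
$form->set('q', 'foo');
$page = $form->submit();

// The results page title should start with the search term
echo preg_match('/foo - /', $page->title) ? 'success' : 'failure';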

The first run of this script takes about 6 seconds. The responses get saved in a folder called cache, and the next time you run it, it should take only about 1 second.
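To give a rough idea of what a network cache like this does, here is a minimal sketch of a disk cache keyed by URL. This is not PGBrowser's actual implementation: the function name cachedGet, the md5 file naming, and the $fetch callback are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: return the cached body for $url if one exists,
// otherwise call $fetch($url), save the result under $cacheDir, and return it.
function cachedGet(string $url, callable $fetch, string $cacheDir = 'cache'): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true); // create the cache folder on first use
    }

    // One file per URL; md5 gives a filesystem-safe name (an assumed scheme)
    $path = $cacheDir . '/' . md5($url);

    if (is_file($path)) {
        return file_get_contents($path); // cache hit: no network request
    }

    $body = $fetch($url);            // cache miss: do the real request
    file_put_contents($path, $body); // save it for the next run
    return $body;
}
```

A real cache for form submissions would also need to include the request method and POST data in the key, and deleting the cache folder is the simplest way to force fresh requests.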
View the project or download the latest source.