Monday, April 15, 2013

Scrape a Website in PHP with a Network Cache

Sometimes it's helpful to use cached responses in a scraping project. For example, when you're running a long scrape job and you're afraid the script will crash halfway through, or when you're fine-tuning a CSS selector or XPath expression five requests into your script and are losing productivity to the 30-second delay.

Now you can avoid those delays with PGBrowser's useCache property:

$b = new PGBrowser();
$b->useCache = true;

Let's try it out with a test I like to use for the forms functionality:

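require 'pgbrowser.php';
$b = new PGBrowser();
$b->useCache = true; // save responses and reuse them on later runs
$page = $b->get('http://www.google.com/');
// fill in the search form and submit it
$form = $page->form();
$form->set('q', 'foo');
$page = $form->submit();
// the results page title should start with the search term
echo preg_match('/foo - /', $page->title) ? 'success' : 'failure';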

The first time you run this script, it takes about 6 seconds. The responses get saved in a folder called cache, and the next time you run it, it should only take about 1 second.
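If you're curious how a cache like this can work, here is a minimal sketch of the general idea. It is not PGBrowser's actual implementation: the cachedFetch function, its parameters, the md5 file naming, and the stand-in fetcher are all hypothetical names made up for illustration. Each request is keyed on its URL and POST data, and the response body is written to a file the first time and read back from that file afterwards.

```php
<?php
// Hypothetical helper, NOT part of PGBrowser: return the cached body for a
// request if one exists, otherwise call $fetch and store what it returns.
function cachedFetch(string $url, string $postData, callable $fetch, string $cacheDir): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    // Key on URL plus POST body so a form submission does not collide
    // with a plain GET of the same URL.
    $file = $cacheDir . '/' . md5($url . "\n" . $postData);
    if (is_file($file)) {
        return file_get_contents($file);
    }
    $body = $fetch($url, $postData);
    file_put_contents($file, $body);
    return $body;
}

// Stand-in fetcher (an assumption for the demo) that counts "network" calls.
$calls = 0;
$fakeFetch = function (string $url, string $postData) use (&$calls): string {
    $calls++;
    return "<title>response for $url</title>";
};

$dir = sys_get_temp_dir() . '/cache_demo_' . uniqid();
$first = cachedFetch('http://example.com/', '', $fakeFetch, $dir);
$second = cachedFetch('http://example.com/', '', $fakeFetch, $dir);

// The second call is answered from disk, so the fetcher ran only once.
echo ($calls === 1 && $first === $second) ? "served from cache\n" : "cache miss\n";
```

A cache this simple never expires anything, which is usually what you want while debugging a scraper: delete the folder when you need fresh responses.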
View the project or download the latest source.
