Sunday, July 7, 2013

PHP - scrape website with rotating proxies

Have you ever tried to scrape a website that imposes per-IP-address request limits? Let's take a look at how we can get around that in PHP by rotating through an array of proxies. This tutorial uses the PGBrowser scraping library.

First we will set up our proxies:

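$proxies = array('proxy1.example.com:80', 'proxy2.example.com:80'); # placeholder "host:port" entries, substitute your own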

Next we want to define what a good response looks like. You might want to check the title or the full HTML for a specific string. For this tutorial, though, we count any page whose HTML is longer than 0 bytes as a good response.

function response_ok($page){
  return strlen($page->html) > 0;
}
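
A non-empty body is a lenient test, since a block page or CAPTCHA page usually has HTML too. As a rough sketch of the stricter "check for a specific string" idea mentioned above, the helper below works on a raw HTML string. The name html_looks_ok() and the marker strings are made up for illustration and are not part of PGBrowser:

```php
<?php
// Hypothetical helper (not part of PGBrowser): decide whether raw HTML looks
// like the page we wanted rather than an empty response or a block page.
function html_looks_ok($html, $required, $forbidden = array()){
  if (strlen($html) === 0) return false;                  // empty body counts as failure
  if (stripos($html, $required) === false) return false;  // expected marker is missing
  foreach ($forbidden as $needle) {
    if (stripos($html, $needle) !== false) return false;  // block-page phrase was found
  }
  return true;
}

// The marker strings here are assumptions; pick ones that fit the site you scrape.
var_dump(html_looks_ok('<title>Product list</title>', 'Product list'));
var_dump(html_looks_ok('<title>Access denied</title>', 'Product list'));
```

Inside response_ok() you could then return html_looks_ok($page->html, ...) with whatever marker suits your target site.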

Now instead of using $browser->get(), we are going to use a custom get() function that checks responses and rotates proxies as needed:

function get($url){
  global $browser, $proxies;
  $page = $browser->get($url); # first try with the current settings
  while(!response_ok($page)){ # keep rotating until we get a good response
    $proxy = array_shift($proxies); # grab the first proxy
    array_push($proxies, $proxy); # push it back to the end
    echo "switching proxy to $proxy\n";
    list($host,$port) = explode(':', $proxy);
    $browser->setProxy($host, $port);
    $page = $browser->get($url);
  }
  return $page;
}
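
One thing to watch for: if every proxy is blocked or dead, a loop that rotates until the response is good never ends. The sketch below shows one possible way to cap the number of attempts. It is not how PGBrowser works internally; get_with_limit() and the $fetch callable are hypothetical names, and $fetch stands in for the setProxy() + get() pair so the example runs without a network. Unlike get() above, it rotates before the first attempt:

```php
<?php
// Hypothetical sketch: try each proxy at most $max_tries times in total.
// $fetch is any callable taking ($url, $host, $port) and returning the HTML
// as a string ('' on failure).
function get_with_limit($url, array &$proxies, $fetch, $max_tries = null){
  if ($max_tries === null) $max_tries = count($proxies); // default: one pass over the list
  for ($i = 0; $i < $max_tries; $i++) {
    $proxy = array_shift($proxies);   // grab the first proxy
    array_push($proxies, $proxy);     // push it back to the end
    list($host, $port) = explode(':', $proxy);
    $html = call_user_func($fetch, $url, $host, (int)$port);
    if (strlen($html) > 0) return $html; // same "good response" rule as response_ok()
  }
  return null; // every attempt failed; the caller decides what to do next
}

// Fake fetcher for demonstration: only the proxy on port 8082 "works".
$fake = function($url, $host, $port){
  return $port === 8082 ? '<html>ok</html>' : '';
};
$list = array('a.example:8081', 'b.example:8082', 'c.example:8083');
var_dump(get_with_limit('http://example.com/', $list, $fake));
```

Returning null instead of looping forever lets the calling code log the failure, sleep, or give up on that URL.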

That's it. Now we just call get() the same way we would ordinarily call $browser->get():
$browser = new PGBrowser();
$page = get(''); # scrape with rotating proxy support