First we will set up our proxies:
$proxies = array( 'proxy1.com:80', 'proxy2.com:80', 'proxy3.com:80' );
Next, define what a good response looks like. You might check the title or the full HTML for a specific string; for this tutorial, though, we count any page whose HTML is longer than 0 bytes as a good response.
function response_ok($page){
return strlen($page->html) > 0; # any non-empty html counts as ok
}
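The stricter check mentioned above (looking for a specific string in the HTML) could be sketched roughly as follows. The function name response_contains() and the $marker parameter are made up for illustration, and the page objects are plain stdClass stand-ins; the only thing taken from the tutorial is that a page exposes its markup as $page->html.

```php
<?php
// Hypothetical stricter check: the page must be non-empty AND contain a marker string.
function response_contains($page, $marker) {
    // $page->html is the same property response_ok() reads
    if (!isset($page->html) || strlen($page->html) === 0) {
        return false;
    }
    // strpos() returns false when the marker is absent; compare strictly,
    // because a match at offset 0 would otherwise be treated as falsy
    return strpos($page->html, $marker) !== false;
}

// Stand-in page objects, used here only to exercise the function
$good = new stdClass();
$good->html = '<html><title>Product list</title></html>';

$blocked = new stdClass();
$blocked->html = '<html><title>Access denied</title></html>';

$good_ok = response_contains($good, 'Product list');
$blocked_ok = response_contains($blocked, 'Product list');
```

A check like this is useful when a bad proxy returns a block page or a captcha, since such pages have a non-zero length and would pass the simpler test.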
Now, instead of calling $browser->get() directly, we are going to use a custom get() function that checks responses and rotates proxies as needed:
function get($url){
global $browser, $proxies;
$page = $browser->get($url);
while(!response_ok($page)){
$proxy = array_shift($proxies); # grab the first proxy
array_push($proxies, $proxy); # push it back to the end
echo "switching proxy to $proxy\n";
list($host,$port) = explode(':', $proxy);
$browser->setProxy($host, $port);
$page = $browser->get($url);
}
return $page;
}
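Note that the loop above keeps rotating for as long as responses fail, so it never returns if every proxy is bad. One possible variation, sketched below, gives up after a fixed number of switches. The names get_with_limit(), $max_attempts and FakeBrowser are assumptions made up for this example; FakeBrowser is only a stand-in that mimics the get() and setProxy() calls used in this tutorial so the sketch can run without PGBrowser.

```php
<?php
// Stand-in for the real browser: only get() and setProxy() are mimicked.
class FakeBrowser {
    public $proxy = null;

    public function setProxy($host, $port) {
        $this->proxy = "$host:$port";
    }

    public function get($url) {
        $page = new stdClass();
        // Pretend that only proxy2.com returns any content
        $page->html = ($this->proxy === 'proxy2.com:80') ? '<html>ok</html>' : '';
        return $page;
    }
}

function response_ok($page) {
    return strlen($page->html) > 0;
}

// Hypothetical variant of get(): returns null after $max_attempts failed switches
function get_with_limit($browser, array &$proxies, $url, $max_attempts) {
    $page = $browser->get($url);
    $attempts = 0;
    while (!response_ok($page)) {
        if ($attempts >= $max_attempts) {
            return null; // every allowed attempt failed
        }
        $proxy = array_shift($proxies);   // grab the first proxy
        array_push($proxies, $proxy);     // push it back to the end
        list($host, $port) = explode(':', $proxy);
        $browser->setProxy($host, $port);
        $page = $browser->get($url);
        $attempts++;
    }
    return $page;
}

$proxies = array('proxy1.com:80', 'proxy2.com:80', 'proxy3.com:80');
$browser = new FakeBrowser();
$page = get_with_limit($browser, $proxies, 'http://myurl.com', count($proxies));
```

Passing the browser and the proxy list as arguments instead of using globals is a design choice of this sketch, not something the original get() does; it makes the function easier to test with a stand-in browser.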
That's it. Now we just call get() the same way we would ordinarily call $browser->get():
$browser = new PGBrowser();
$page = get('http://myurl.com'); # scrape with rotating proxy support