First we will set up our proxies:
$proxies = array( 'proxy1.com:80', 'proxy2.com:80', 'proxy3.com:80' );
Next we want to define what a good response looks like. You might want to check the title or the full HTML for a specific string. For this tutorial, though, we'll count any page whose HTML is longer than 0 bytes as a good response.
function response_ok($page){ return strlen($page->html) > 0; }
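If you do want the stricter check mentioned above, here is a minimal sketch. The function name `response_contains` and the `$marker` parameter are made up for this illustration and are not part of PGBrowser. The only assumption about the page is that it exposes its markup as `$page->html`, as in the tutorial. A plain `stdClass` stands in for a real page so the snippet runs on its own.

```php
<?php
// Hypothetical stricter check: the HTML must be non-empty AND contain a marker string.
// $marker is an illustrative parameter; choose text that only appears on a good page.
function response_contains($page, $marker) {
    $html = isset($page->html) ? (string) $page->html : '';
    return $html !== '' && strpos($html, $marker) !== false;
}

// Stand-in page object so the sketch runs without a browser.
$page = new stdClass();
$page->html = '<html><title>My Shop</title><body>Welcome</body></html>';

var_dump(response_contains($page, '<title>My Shop</title>')); // bool(true)
var_dump(response_contains($page, 'Access denied'));          // bool(false)
```

Note the `!== false` comparison: `strpos()` returns 0 when the marker sits at the very start of the string, and a loose comparison would treat that match as a failure.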
Now instead of using $browser->get(), we are going to use a custom get() function that checks responses and rotates proxies as needed:
function get($url){
  global $browser, $proxies;
  $page = $browser->get($url);
  while(!response_ok($page)){
    $proxy = array_shift($proxies); # grab the first proxy
    array_push($proxies, $proxy); # push it back to the end
    echo "switching proxy to $proxy\n";
    list($host,$port) = explode(':', $proxy);
    $browser->setProxy($host, $port);
    $page = $browser->get($url);
  }
  return $page;
}
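One thing to watch out for: the loop above never stops if none of the proxies returns a good page. Below is a rough sketch of a bounded variant that tries each proxy at most once and then gives up. `FakeBrowser`, `page_ok()` and `get_with_limit()` are names invented for this example. The fake browser only mimics the `get()` and `setProxy()` calls used in the tutorial, so the snippet runs without PGBrowser or a network connection.

```php
<?php
// Hypothetical stand-in for the real browser: returns empty HTML unless the
// current proxy host is proxy3.com. Only get() and setProxy() are mimicked.
class FakeBrowser {
    public $host = null;
    public function setProxy($host, $port) { $this->host = $host; }
    public function get($url) {
        $page = new stdClass();
        $page->html = ($this->host === 'proxy3.com') ? '<html>ok</html>' : '';
        return $page;
    }
}

// Same rule as the tutorial: any non-empty HTML counts as a good response.
function page_ok($page) {
    return strlen($page->html) > 0;
}

// Try the current connection first, then each proxy at most once.
// Returns the page, or null when every proxy was tried without success.
function get_with_limit($browser, array &$proxies, $url) {
    $page = $browser->get($url);
    $attempts = count($proxies);
    while (!page_ok($page) && $attempts-- > 0) {
        $proxy = array_shift($proxies);   // grab the first proxy
        array_push($proxies, $proxy);     // push it back to the end
        list($host, $port) = explode(':', $proxy);
        $browser->setProxy($host, $port);
        $page = $browser->get($url);
    }
    return page_ok($page) ? $page : null;
}

$proxies = array('proxy1.com:80', 'proxy2.com:80', 'proxy3.com:80');
$page = get_with_limit(new FakeBrowser(), $proxies, 'http://myurl.com');
echo $page === null ? "all proxies failed\n" : "got " . strlen($page->html) . " bytes\n";
```

Passing `$browser` and `$proxies` as arguments instead of using `global` is only a choice made for this sketch so it can run with a fake browser. The rotation itself is the same shift-and-push approach as in the function above.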
That's it. Now we just call get() in the same way we would ordinarily call $browser->get():
$browser = new PGBrowser();
$page = get('http://myurl.com'); # scrape with rotating proxy support