ScraperBlog: July 2013

Tuesday, July 9, 2013

PHP cURL multipart form posting

Let's say you want to post some data to a form using curl:

$url = 'http://www.example.com/';
$data = array('foo' => '1', 'bar' => '2');

Ordinarily you would create the post body with http_build_query(), which produces URL-encoded data. But let's say the form expects the data to be multipart encoded (multipart/form-data). Now we've got a challenge. First, let's create a function to do the multipart encoding:

function multipart_build_query($fields, $boundary){
  $retval = '';
  foreach($fields as $key => $value){
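    // each part: delimiter line, header, blank line, value; multipart lines end with CRLF
    $retval .= "--$boundary\r\nContent-Disposition: form-data; name=\"$key\"\r\n\r\n$value\r\n";
  }
  $retval .= "--$boundary--\r\n"; // closing delimiter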
  return $retval;
}

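The boundary is a string that separates the fields. It can be almost any string you want (up to 70 characters), but you should choose something long and unusual enough that it won't show up in your data. Note that each delimiter line in the body is the boundary prefixed with two extra dashes.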

$boundary = '--myboundary-xxx';
$body = multipart_build_query($data, $boundary);
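If you would rather not pick a boundary by hand, one option is to generate a random one and confirm that it does not occur in any field name or value. This is only a minimal sketch: make_boundary() is a hypothetical helper name (not part of curl or PGBrowser), and it assumes all field values can be treated as strings.

```php
<?php
// Hypothetical helper: generate a boundary that does not occur in any field name or value.
function make_boundary(array $fields){
  do {
    // 9-character prefix plus 32 hex characters, well under the 70-character limit
    $boundary = 'boundary-' . md5(uniqid('', true) . mt_rand());
    $collision = false;
    foreach($fields as $key => $value){
      if(strpos((string)$key, $boundary) !== false || strpos((string)$value, $boundary) !== false){
        $collision = true; // extremely unlikely, but try again if it happens
        break;
      }
    }
  } while($collision);
  return $boundary;
}

$data = array('foo' => '1', 'bar' => '2');
$boundary = make_boundary($data);
```

The resulting $boundary can be passed to multipart_build_query() and to the Content-Type header in the same way as a hand-picked one.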

Now make your curl post, but remember to set the content type:

$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: multipart/form-data; boundary=$boundary"));
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
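curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
$response = curl_exec($ch);
curl_close($ch);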

Of course, you could have just used PGBrowser to submit the form; it automatically detects when it needs to use multipart encoding. But that would be too easy.

Sunday, July 7, 2013

PHP - scrape a website with rotating proxies

Have you ever tried to scrape a website that imposes per-IP-address request limits? Let's take a look at how we can get around that in PHP by rotating through an array of proxies. This tutorial uses the PGBrowser scraping library.

First we will set up our proxies:

$proxies = array(
  'proxy1.com:80', 
  'proxy2.com:80', 
  'proxy3.com:80'
);

Next we want to define what a good response looks like. You might want to check the title or the full html for a specific string. For this tutorial, though, we're counting any page whose html is longer than 0 bytes as good.

function response_ok($page){
  return strlen($page->html) > 0;
}
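For a stricter check, you could look for a string that only appears on a good page. The following is a sketch under the assumption that $page->html holds the full response body as a string, as in response_ok() above; response_contains() and the marker text are hypothetical.

```php
<?php
// Hypothetical stricter check: the page must be non-empty and contain a known marker.
function response_contains($page, $needle){
  return strlen($page->html) > 0 && strpos($page->html, $needle) !== false;
}

// Stand-in page objects for illustration; with PGBrowser you would pass the page returned by get().
$good = (object) array('html' => '<html><title>Search results</title></html>');
$blocked = (object) array('html' => '<html><title>Rate limit exceeded</title></html>');

var_dump(response_contains($good, 'Search results'));    // bool(true)
var_dump(response_contains($blocked, 'Search results')); // bool(false)
```

A marker check like this catches block pages that still return some html, which a plain length check would accept.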

Now instead of using $browser->get(), we are going to use a custom get() function that checks responses and rotates proxies as needed:

function get($url){
  global $browser, $proxies;
  $page = $browser->get($url);
  
  while(!response_ok($page)){
    $proxy = array_shift($proxies); # grab the first proxy
    array_push($proxies, $proxy); # push it back to the end
    echo "switching proxy to $proxy\n";
    list($host,$port) = explode(':', $proxy);
    $browser->setProxy($host, $port);
    $page = $browser->get($url);
  }
  return $page;
}
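Note that the loop above never gives up if every proxy keeps returning a bad page. One way to bound it is to cap the number of proxy switches. This is a rough sketch only: get_with_limit() and $max_switches are hypothetical names, the length check is inlined so the block stands alone, and FakeBrowser is a stand-in used purely so the example runs without a network. It mimics just the get() and setProxy() calls used above.

```php
<?php
// Hypothetical variant of get() that stops after $max_switches proxy changes.
function get_with_limit($browser, array &$proxies, $url, $max_switches = 10){
  $page = $browser->get($url);
  $switches = 0;
  while(strlen($page->html) == 0){
    if($switches >= $max_switches || count($proxies) == 0){
      return null; // give up: no proxy produced a good response
    }
    $proxy = array_shift($proxies); // grab the first proxy
    array_push($proxies, $proxy);   // push it back to the end
    list($host, $port) = explode(':', $proxy);
    $browser->setProxy($host, $port);
    $page = $browser->get($url);
    $switches++;
  }
  return $page;
}

// Stand-in browser for illustration only: pages are empty until proxy2.com is selected.
class FakeBrowser {
  public $host = null;
  function setProxy($host, $port){ $this->host = $host; }
  function get($url){
    return (object) array('html' => $this->host === 'proxy2.com' ? '<html>ok</html>' : '');
  }
}

$proxies = array('proxy1.com:80', 'proxy2.com:80', 'proxy3.com:80');
$page = get_with_limit(new FakeBrowser(), $proxies, 'http://www.example.com/');
```

The caller then has to handle a null return, for example by logging the url and moving on.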

That's it. Now we just call get() the way we would ordinarily call $browser->get():

$browser = new PGBrowser();
$page = get('http://myurl.com'); # scrape with rotating proxy support