Monday, August 18, 2014

How to Find Good Free Site-Specific Proxies

I'm always looking for a source for good free proxies for my scraping projects. Have you ever tried scraping a site that blocks by geography or limits requests by IP addresses? If so then you will appreciate this post.

Having recently stumbled upon Jrim's Proxy Multiply, it's now my go-to tool for grabbing some quick free proxies. I just hit the "Get proxies!" button and within an hour or two I've got 3,000 tested working proxies.



And if you need some site-specific proxies, it can do that too. Just follow these steps:
  • open the Configure menu
  • go to HTTP Proxy Testing
  • check Check proxies against a specific url and click configure
  • in the dialog that opens, put the homepage (http://www.manta.com, http://www.yelp.com, etc) in the Url to Navigate to textfield
  • in the Content to look for in the page you want to put any text that will always show up on a page (I like to use the google-analytics id)

Thursday, November 28, 2013

How to Import Configurable Product Data into Magento

This is really part 2 of a 2 part tutorial on scraping configurable product data, which began here. For those people not interested in the scraping aspect of this project, you can download the magento sample data here. Warning, this sample data might be NSFW if you have an uptight boss.

The first thing we're going to need to do is set up our categories. Open the products spreadsheet in Excel or something similar and Highlight the _category column (column L) and copy that. Bring that into a text editor and sort the values / remove duplicates. Each unique needs to be a category in Magento or you will get an import error. When you create the the categories, set them to 'Enabled' so they show up in your storefront. Drag all your categories into the "Default Category" root category after you make them.

Next we need to set up the attributes. Copy columns _super_attribute_code and _super_attribute_option (S and T) into your text editor and sort/remove duplicates. You'll see unique values for 2 attributes, color and size. These both need to be set up in Magento.

So go to Catalog / Attributes/ Manage Attributes, create a new color attribute if it doesn't exist and set:

Scope - Global
Catalog Input Type for Store Owner - Dropdown
Use To Create Configurable Product - Yes

Next click on Manage Label / Options and add your color options. Then do the same for size.

Now you're all set for a clean import. Head over to System - Import / Export - Import, select Products, and upload the spreadsheet. If you are using Magent Go you can import the Product Images in the same way. Otherwise you'll want to ftp those to your media folder.

Wednesday, November 20, 2013

Scrape a Website to Magento Configurable Product Import Format

Today I'm going to show how to scrape store products and export them to Magento's import format and keep the configurable product options that are associated. Like most things that involve Magento, this required a lot of patience and trial and error.

The goal's of the project are:
  • Learn how to scrape ecommerce data to Magento's configurable product import format
  • Get some sexy Magento sample store data for use in future testing and mock-ups
The project will use 2 libraries, a simple CSV class, and PGBrowser for the scraping. For the sake of simplicity (or not, depending on your point of view) I will use xpath expeessions to get the data rather than use Simple Html Dom or Phpquery. The full source for this project can be downloaded here.

Let's go over some of the code. First we instantiate our CSV object (yes, it's a global variable. I'm okay with that.) Then we load the listings page and iterate through each listing. Pretty self explanatory so far.


$csv = new CSV('products.csv', $fields, ",", null); // no utf-8 BOM

// and start scraping
$url = 'http://www.spicylingerie.com/';
$page = $browser->get($url);

foreach($page->search('//div[@class="fp-pro-name"]/a') as $a){
  scrape($a);
  echo '.';
}

So now we pass the a elements that have the details page urls to our scrape function. Because we earlier did $browser->convertUrls = true we no longer need to worry about converting our relative hrefs to absolute urls. The library took care of that for us.

Now we get the page for the link and start building our $item array which we will pass to the
save() function. Other than the ugly expression for description this was easy.


$url = $a->getAttribute('href');

$page = $browser->get($url);
$item = array();
$item['name'] = trim($a->nodeValue);
$item['description'] = $item['short_description'] = trim($page->at('//div[@class="pro-det-head"]/h4/text()[normalize-space()][position()=last()]')->nodeValue);

if(!preg_match('/Sale price: \$(\d+\.\d{2})/', $page->body, $m)) die('missing price!');
$item['price'] = $m[1];

if(!preg_match('/Style# : ([\w-]+)/', $page->body, $m)) die('missing sku!');
$item['sku'] = $m[1];
Next we save the image, for later import/upload - identify the categories we care about - and construct our items. The options need to look like:

$options = array(
 array('size' => '12', 'color' => 'purple'),
 array('size' => '10', 'color' => 'yellow')
);

Where the array keys are the attributes that you have made configurable product attributes (Global, Dropdown, Is used in Configurable Products)

That's all there is to it. I won't go into the save function because hopefully that one will just work for you.

Tuesday, July 9, 2013

Php Curl Multipart form posting

Let's say you want to post some data to a form using curl:

$url = 'http://www.example.com/';
$data = array('foo' => '1', 'bar' => '2');

Ordinarily you create the post body using http_build_query(). But let's say the form is expecting the form data to be multipart encoded. Now we've got a challenge. First let's create a function to do the multipart encoding:

function multipart_build_query($fields, $boundary){
  $retval = '';
  foreach($fields as $key => $value){
    $retval .= "--$boundary\nContent-Disposition: form-data; name=\"$key\"\n\n$value\n";
  }
  $retval .= "--$boundary--";
  return $retval;
}

The boundary is a string that separates the fields. It can be any string you want but you should choose something long enough that it won't randomly show up in your data.

$boundary = '--myboundary-xxx';
$body = multipart_build_query($data, $boundary);


Now make your curl post, but remember to set the content type:
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: multipart/form-data; boundary=$boundary"));
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
$response = curl_exec($ch);

Of course, you could have just used PGBrowser to submit the form, and it will automatically detect when it needs to use multipart encoding. But that would be too easy.

Sunday, July 7, 2013

Php - scrape website with rotating proxies

Have you ever tried to scrape a website that imposes per-ip-address request limits? Let's take a look at how we can get around that in php by rotating through an array of proxies. This tutorial uses the PGBrowser scraping library.

First we will set up our proxies:

$proxies = array(
  'proxy1.com:80', 
  'proxy2.com:80', 
  'proxy3.com:80'
);

Next we want to define what a good response looks like. You might want to check the title or the full html for a specific string. For this tutorial though we're counting any page that has html of length greater than 0 bytes.

function response_ok($page){
  return strlen($page->html) > 0
}

Now instead of using $browser->get(), we are going to use a custom get() function that checks responses and rotates proxies as needed:

function get($url){
  global $browser, $proxies;
  $page = $browser->get($url);
  
  while(!response_ok($page)){
    $proxy = array_shift($proxies); # grab the first proxy
    array_push($proxies, $proxy); # push it back to the end
    echo "switching proxy to $proxy\n";
    list($host,$port) = explode(':', $proxy);
    $browser->setProxy($host, $port);
    $page = $browser->get($url);
  }
  return $page;
}


That's it, now we just call get() in the way we would ordinarily call $browser->get():
$browser = new PGBrowser();
$page = get('http://myurl.com'); # scrape with rotating proxy support

Saturday, June 15, 2013

Ruby - keep track of urls you've already visited

How do I keep track or urls that I've already visit in my ruby projects? Here's what I do:

def visited? url
  @visited ||= []
  return true if @visited.include? url
  @visited << url
  false
end

Now I've got a visited? method that tells me if I've called it already on an url

visited?('http://www.google.com')
#=> false
visited?('http://www.google.com')
#=> true

That makes it easy to just do something like:

scrape(url) unless visited?(url)
This was a short post but this is something I always put in my ruby code and I'll be referencing it later.

Thursday, June 13, 2013

Replace ruby's URI parser with Addressable

Today I was trying to read an url in ruby that URI didn't like the url of:

require 'open-uri'
open 'http://foo_bar.baz.com/'

generic.rb:213:in `initialize': the scheme http does not accept registry part: foo_bar.baz.com (or bad hostname?) (URI::InvalidURIError)

D'oh! This is a valid url, but sometimes URI can be a little bit old-fashioned about what to accept.

The solution? Addressable is a more RFC-conformant replacement for URI. But how to get open-uri and other libs to use it?

After poking around the internet for an hour or so and not coming up with anything I settled on this:

require 'addressable/uri'

class URI::Parser
  def split url
    a = Addressable::URI::parse url
    [a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
  end
end

open 'http://foo_bar.baz.com/'

Yay! No parse error (obviously the url still won't open because I made it up.)

Notice that I threw away 2 parts of the url, registry and opaque. These are things that addressable doesn't have and I never see anyway so I don't expect it to be a problem.

I'll just start putting this bit of code into all my projects from now on and we'll have to wait and see if it creates any problems.