Wednesday, November 20, 2013

Scrape a Website to Magento Configurable Product Import Format

Today I'm going to show how to scrape a store's products and export them to Magento's import format while keeping their associated configurable product options. Like most things that involve Magento, this required a lot of patience and trial and error.

The goals of the project are:
  • Learn how to scrape ecommerce data to Magento's configurable product import format
  • Get some sexy Magento sample store data for use in future testing and mock-ups
The project will use two libraries: a simple CSV class and PGBrowser for the scraping. For the sake of simplicity (or not, depending on your point of view) I will use XPath expressions to get the data rather than use Simple Html Dom or Phpquery. The full source for this project can be downloaded here.
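
If you haven't used XPath from PHP before, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than PGBrowser. The markup and hrefs are made up for illustration; only the fp-pro-name class name is taken from the listing loop in this post.

```php
<?php
// Hypothetical listing markup; the real page's structure may differ.
$html = <<<HTML
<div class="fp-pro-name"><a href="/products/blue-dress.html"> Blue Dress </a></div>
<div class="fp-pro-name"><a href="/products/red-scarf.html">Red Scarf</a></div>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = array();
// Select every <a> that is a direct child of a div with the listing class.
foreach ($xpath->query('//div[@class="fp-pro-name"]/a') as $a) {
    $links[] = array(
        'name' => trim($a->nodeValue),
        'href' => $a->getAttribute('href'),
    );
}
print_r($links);
```

Each match is a plain DOMElement, which is why nodeValue and getAttribute() are all you need to read the link text and href.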

Let's go over some of the code. First we instantiate our CSV object (yes, it's a global variable, and I'm okay with that). Then we load the listings page and iterate through each listing. Pretty self-explanatory so far.

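$csv = new CSV('products.csv', $fields, ",", null); // no utf-8 BOM

// and start scraping
$url = '';
$page = $browser->get($url);

// each match is the <a> element that links to a product's details page
foreach($page->search('//div[@class="fp-pro-name"]/a') as $a){
  echo '.'; // progress indicator
  scrape($a); // hand the link to the scrape function
}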

So now we pass the a elements that hold the details page URLs to our scrape function. Because we set $browser->convertUrls = true earlier, we no longer need to worry about converting our relative hrefs to absolute URLs. The library took care of that for us.
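
To give an idea of what that conversion involves, here is a simplified sketch of resolving an href against the URL of the page it came from. This is not PGBrowser's actual code: resolve_url() is a hypothetical helper, the example.com URLs are placeholders, and the sketch deliberately ignores cases such as ../ segments and query-only hrefs.

```php
<?php
// Hypothetical helper. Assumes $base is an absolute URL with a scheme and host.
function resolve_url($base, $href) {
    // Already absolute (starts with a scheme such as http:// or https://).
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    // Protocol-relative: reuse the scheme of the base URL.
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }
    // Root-relative: keep only the scheme, host and port of the base URL.
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }
    // Otherwise resolve against the directory part of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolve_url('http://example.com/shop/list.html', 'item/42.html'), "\n";
echo resolve_url('http://example.com/shop/list.html', '/item/42.html'), "\n";
```

Having the library do this once at load time means every href you read later is already safe to pass straight back to $browser->get().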

Now we get the page for the link and start building our $item array, which we will pass to the save() function. Other than the ugly expression for description, this was easy.

$url = $a->getAttribute('href');

$page = $browser->get($url);
$item = array();
$item['name'] = trim($a->nodeValue);
$item['description'] = $item['short_description'] = trim($page->at('//div[@class="pro-det-head"]/h4/text()[normalize-space()][position()=last()]')->nodeValue);

if(!preg_match('/Sale price: \$(\d+\.\d{2})/', $page->body, $m)) die('missing price!');
$item['price'] = $m[1];

if(!preg_match('/Style# : ([\w-]+)/', $page->body, $m)) die('missing sku!');
$item['sku'] = $m[1];
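
If you want to try those two patterns without hitting the site, here is a small sketch that runs them against a made-up string. The sample text, the SKU and the extract_field() helper are all hypothetical; the patterns themselves are the ones used above.

```php
<?php
// Hypothetical page text; the real $page->body is a full HTML document.
$body = 'Style# : AB-1234 Regular price: $59.99 Sale price: $39.95';

// Hypothetical helper: first capture group, or null when the pattern is absent.
function extract_field($pattern, $body) {
    return preg_match($pattern, $body, $m) ? $m[1] : null;
}

$price   = extract_field('/Sale price: \$(\d+\.\d{2})/', $body);
$sku     = extract_field('/Style# : ([\w-]+)/', $body);
$missing = extract_field('/Weight: (\d+)/', $body); // not present in the sample

var_dump($price, $sku, $missing);
```

Returning null instead of calling die() is a design choice: it lets the caller decide whether to skip a single product or stop the whole run.
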
Next we save the image for later import or upload, identify the categories we care about, and construct our items. The options need to look like:

$options = array(
 array('size' => '12', 'color' => 'purple'),
 array('size' => '10', 'color' => 'yellow')
);

Here the array keys are the attributes that you have set up as configurable product attributes (scope Global, input type Dropdown, used in configurable products).
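
To make that shape more concrete, here is a rough sketch of how such an options array could be expanded into import rows: one simple product per variant, followed by rows for the configurable parent that link to them. This is not the save() function from the download. build_rows() and the child SKU scheme are hypothetical, and the underscore-prefixed column names are my assumption about what Magento 1.x's import expects, so compare them with an export from your own store before relying on them.

```php
<?php
// Sketch only. Column names starting with "_" are assumptions about the
// Magento 1.x import format, not something taken from the original source.
function build_rows(array $item, array $options) {
    $simples = array();
    $links   = array();
    foreach ($options as $option) {
        // Hypothetical child SKU scheme: parent SKU plus each option value.
        $childSku = $item['sku'] . '-' . implode('-', $option);
        // One simple product per variant, carrying its own attribute values.
        $simples[] = array('sku' => $childSku, '_type' => 'simple',
            'name' => $item['name'], 'price' => $item['price']) + $option;
        foreach ($option as $code => $value) {
            $links[] = array(
                '_super_products_sku'     => $childSku,
                '_super_attribute_code'   => $code,
                '_super_attribute_option' => $value,
            );
        }
    }
    $parentRows = array();
    foreach ($links as $i => $link) {
        // Only the first parent row names the product; the rest just add links.
        $head = ($i === 0)
            ? array('sku' => $item['sku'], '_type' => 'configurable',
                    'name' => $item['name'], 'price' => $item['price'])
            : array();
        $parentRows[] = $head + $link;
    }
    return array_merge($simples, $parentRows);
}

$item = array('sku' => 'AB-1234', 'name' => 'Blue Dress', 'price' => '39.95');
$options = array(
    array('size' => '12', 'color' => 'purple'),
    array('size' => '10', 'color' => 'yellow'),
);
$rows = build_rows($item, $options);
print_r($rows);
```

The rows come back with different sets of keys, so before writing them out each one would still need to be laid out against the full $fields header, leaving the missing columns blank.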

That's all there is to it. I won't go into the save function because hopefully that one will just work for you.


  1. Hey, nice source for us. Thanks for sharing your thoughts on scraping a website in a simple way and for uploading this code. This blog is very informative and I will definitely bookmark it.

    Web Scraping Software

  2. Thank you for sharing. For people with some technical knowledge it will be very valuable, but for ordinary store owners who do not know where to paste a piece of code or change some lines, here is also an article that might be helpful for importing Magento configurable products without coding -

  3. Hello, thanks for the article. I am a beginner Magento programmer and don't understand where this code goes, so can you give some help with that? Also, I cannot download the source. The link leads me to an XML file that says access denied.