Thursday, November 28, 2013

How to Import Configurable Product Data into Magento

This is really part 2 of a 2-part tutorial on scraping configurable product data, which began here. If you're not interested in the scraping aspect of the project, you can download the Magento sample data here. Warning: this sample data might be NSFW if you have an uptight boss.

The first thing we need to do is set up our categories. Open the products spreadsheet in Excel or something similar, highlight the _category column (column L), and copy it. Paste it into a text editor and sort the values / remove duplicates. Each unique value needs to exist as a category in Magento or you will get an import error. When you create the categories, set them to 'Enabled' so they show up in your storefront, and drag them all into the "Default Category" root category after you make them.
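
If you'd rather not do the sorting by hand, a quick throwaway script can pull the unique values out of the export for you. A minimal sketch, assuming the file is the same products.csv the scraper writes and that _category really is column L (index 11); the same trick works for the attribute columns S and T in the next step, just change the index:

$categories = array();
$fh = fopen('products.csv', 'r');
fgetcsv($fh); // skip the header row
while(($row = fgetcsv($fh)) !== false){
  if(!empty($row[11])) $categories[$row[11]] = true; // column L = index 11
}
fclose($fh);
foreach(array_keys($categories) as $category) echo "$category\n";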

Next we need to set up the attributes. Copy columns _super_attribute_code and _super_attribute_option (S and T) into your text editor and sort/remove duplicates. You'll see unique values for 2 attributes, color and size. These both need to be set up in Magento.

So go to Catalog / Attributes / Manage Attributes, create a new color attribute if it doesn't already exist, and set:

Scope - Global
Catalog Input Type for Store Owner - Dropdown
Use To Create Configurable Product - Yes

Next click on Manage Label / Options and add your color options. Then do the same for size.

Now you're all set for a clean import. Head over to System - Import / Export - Import, select Products, and upload the spreadsheet. If you are using Magento Go you can import the product images in the same way. Otherwise you'll want to FTP those to your media folder.

Wednesday, November 20, 2013

Scrape a Website to Magento Configurable Product Import Format

Today I'm going to show how to scrape store products and export them to Magento's import format, keeping the associated configurable product options intact. Like most things that involve Magento, this required a lot of patience and trial and error.

The goals of the project are:
  • Learn how to scrape ecommerce data to Magento's configurable product import format
  • Get some sexy Magento sample store data for use in future testing and mock-ups

The project uses 2 libraries: a simple CSV class and PGBrowser for the scraping. For the sake of simplicity (or not, depending on your point of view) I will use xpath expressions to get the data rather than Simple Html Dom or phpQuery. The full source for this project can be downloaded here.

Let's go over some of the code. First we instantiate our CSV object (yes, it's a global variable. I'm okay with that.) Then we load the listings page and iterate through each listing. Pretty self-explanatory so far.


$csv = new CSV('products.csv', $fields, ",", null); // no utf-8 BOM

// and start scraping
$url = 'http://www.spicylingerie.com/';
$page = $browser->get($url);

foreach($page->search('//div[@class="fp-pro-name"]/a') as $a){
  scrape($a);
  echo '.';
}

So now we pass the a elements that contain the detail page urls to our scrape function. Because we set $browser->convertUrls = true earlier, we no longer need to worry about converting relative hrefs to absolute urls; the library takes care of that for us.
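
For reference, the browser setup those snippets assume looks roughly like this (a sketch; the CSV class, the $fields list, and the rest are in the downloadable source):

require 'pgbrowser.php';

$browser = new PGBrowser();
$browser->convertUrls = true; // rewrite relative hrefs to absolute urls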

Now we get the page for the link and start building our $item array, which we will pass to the save() function. Other than the ugly xpath expression for the description, this was easy.


$url = $a->getAttribute('href');

$page = $browser->get($url);
$item = array();
$item['name'] = trim($a->nodeValue);
$item['description'] = $item['short_description'] = trim($page->at('//div[@class="pro-det-head"]/h4/text()[normalize-space()][position()=last()]')->nodeValue);

if(!preg_match('/Sale price: \$(\d+\.\d{2})/', $page->body, $m)) die('missing price!');
$item['price'] = $m[1];

if(!preg_match('/Style# : ([\w-]+)/', $page->body, $m)) die('missing sku!');
$item['sku'] = $m[1];

Next we save the image for later import/upload, identify the categories we care about, and construct our items. The options need to look like this:

$options = array(
 array('size' => '12', 'color' => 'purple'),
 array('size' => '10', 'color' => 'yellow')
);

Here the array keys are the attributes that you have set up as configurable product attributes (Global, Dropdown, Use To Create Configurable Product).

That's all there is to it. I won't go into the save function because hopefully that one will just work for you.
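
Still, to make the mapping concrete, here's a rough illustration (not the actual save() from the download) of how each entry in the options array becomes a pair of _super_attribute_code / _super_attribute_option values on the import sheet:

// illustration only: expand the options into super attribute code/option pairs
function super_attribute_pairs($options){
  $pairs = array();
  foreach($options as $option){
    foreach($option as $code => $value){ // e.g. 'size' => '12'
      $pairs[] = array(
        '_super_attribute_code' => $code,
        '_super_attribute_option' => $value
      );
    }
  }
  return $pairs;
}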

Tuesday, July 9, 2013

Php Curl Multipart form posting

Let's say you want to post some data to a form using curl:

$url = 'http://www.example.com/';
$data = array('foo' => '1', 'bar' => '2');

Ordinarily you create the post body using http_build_query(). But let's say the form is expecting the form data to be multipart encoded. Now we've got a challenge. First let's create a function to do the multipart encoding:

function multipart_build_query($fields, $boundary){
  $retval = '';
  foreach($fields as $key => $value){
    // one part per field, separated by the boundary; the multipart spec wants CRLF line endings
    $retval .= "--$boundary\r\nContent-Disposition: form-data; name=\"$key\"\r\n\r\n$value\r\n";
  }
  $retval .= "--$boundary--"; // closing boundary
  return $retval;
}

The boundary is a string that separates the fields. It can be any string you want but you should choose something long enough that it won't randomly show up in your data.

$boundary = '--myboundary-xxx';
$body = multipart_build_query($data, $boundary);
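
If you'd rather not pick the boundary by hand, a randomly generated one works just as well, for example:

$boundary = 'boundary-' . md5(uniqid('', true)); // very unlikely to collide with anything in the field data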


Now make your curl post, but remember to set the content type:
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: multipart/form-data; boundary=$boundary"));
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of echoing it
$response = curl_exec($ch);
curl_close($ch);

Of course, you could have just used PGBrowser to submit the form, and it will automatically detect when it needs to use multipart encoding. But that would be too easy.
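
In case you're curious, the PGBrowser version really is only a few lines. A sketch, assuming $url serves a page containing the form you want to submit:

require 'pgbrowser.php';

$browser = new PGBrowser();
$page = $browser->get($url);
$form = $page->form();      // same form API as in the cache post below
$form->set('foo', '1');
$form->set('bar', '2');
$page = $form->submit();    // multipart vs urlencoded is detected for you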

Sunday, July 7, 2013

Php - scrape website with rotating proxies

Have you ever tried to scrape a website that imposes per-IP-address request limits? Let's take a look at how we can get around that in php by rotating through an array of proxies. This tutorial uses the PGBrowser scraping library.

First we will set up our proxies:

$proxies = array(
  'proxy1.com:80', 
  'proxy2.com:80', 
  'proxy3.com:80'
);

Next we want to define what a good response looks like. You might want to check the title or the full html for a specific string. For this tutorial, though, we'll count any page whose html is longer than 0 bytes as a good response.

function response_ok($page){
  return strlen($page->html) > 0; // any non-empty response counts as good
}
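
If an empty-body check isn't strict enough for your target site, checking for a marker string works the same way. For example (the marker here is made up; use something that only appears on a good response):

function response_ok($page){
  // treat the response as good only if the real content made it through
  return strpos($page->html, 'Add to cart') !== false;
}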

Now instead of using $browser->get(), we are going to use a custom get() function that checks responses and rotates proxies as needed:

function get($url){
  global $browser, $proxies;
  $page = $browser->get($url);
  
  while(!response_ok($page)){
    $proxy = array_shift($proxies); # grab the first proxy
    array_push($proxies, $proxy); # push it back to the end
    echo "switching proxy to $proxy\n";
    list($host,$port) = explode(':', $proxy);
    $browser->setProxy($host, $port);
    $page = $browser->get($url);
  }
  return $page;
}


That's it. Now we just call get() the way we would ordinarily call $browser->get():
$browser = new PGBrowser();
$page = get('http://myurl.com'); # scrape with rotating proxy support

Saturday, June 15, 2013

Ruby - keep track of urls you've already visited

How do I keep track of urls that I've already visited in my ruby projects? Here's what I do:

def visited? url
  @visited ||= []
  return true if @visited.include? url
  @visited << url
  false
end

Now I've got a visited? method that tells me whether I've already called it on a given url:

visited?('http://www.google.com')
#=> false
visited?('http://www.google.com')
#=> true

That makes it easy to just do something like:

scrape(url) unless visited?(url)

This was a short post, but it's something I always put in my ruby code and I'll be referencing it later.

Thursday, June 13, 2013

Replace ruby's URI parser with Addressable

Today I was trying to open a url in ruby that URI didn't like:

require 'open-uri'
open 'http://foo_bar.baz.com/'

generic.rb:213:in `initialize': the scheme http does not accept registry part: foo_bar.baz.com (or bad hostname?) (URI::InvalidURIError)

D'oh! The url works fine in a browser (underscores in hostnames are allowed by RFC 3986's generic syntax), but URI's default parser is a little old-fashioned about what it accepts.

The solution? Addressable is a more RFC-conformant replacement for URI. But how to get open-uri and other libs to use it?

After poking around the internet for an hour or so and not coming up with anything I settled on this:

require 'addressable/uri'

class URI::Parser
  def split url
    a = Addressable::URI::parse url
    # URI's split order: scheme, userinfo, host, port, registry, path, opaque, query, fragment
    [a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
  end
end

open 'http://foo_bar.baz.com/'

Yay! No parse error (obviously the url still won't open because I made it up.)

Notice that I threw away 2 parts of the url, registry and opaque. Addressable doesn't have them and I never see them anyway, so I don't expect it to be a problem.

I'll just start putting this bit of code into all my projects from now on and we'll have to wait and see if it creates any problems.

Saturday, April 20, 2013

Finding a good proxy provider

The other day I was looking for a good free or cheap UK proxy for a scraper script I was writing for a client. My go-to proxy was an ec2 micro instance in the Ireland region, but for some reason this wasn't good enough, so I went hunting around for one based in England.

I found a lot of free listings (xroxy.com), but all the http/s proxies on the list were either down or not in the UK as they claimed.

Then I found these guys, $4 for a month - I decided to take a chance.

I'm happy with the experience: a fast connection that doesn't get blocked by some of the sites that block ec2 traffic.

Watch out for the recurring charge though. I noticed that they will keep charging me until I cancel.

Oh yeah, I am not affiliated in any way, blah blah blah.

Monday, April 15, 2013

Scrape a Website in Php with a Network Cache

Sometimes it's helpful to use cached responses in a scraping project. Like when you're running a long scrape job and you're afraid the script will crash halfway through, or when you're fine-tuning a css selector or xpath expression 5 requests into your script and are losing productivity to the 30-second delay.

Now you can avoid those delays with PGBrowser's useCache property:

$b = new PGBrowser();
$b->useCache = true;

Let's try it out with a test I like to use for the forms functionality:

require 'pgbrowser.php';
$b = new PGBrowser();
$b->useCache = true;
$page = $b->get('http://www.google.com/');
$form = $page->form();
$form->set('q', 'foo');
$page = $form->submit();
echo preg_match('/foo - /', $page->title) ? 'success' : 'failure';

Running this script the first time takes about 6 seconds. The responses get saved in a folder called cache, and the next time you run it, it should only take about 1 second.
View the project or download the latest source.

Sunday, March 10, 2013

PGBrowser plus phpQuery or Simple Html Dom

Just dropping a quick notice that PGBrowser now lets you query pages with css selectors when used with phpquery or simple-html-dom:

require 'pgbrowser.php';
require 'phpquery.php';

$browser = new PGBrowser('phpquery');
$page = $browser->get('http://www.google.com/search?q=php');
foreach($page->search('li.g') as $li){
  echo $page->at('a', $li)->text() . "\n";
}

Saturday, March 9, 2013

Ruby: Rate limiting concurrent downloads

Yesterday an interesting question was posed on stackoverflow: how do you ensure your script doesn't scrape a website or API too fast when making concurrent requests? Like so many interesting questions, this one was deemed to be not a real question and closed by moderators. So today I'll share my thoughts on the subject here. Let's avoid the complication of using EventMachine for this one, which, I could argue, creates as many problems as it solves.

First we're going to set up a queue and some variables. We'll use open-uri for the downloads to make it easy:
require 'open-uri'

queue = [
'http://www.google.com/',
'http://www.bing.com/',
'http://www.yahoo.com/',
'http://www.wikipedia.com/',
'http://www.amazon.com/'
]

num_threads = 3 # more is better, memory permitting
delay_per_request = 1 # in seconds

Next we create our threads and give them something to do. In a real script you'll need them to do something interesting but for this purpose they will just print out the url and response body size:
threads = []

num_threads.times do
  threads << Thread.new do
    Thread.exit unless url = queue.pop
    puts "#{url} is #{open(url).read.length} bytes long"
  end
end

Now that we have our threads we want to 'join' them. We also want to time them to see how long they took:
start = Time.now
threads.each{|t| t.join}
elapsed = Time.now - start

If they finished too quickly we need to take a short nap; otherwise we're free to continue processing the queue:
time_to_sleep = num_threads * delay_per_request - elapsed
if time_to_sleep > 0
  puts "sleeping for #{time_to_sleep} seconds"
  sleep time_to_sleep
end

Ok, so now it's time to put it all together and process the queue in a loop.
require 'open-uri'

queue = [
'http://www.google.com/',
'http://www.bing.com/',
'http://www.yahoo.com/',
'http://www.wikipedia.com/',
'http://www.amazon.com/'
]

num_threads = 3 # more is better, memory permitting
delay_per_request = 1 # in seconds

until queue.empty?
  threads = []

  num_threads.times do
    threads << Thread.new do
      Thread.exit unless url = queue.pop
      puts "#{url} is #{open(url).read.length} bytes long"
    end
  end

  start = Time.now
  threads.each{|t| t.join}
  elapsed = Time.now - start

  time_to_sleep = num_threads * delay_per_request - elapsed
  if time_to_sleep > 0
    puts "sleeping for #{time_to_sleep} seconds"
    sleep time_to_sleep
  end
end

If you found this useful, let me know.

Tuesday, January 22, 2013

Give php strings easy to remember regex functionality

Do you ever get tired of looking up preg functions because you forgot the order of the arguments? Me too. Is it pattern, replacement, subject? Or subject, pattern, replacement?
Save yourself the headache and just include Phpstr:

require 'phpstr.php';

Return a match
str('There are 23 people reading this blog')->match('/\d+/');

Substitution
str('all of the es')->gsub('/e/', 'y');

Scan will return an array of matches
str('010 202 312 332')->scan('/\d+/');

Split will return an array of tokens
str('010 202-312 332')->split('/\s/');

Isn't that so much easier?
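
For comparison, here are the plain preg_ versions of the same calls, argument orders and all:

preg_match('/\d+/', 'There are 23 people reading this blog', $m); // match ends up in $m[0]
preg_replace('/e/', 'y', 'all of the es');                        // substitution
preg_match_all('/\d+/', '010 202 312 332', $m);                   // matches end up in $m[0]
preg_split('/\s/', '010 202-312 332');                            // array of tokens
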
View the project or download the source.