This is really part 2 of a 2-part tutorial on scraping configurable product data, which began here. For those not interested in the scraping aspect of this project, you can download the Magento sample data here. Warning: this sample data might be NSFW if you have an uptight boss.
The first thing we need to do is set up our categories. Open the products spreadsheet in Excel or something similar, highlight the _category column (column L) and copy it. Paste it into a text editor and sort the values / remove duplicates. Each unique value needs to be a category in Magento or you will get an import error. When you create the categories, set them to 'Enabled' so they show up in your storefront. Drag all your categories into the "Default Category" root category after you make them.
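If you'd rather script that step than eyeball it in Excel, here is a quick sketch. It assumes the spreadsheet from part 1 is saved as products.csv and that _category is still column L (index 11 when counting from zero):
// Print the unique values of the _category column (column L = index 11).
$categories = array();
$fh = fopen('products.csv', 'r');
fgetcsv($fh); // skip the header row
while(($row = fgetcsv($fh)) !== false){
    if(!empty($row[11])) $categories[$row[11]] = true;
}
fclose($fh);
$categories = array_keys($categories);
sort($categories);
echo implode("\n", $categories) . "\n";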
Next we need to set up the attributes. Copy columns _super_attribute_code and _super_attribute_option (S and T) into your text editor and sort/remove duplicates. You'll see unique values for 2 attributes, color and size. These both need to be set up in Magento.
So go to Catalog / Attributes / Manage Attributes, create a new color attribute if it doesn't already exist, and set:
Scope - Global
Catalog Input Type for Store Owner - Dropdown
Use To Create Configurable Product - Yes
Next click on Manage Label / Options and add your color options. Then do the same for size.
Now you're all set for a clean import. Head over to System - Import / Export - Import, select Products, and upload the spreadsheet. If you are using Magento Go you can import the product images in the same way. Otherwise you'll want to FTP those to your media folder.
Thursday, November 28, 2013
Wednesday, November 20, 2013
Scrape a Website to Magento Configurable Product Import Format
Today I'm going to show you how to scrape store products, export them to Magento's import format, and keep the associated configurable product options. Like most things that involve Magento, this required a lot of patience and trial and error.
The goals of the project are:
- Learn how to scrape ecommerce data to Magento's configurable product import format
- Get some sexy Magento sample store data for use in future testing and mock-ups
Let's go over some of the code. First we instantiate our CSV object (yes, it's a global variable; I'm okay with that). Then we load the listings page and iterate through each listing. Pretty self-explanatory so far.
$csv = new CSV('products.csv', $fields, ",", null); // no utf-8 BOM

// and start scraping
$url = 'http://www.spicylingerie.com/';
$page = $browser->get($url);
foreach($page->search('//div[@class="fp-pro-name"]/a') as $a){
    scrape($a);
    echo '.';
}
So now we pass the a elements that have the detail page urls to our scrape function. Because we earlier set $browser->convertUrls = true, we no longer need to worry about converting our relative hrefs to absolute urls. The library took care of that for us.
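For context, the browser setup isn't shown in this post; a minimal sketch of what it might look like, based on how PGBrowser is used in the other posts on this blog, is:
require 'pgbrowser.php';

$browser = new PGBrowser();   // default DOM parser, so nodes expose nodeValue / getAttribute
$browser->convertUrls = true; // relative hrefs come back converted to absolute urls
$browser->useCache = true;    // optional: cache responses while you fine-tune your xpath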
Now we get the page for the link and start building our $item array, which we will pass to the save() function. Other than the ugly expression for description this was easy.
$url = $a->getAttribute('href');
$page = $browser->get($url);
$item = array();
$item['name'] = trim($a->nodeValue);
$item['description'] = $item['short_description'] = trim($page->at('//div[@class="pro-det-head"]/h4/text()[normalize-space()][position()=last()]')->nodeValue);
if(!preg_match('/Sale price: \$(\d+\.\d{2})/', $page->body, $m)) die('missing price!');
$item['price'] = $m[1];
if(!preg_match('/Style# : ([\w-]+)/', $page->body, $m)) die('missing sku!');
$item['sku'] = $m[1];
Next we save the image for later import/upload, identify the categories we care about, and construct our items. The options need to look like:
$options = array(
array('size' => '12', 'color' => 'purple'),
array('size' => '10', 'color' => 'yellow')
);
Where the array keys are the attributes that you have made configurable product attributes (Global scope, Dropdown input, Use To Create Configurable Product: Yes).
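As for actually collecting those combinations, here is one possible sketch. The xpath and the select names ('size', 'color') are placeholders, not the site's real markup, so adjust them to whatever the detail pages actually use:
// Hypothetical: pair up every size and color offered on the detail page.
$options = array();
foreach($page->search('//select[@name="size"]/option') as $sizeOption){
    foreach($page->search('//select[@name="color"]/option') as $colorOption){
        $options[] = array(
            'size' => trim($sizeOption->nodeValue),
            'color' => trim($colorOption->nodeValue)
        );
    }
}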
That's all there is to it. I won't go into the save function because hopefully that one will just work for you.
Tuesday, July 9, 2013
Php Curl Multipart form posting
Let's say you want to post some data to a form using curl:
$url = 'http://www.example.com/';
$data = array('foo' => '1', 'bar' => '2');
Ordinarily you create the post body using http_build_query(). But let's say the form is expecting the form data to be multipart encoded. Now we've got a challenge. First let's create a function to do the multipart encoding:
function multipart_build_query($fields, $boundary){
    $retval = '';
    foreach($fields as $key => $value){
        // each part is separated by the boundary; the multipart spec calls for CRLF line endings
        $retval .= "--$boundary\r\nContent-Disposition: form-data; name=\"$key\"\r\n\r\n$value\r\n";
    }
    $retval .= "--$boundary--";
    return $retval;
}
The boundary is a string that separates the fields. It can be any string you want but you should choose something long enough that it won't randomly show up in your data.
$boundary = '--myboundary-xxx';
$body = multipart_build_query($data, $boundary);
Now make your curl post, but remember to set the content type:
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: multipart/form-data; boundary=$boundary"));
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
$response = curl_exec($ch);
Of course, you could have just used PGBrowser to submit the form, and it will automatically detect when it needs to use multipart encoding. But that would be too easy.
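For reference, here is a sketch of that easier PGBrowser route, borrowing the form API from the caching post further down. It assumes the page at $url actually contains a form with fields named foo and bar:
require 'pgbrowser.php';

$b = new PGBrowser();
$page = $b->get($url);
$form = $page->form();   // first form on the page
$form->set('foo', '1');
$form->set('bar', '2');
$page = $form->submit(); // multipart encoding is detected automatically when needed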
Sunday, July 7, 2013
Php - scrape website with rotating proxies
Have you ever tried to scrape a website that imposes per-ip-address request limits? Let's take a look at how we can get around that in php by rotating through an array of proxies. This tutorial uses the PGBrowser scraping library.
First we will set up our proxies:
$proxies = array(
    'proxy1.com:80',
    'proxy2.com:80',
    'proxy3.com:80'
);
Next we want to define what a good response looks like. You might want to check the title or the full html for a specific string. For this tutorial though we're counting any page that has html of length greater than 0 bytes.
function response_ok($page){
    return strlen($page->html) > 0;
}
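As mentioned above, you will usually want something stricter than a length check. A sketch of one variant, where the 'Access Denied' marker is just a stand-in for whatever your target site returns when it blocks you:
// Alternative check: reject empty responses and anything that looks like a block page.
function response_ok($page){
    if(strlen($page->html) == 0) return false;
    return stripos($page->html, 'Access Denied') === false;
}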
Now instead of using $browser->get(), we are going to use a custom get() function that checks responses and rotates proxies as needed:
function get($url){
    global $browser, $proxies;
    $page = $browser->get($url);
    while(!response_ok($page)){          # keeps retrying (and rotating) until a proxy returns a good response
        $proxy = array_shift($proxies);  # grab the first proxy
        array_push($proxies, $proxy);    # push it back to the end
        echo "switching proxy to $proxy\n";
        list($host, $port) = explode(':', $proxy);
        $browser->setProxy($host, $port);
        $page = $browser->get($url);
    }
    return $page;
}
That's it, now we just call get() in the way we would ordinarily call $browser->get():
$browser = new PGBrowser();
$page = get('http://myurl.com'); # scrape with rotating proxy support
Saturday, June 15, 2013
Ruby - keep track of urls you've already visited
How do I keep track of urls that I've already visited in my ruby projects? Here's what I do:
def visited? url
  @visited ||= []
  return true if @visited.include? url
  @visited << url
  false
end
Now I've got a visited? method that tells me if I've called it already on an url:
visited?('http://www.google.com') #=> false
visited?('http://www.google.com') #=> true
That makes it easy to just do something like:
scrape(url) unless visited?(url)
This was a short post but this is something I always put in my ruby code and I'll be referencing it later.
Thursday, June 13, 2013
Replace ruby's URI parser with Addressable
Today I was trying to open a url in ruby that URI didn't like:
require 'open-uri'
open 'http://foo_bar.baz.com/'
generic.rb:213:in `initialize': the scheme http does not accept registry part: foo_bar.baz.com (or bad hostname?) (URI::InvalidURIError)
D'oh! This is a valid url, but sometimes URI can be a little bit old-fashioned about what to accept.
The solution? Addressable is a more RFC-conformant replacement for URI. But how to get open-uri and other libs to use it?
After poking around the internet for an hour or so and not coming up with anything I settled on this:
require 'addressable/uri'

class URI::Parser
  def split url
    a = Addressable::URI::parse url
    [a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
  end
end

open 'http://foo_bar.baz.com/'
Yay! No parse error (obviously the url still won't open because I made it up.)
Notice that I threw away 2 parts of the url, registry and opaque. These are things that addressable doesn't have and I never see anyway so I don't expect it to be a problem.
I'll just start putting this bit of code into all my projects from now on and we'll have to wait and see if it creates any problems.
Saturday, April 20, 2013
Finding a good proxy provider
The other day I was looking for a good free or cheap UK proxy for a scraper script I was writing for a client. My go-to proxy was an EC2 micro instance in the Ireland zone, but for some reason that wasn't good enough, so I went hunting around for one based in England.
I found a lot of free listings (xroxy.com), but all the http/s proxies on the lists were either down or not in the UK as they claimed.
Then I found these guys at $4 for a month, so I decided to take a chance.
I'm happy with the experience: a fast connection that doesn't get blocked by the sites that block EC2 traffic.
Watch out for the recurring charge, though. They will keep charging me until I cancel.
Oh yeah, I am not affiliated in any way, blah blah blah.
Monday, April 15, 2013
Scrape a Website in Php with a Network Cache
Sometimes it's helpful to use cached responses in a scraping project. Like when you're running a long scrape job and you're afraid the script will crash halfway through, or when you're fine-tuning a css selector or xpath expression 5 requests into your script and are losing productivity to the 30 second delay.
Now you can avoid those delays with PGBrowser's useCache property:
$b = new PGBrowser();
$b->useCache = true;
Let's try it out with a test I like to use for the forms functionality:
require 'pgbrowser.php';

$b = new PGBrowser();
$b->useCache = true;
$page = $b->get('http://www.google.com/');
$form = $page->form();
$form->set('q', 'foo');
$page = $form->submit();
echo preg_match('/foo - /', $page->title) ? 'success' : 'failure';
The first time you run this script it takes about 6 seconds. The responses get saved in a folder called cache, and the next run should only take about 1 second.
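If you want fresh responses later, the simplest thing (assuming nothing else lives in that folder) is to clear out the cache directory:
// remove every cached response so the next run hits the network again
array_map('unlink', glob('cache/*'));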
View the project or download the latest source.
Sunday, March 10, 2013
PGBrowser plus phpQuery or Simple Html Dom
Just dropping a quick notice that PGBrowser now lets you query pages with css selectors when used with phpquery or simple-html-dom:
require 'pgbrowser.php';
require 'phpquery.php';

$browser = new PGBrowser('phpquery');
$page = $browser->get('http://www.google.com/search?q=php');
foreach($page->search('li.g') as $li){
    echo $page->at('a', $li)->text() . "\n";
}
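A small extension of that example, pulling the link targets as well. This assumes at() hands back a phpQuery object, so phpQuery's attr() method is available; with the simple-html-dom backend the accessor names will differ:
foreach($page->search('li.g') as $li){
    $a = $page->at('a', $li);
    echo $a->text() . ' => ' . $a->attr('href') . "\n";
}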
Saturday, March 9, 2013
Ruby: Rate limiting concurrent downloads
Yesterday an interesting question was posed on stackoverflow, how to ensure your script doesn't scrape a website or API too fast when making concurrent requests. Like so many interesting questions, this one was deemed to be not a real question and closed by moderators. So today I'll share my thoughts on the subject here. Let's avoid the complication of using Event Machine for this one, which, I could argue, creates as many problems as it solves.
First we're going to set up a queue and some variables. We'll use open-uri for the downloads to make it easy:
require 'open-uri'

queue = [
  'http://www.google.com/',
  'http://www.bing.com/',
  'http://www.yahoo.com/',
  'http://www.wikipedia.com/',
  'http://www.amazon.com/'
]

num_threads = 3        # more is better, memory permitting
delay_per_request = 1  # in seconds
Next we create our threads and give them something to do. In a real script you'll need them to do something interesting but for this purpose they will just print out the url and response body size:
threads = []
num_threads.times do
  threads << Thread.new do
    Thread.exit unless url = queue.pop
    puts "#{url} is #{open(url).read.length} bytes long"
  end
end
Now that we have our threads we want to 'join' them. We also want to time them to see how long they took:
start = Time.now
threads.each{|t| t.join}
elapsed = Time.now - start
If they finished too quickly we need to take a short nap; otherwise we're free to continue processing the queue:
time_to_sleep = num_threads * delay_per_request - elapsed
if time_to_sleep > 0
  puts "sleeping for #{time_to_sleep} seconds"
  sleep time_to_sleep
end
Ok, so now it's time to put it all together and process the queue in a loop.
require 'open-uri'

queue = [
  'http://www.google.com/',
  'http://www.bing.com/',
  'http://www.yahoo.com/',
  'http://www.wikipedia.com/',
  'http://www.amazon.com/'
]

num_threads = 3        # more is better, memory permitting
delay_per_request = 1  # in seconds

until queue.empty?
  threads = []
  num_threads.times do
    threads << Thread.new do
      Thread.exit unless url = queue.pop
      puts "#{url} is #{open(url).read.length} bytes long"
    end
  end

  start = Time.now
  threads.each{|t| t.join}
  elapsed = Time.now - start

  time_to_sleep = num_threads * delay_per_request - elapsed
  if time_to_sleep > 0
    puts "sleeping for #{time_to_sleep} seconds"
    sleep time_to_sleep
  end
end
If you found this useful, let me know.
Tuesday, January 22, 2013
Give php strings easy to remember regex functionality
Do you ever get tired of looking up preg functions because you forgot the order of the arguments? Me too. Is it pattern, needle, haystack? Or needle, haystack, pattern?
Save yourself the headache and just include Phpstr:
require 'phpstr.php';
Return a match
str('There are 23 people reading this blog')->match('/\d+/');
Substitution
str('all of the es')->gsub('/e/', 'y');
Scan will return an array of matches
str('010 202 312 332')->scan('/\d+/');
Split will return an array of tokens
str('010 202-312 332')->split('/\s/');
Isn't that so much easier?
View the project or download the source.