ScraperBlog

Sunday, June 5, 2016

What Manta is Doing to Prevent Scraping

Manta is partnered with a CDN called Distil Networks. You can read about what they're doing to discourage scraping here:

https://www.linkedin.com/pulse/what-manta-doing-prevent-scraping-p-guardiario

Monday, August 18, 2014

How to Find Good Free Site-Specific Proxies

I'm always looking for a source of good free proxies for my scraping projects. Have you ever tried scraping a site that blocks by geography or limits requests per IP address? If so, you will appreciate this post.

I recently stumbled upon Jrim's Proxy Multiply, and it's now my go-to tool for grabbing some quick free proxies. I just hit the "Get proxies!" button and within an hour or two I've got 3,000 tested, working proxies.

And if you need some site-specific proxies, it can do that too. Just follow these steps:

open the Configure menu
go to HTTP Proxy Testing
check "Check proxies against a specific url" and click configure
in the dialog that opens, put the homepage (http://www.manta.com, http://www.yelp.com, etc.) in the "Url to Navigate to" textfield
in "Content to look for in the page", put any text that will always show up on a page (I like to use the google-analytics id)
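
If you want to re-check a proxy against a site from your own script, the same idea (fetch the page through the proxy, then look for a known piece of text) can be sketched in PHP with curl. This is only my guess at the kind of check the tool performs, not its actual implementation; the function names, the timeout, and the sample marker are all made up for illustration.

```php
<?php
// Hypothetical helper: does the fetched HTML contain the marker text?
function page_has_marker($html, $marker) {
    return is_string($html) && $html !== '' && strpos($html, $marker) !== false;
}

// Hypothetical helper: fetch $url through $proxy ("host:port") using curl.
// Returns the response body, or false if the request failed.
function fetch_via_proxy($url, $proxy, $timeout = 10) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    return curl_exec($ch);
}

// A proxy counts as good for a site if the marker shows up in the page it returns.
function proxy_works_for_site($proxy, $url, $marker) {
    return page_has_marker(fetch_via_proxy($url, $proxy), $marker);
}
```

You would call it as proxy_works_for_site('1.2.3.4:8080', 'http://www.yelp.com', 'UA-XXXXX') with your own proxy and marker text; both values here are placeholders.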

Thursday, November 28, 2013

How to Import Configurable Product Data into Magento

This is part 2 of a 2-part tutorial on scraping configurable product data, which began here. If you're not interested in the scraping aspect of this project, you can download the Magento sample data here. Warning: this sample data might be NSFW if you have an uptight boss.

The first thing we need to do is set up our categories. Open the products spreadsheet in Excel or something similar, highlight the _category column (column L), and copy it. Paste it into a text editor, sort the values, and remove duplicates. Each unique value needs to be a category in Magento or you will get an import error. When you create the categories, set them to 'Enabled' so they show up in your storefront. Drag all your categories into the "Default Category" root category after you make them.

Next we need to set up the attributes. Copy columns _super_attribute_code and _super_attribute_option (S and T) into your text editor and sort/remove duplicates. You'll see unique values for 2 attributes, color and size. These both need to be set up in Magento.
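
If you'd rather not do the copy / sort / remove-duplicates dance by hand, here is a minimal PHP sketch that pulls the unique values of one column out of the spreadsheet. It assumes the spreadsheet is saved as a CSV file with a header row; the function name is my own and is not part of the project source.

```php
<?php
// Collect the sorted, de-duplicated, non-empty values of one named column
// from a CSV file whose first row is a header.
function unique_column_values($path, $column) {
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("cannot open $path");
    }
    $header = fgetcsv($fh, 0, ',', '"', '\\');
    $index = is_array($header) ? array_search($column, $header, true) : false;
    if ($index === false) {
        fclose($fh);
        throw new RuntimeException("column $column not found");
    }
    $values = array();
    while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
        if (isset($row[$index]) && trim($row[$index]) !== '') {
            $values[trim($row[$index])] = true;   // array keys de-duplicate for us
        }
    }
    fclose($fh);
    $values = array_map('strval', array_keys($values)); // numeric keys come back as ints
    sort($values, SORT_STRING);
    return $values;
}
```

Calling unique_column_values('products.csv', '_category') gives the list of categories to create, and the same call with '_super_attribute_option' gives the attribute options.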

So go to Catalog / Attributes / Manage Attributes, create a new color attribute if it doesn't exist, and set:

Scope - Global
Catalog Input Type for Store Owner - Dropdown
Use To Create Configurable Product - Yes

Next click on Manage Label / Options and add your color options. Then do the same for size.

Now you're all set for a clean import. Head over to System - Import / Export - Import, select Products, and upload the spreadsheet. If you are using Magento Go you can import the Product Images in the same way. Otherwise you'll want to FTP those to your media folder.

Wednesday, November 20, 2013

Scrape a Website to Magento Configurable Product Import Format

Today I'm going to show how to scrape store products and export them to Magento's import format while keeping the associated configurable product options. Like most things that involve Magento, this required a lot of patience and trial and error.

The goals of the project are:

Learn how to scrape ecommerce data to Magento's configurable product import format
Get some sexy Magento sample store data for use in future testing and mock-ups

The project will use 2 libraries: a simple CSV class, and PGBrowser for the scraping. For the sake of simplicity (or not, depending on your point of view) I will use xpath expressions to get the data rather than Simple Html Dom or phpQuery. The full source for this project can be downloaded here.

Let's go over some of the code. First we instantiate our CSV object (yes, it's a global variable. I'm okay with that.) Then we load the listings page and iterate through each listing. Pretty self explanatory so far.

$csv = new CSV('products.csv', $fields, ",", null); // no utf-8 BOM

// and start scraping
$url = 'http://www.spicylingerie.com/';
$page = $browser->get($url);

foreach($page->search('//div[@class="fp-pro-name"]/a') as $a){
  scrape($a);
  echo '.';
}

So now we pass the a elements that have the details page urls to our scrape function. Because we earlier did $browser->convertUrls = true, we no longer need to worry about converting our relative hrefs to absolute urls. The library took care of that for us.



Now we get the page for the link and start building our $item array which we will pass to the save() function. Other than the ugly expression for description this was easy.





$url = $a->getAttribute('href');

$page = $browser->get($url);
$item = array();
$item['name'] = trim($a->nodeValue);
$item['description'] = $item['short_description'] = trim($page->at('//div[@class="pro-det-head"]/h4/text()[normalize-space()][position()=last()]')->nodeValue);

if(!preg_match('/Sale price: \$(\d+\.\d{2})/', $page->body, $m)) die('missing price!');
$item['price'] = $m[1];

if(!preg_match('/Style# : ([\w-]+)/', $page->body, $m)) die('missing sku!');
$item['sku'] = $m[1];
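
To see what those two regular expressions capture, here is a small standalone check. The $body string is a made-up fragment standing in for $page->body, not real markup from the site, and the two helper functions are my own wrappers that return null instead of calling die().

```php
<?php
// Made-up fragment standing in for $page->body.
$body = 'Style# : SL-1234 <b>Sale price: $24.95</b>';

// Same pattern as above: a dollar sign, digits, a dot, and exactly two decimals.
function extract_price($body) {
    return preg_match('/Sale price: \$(\d+\.\d{2})/', $body, $m) ? $m[1] : null;
}

// Same pattern as above: word characters and hyphens after "Style# : ".
function extract_sku($body) {
    return preg_match('/Style# : ([\w-]+)/', $body, $m) ? $m[1] : null;
}

$price = extract_price($body);
$sku = extract_sku($body);
```

Note that the price pattern does not allow a thousands separator, so a price written like $1,299.00 would not match and you would hit the 'missing price!' branch.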


Next we save the image for later import/upload, identify the categories we care about, and construct our items. The options need to look like:





$options = array(
 array('size' => '12', 'color' => 'purple'),
 array('size' => '10', 'color' => 'yellow')
);



Here the array keys are the attributes that you have set up as configurable product attributes (Global, Dropdown, Is used in Configurable Products).
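
How you collect the options depends on the product page. One simple case is a page with separate size and color lists where every combination is for sale. Under that assumption (which may not hold for every store), a sketch of building the $options array could look like this; build_options is a hypothetical helper, not part of the project source.

```php
<?php
// Hypothetical helper: one option row per size/color combination.
// Assumes every size is available in every color.
function build_options(array $sizes, array $colors) {
    $options = array();
    foreach ($sizes as $size) {
        foreach ($colors as $color) {
            $options[] = array('size' => $size, 'color' => $color);
        }
    }
    return $options;
}

$options = build_options(array('10', '12'), array('purple', 'yellow'));
```

If the store only sells certain combinations, you would need to scrape the valid pairs instead of multiplying the two lists.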



That's all there is to it. I won't go into the save function because hopefully that one will just work for you.




Posted by pguardiario at 5:28 PM



Labels: configurable product, csv, import, magento, pgbrowser, php


        

          
        
Tuesday, July 9, 2013

          
        




Php Curl Multipart form posting





Let's say you want to post some data to a form using curl:




$url = 'http://www.example.com/';
$data = array('foo' => '1', 'bar' => '2');



Ordinarily you create the post body using http_build_query(). But let's say the form is expecting the form data to be multipart encoded. Now we've got a challenge. First let's create a function to do the multipart encoding:




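function multipart_build_query($fields, $boundary){
  $retval = '';
  foreach($fields as $key => $value){
    // each part: boundary line, header, blank line, value (multipart lines end in CRLF)
    $retval .= "--$boundary\r\nContent-Disposition: form-data; name=\"$key\"\r\n\r\n$value\r\n";
  }
  // the closing boundary has two extra dashes at the end
  $retval .= "--$boundary--\r\n";
  return $retval;
}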



The boundary is a string that separates the fields. It can be any string you want but you should choose something long enough that it won't randomly show up in your data.




$boundary = '--myboundary-xxx';
$body = multipart_build_query($data, $boundary);
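
If you'd rather not hand-pick the boundary, you can generate a random one and double-check that it doesn't occur in any of the values. This is just a sketch: make_boundary is a name I made up, and random_bytes() needs PHP 7 or later.

```php
<?php
// Hypothetical helper: build a random boundary, retrying in the unlikely
// case that it appears inside one of the field values.
function make_boundary(array $fields) {
    do {
        $boundary = 'boundary-' . bin2hex(random_bytes(16));
        $collision = false;
        foreach ($fields as $value) {
            if (strpos((string) $value, $boundary) !== false) {
                $collision = true;
                break;
            }
        }
    } while ($collision);
    return $boundary;
}

$data = array('foo' => '1', 'bar' => '2');
$boundary = make_boundary($data);
```

The result is 41 characters long, which stays under the 70-character limit that the MIME spec puts on boundaries.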





Now make your curl post, but remember to set the content type:


$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: multipart/form-data; boundary=$boundary"));
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
$response = curl_exec($ch);



Of course, you could have just used PGBrowser to submit the form, and it will automatically detect when it needs to use multipart encoding. But that would be too easy.




Posted by pguardiario at 11:10 PM



Labels: curl, encoding, form, multipart, pgbrowser, php









          
        

          
        
Sunday, July 7, 2013

          
        




Php - scrape website with rotating proxies





Have you ever tried to scrape a website that imposes per-IP-address request limits? Let's take a look at how we can get around that in PHP by rotating through an array of proxies. This tutorial uses the PGBrowser scraping library.

First we will set up our proxies:




$proxies = array(
  'proxy1.com:80', 
  'proxy2.com:80', 
  'proxy3.com:80'
);




Next we want to define what a good response looks like. You might want to check the title or the full html for a specific string. For this tutorial, though, we'll count any page whose html is longer than 0 bytes as good.





function response_ok($page){
  return strlen($page->html) > 0;
}




Now instead of using $browser->get(), we are going to use a custom get() function that checks responses and rotates proxies as needed:




function get($url){
  global $browser, $proxies;
  $page = $browser->get($url);
  
  while(!response_ok($page)){
    $proxy = array_shift($proxies); # grab the first proxy
    array_push($proxies, $proxy); # push it back to the end
    echo "switching proxy to $proxy\n";
    list($host,$port) = explode(':', $proxy);
    $browser->setProxy($host, $port);
    $page = $browser->get($url);
  }
  return $page;
}
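
One thing to watch for: if none of the proxies ever returns a good response, the loop above never ends. Here is a sketch of a variant that gives up after a fixed number of attempts. To keep it runnable without a network, it takes the fetching and checking steps as callables; get_with_rotation, $fetch, $ok and $maxAttempts are all names I made up, and unlike the version above it makes its first try with no proxy at all rather than with whatever proxy the browser last used.

```php
<?php
// Hypothetical variant of get() with an attempt limit.
// $fetch($url, $proxy) returns a page; $proxy is null on the first try.
// $ok($page) returns true when the page counts as a good response.
function get_with_rotation($url, array &$proxies, $fetch, $ok, $maxAttempts = 10) {
    $page = call_user_func($fetch, $url, null);
    for ($attempt = 0; $attempt < $maxAttempts && !call_user_func($ok, $page); $attempt++) {
        if (empty($proxies)) {
            break;                        // nothing left to rotate through
        }
        $proxy = array_shift($proxies);   // grab the first proxy
        $proxies[] = $proxy;              // push it back to the end
        $page = call_user_func($fetch, $url, $proxy);
    }
    return call_user_func($ok, $page) ? $page : null;  // null means we gave up
}
```

With PGBrowser, $fetch would be a closure that calls $browser->setProxy($host, $port) when it is given a proxy and then returns $browser->get($url), and $ok would be response_ok.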





That's it, now we just call get() in the way we would ordinarily call $browser->get():


$browser = new PGBrowser();
$page = get('http://myurl.com'); # scrape with rotating proxy support





Posted by pguardiario at 7:49 PM



Labels: pgbrowser, php, proxies, rotating, scrape









          
        

          
        
Saturday, June 15, 2013

          
        




Ruby - keep track of urls you've already visited





How do I keep track of urls that I've already visited in my ruby projects? Here's what I do:



def visited? url
  @visited ||= []
  return true if @visited.include? url
  @visited << url
  false
end



Now I've got a visited? method that tells me if I've already called it on a url:



visited?('http://www.google.com')
#=> false
visited?('http://www.google.com')
#=> true



That makes it easy to just do something like:



scrape(url) unless visited?(url)

This was a short post but this is something I always put in my ruby code and I'll be referencing it later.




Posted by pguardiario at 8:43 PM



Labels: ruby, scrape, url, visited



