The other day I was looking for a good free or cheap UK proxy for a scraper script I was writing for a client. My go-to proxy was an EC2 micro instance in the Ireland region, but for some reason this wasn't good enough, so I went hunting around for one based in England.
I found a lot of free listings (xroxy.com), but all the HTTP/HTTPS proxies on the list were either down or not in the UK like they claimed.
Then I found these guys, $4 for a month - I decided to take a chance.
I'm happy with the experience: a fast connection that doesn't get blocked by some of the sites that block EC2 traffic.
Watch out for the recurring charge though. I noticed that they will keep charging me until I cancel.
Oh yeah, I am not affiliated in any way, blah blah blah.
Saturday, April 20, 2013
Monday, April 15, 2013
Scrape a Website in PHP with a Network Cache
Sometimes it's helpful to use cached responses in a scraping project, like when you're running a long scrape job and you're afraid the script will crash halfway through, or when you're fine-tuning a CSS selector or XPath expression 5 requests into your script and are losing productivity to the 30-second delay on every run.
Now you can avoid those delays with PGBrowser's useCache property:
    $b = new PGBrowser();
    $b->useCache = true;
Let's try it out with a test I like to use for the forms functionality:
    require 'pgbrowser.php';

    $b = new PGBrowser();
    $b->useCache = true; // save responses to disk and reuse them on later runs

    $page = $b->get('http://www.google.com/');
    $form = $page->form();
    $form->set('q', 'foo');
    $page = $form->submit();

    echo preg_match('/foo - /', $page->title) ? 'success' : 'failure';
The first time you run this script it takes about 6 seconds. The responses get saved in a folder called cache, and the next time you run it, it should only take about 1 second.
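If you're curious how a cache like this can work, here's a rough sketch of the general idea. This is not PGBrowser's actual code: cachedGet, the $fetch callback, and the md5 file naming are all made up for illustration. It only keys on the URL, so a real cache would also need to account for the request method and any POST data (like the form submit above).

```php
<?php
// Hypothetical helper, NOT part of PGBrowser: return the saved response body
// for $url if there is one, otherwise call $fetch and save what it returns.
function cachedGet(string $url, callable $fetch, string $cacheDir = 'cache'): string
{
    // Create the cache folder on first use.
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }

    // Hash the URL so it becomes a safe, fixed-length file name.
    $file = $cacheDir . '/' . md5($url);

    if (is_file($file)) {
        return file_get_contents($file); // cache hit: no network request
    }

    $body = $fetch($url);                // cache miss: do the real request
    file_put_contents($file, $body);     // save it for the next run
    return $body;
}
```

Any callable that takes a URL and returns a string will do as the fetcher, for example cachedGet('http://www.google.com/', 'file_get_contents'). Delete the cache folder whenever you want fresh responses.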
View the project or download the latest source.