Sunday, March 10, 2013

PGBrowser plus phpQuery or Simple Html Dom

Just dropping a quick notice that PGBrowser now lets you query pages with CSS selectors when used with phpQuery or Simple HTML DOM:

<?php
require 'pgbrowser.php';
require 'phpquery.php';

$browser = new PGBrowser('phpquery');
$page = $browser->get('http://www.google.com/search?q=php');

// print the text of the first link in each search result
foreach ($page->search('li.g') as $li) {
  echo $page->at('a', $li)->text() . "\n";
}
Saturday, March 9, 2013
Ruby: Rate limiting concurrent downloads
Yesterday an interesting question was posed on Stack Overflow: how do you ensure your script doesn't scrape a website or API too fast when making concurrent requests? Like so many interesting questions, this one was deemed "not a real question" and closed by moderators, so today I'll share my thoughts on the subject here. Let's avoid the complication of EventMachine for this one; I could argue it creates as many problems as it solves.
First we're going to set up a queue and some variables. We'll use open-uri for the downloads to make it easy:
require 'open-uri'

queue = [
  'http://www.google.com/',
  'http://www.bing.com/',
  'http://www.yahoo.com/',
  'http://www.wikipedia.com/',
  'http://www.amazon.com/'
]
num_threads = 3 # more is better, memory permitting
delay_per_request = 1 # in seconds
Next we create our threads and give them something to do. In a real script you'd have them do something more interesting, but for this purpose each one will just print its URL and the size of the response body:
threads = []
num_threads.times do
  threads << Thread.new do
    # Array#pop returns nil once the queue is empty, so any spare threads exit
    Thread.exit unless url = queue.pop
    puts "#{url} is #{open(url).read.length} bytes long"
  end
end
Now that we have our threads we want to 'join' them. We also want to time them to see how long they took:
start = Time.now
threads.each{|t| t.join}
elapsed = Time.now - start
Each batch makes at most num_threads requests, so to keep the average rate down it should take at least num_threads * delay_per_request seconds. If the threads finished too quickly we take a short nap; otherwise we're free to continue processing the queue:
time_to_sleep = num_threads * delay_per_request - elapsed
if time_to_sleep > 0
  puts "sleeping for #{time_to_sleep} seconds"
  sleep time_to_sleep
end
OK, so now it's time to put it all together and process the queue in a loop:
require 'open-uri'

queue = [
  'http://www.google.com/',
  'http://www.bing.com/',
  'http://www.yahoo.com/',
  'http://www.wikipedia.com/',
  'http://www.amazon.com/'
]
num_threads = 3 # more is better, memory permitting
delay_per_request = 1 # in seconds

until queue.empty?
  # spawn a batch of threads, one request each
  threads = []
  num_threads.times do
    threads << Thread.new do
      Thread.exit unless url = queue.pop
      puts "#{url} is #{open(url).read.length} bytes long"
    end
  end
  # time the batch
  start = Time.now
  threads.each { |t| t.join }
  elapsed = Time.now - start
  # nap if the batch finished under its time budget
  time_to_sleep = num_threads * delay_per_request - elapsed
  if time_to_sleep > 0
    puts "sleeping for #{time_to_sleep} seconds"
    sleep time_to_sleep
  end
end
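One caveat: plain Array#pop is only safe here because MRI's interpreter lock serializes it; on JRuby or Rubinius you'd want the thread-safe Queue from the standard library instead. Here's a minimal sketch of that variant. The non-blocking pop(true) raises ThreadError on an empty queue, which takes the place of the nil check, and the shorter URL list is just for illustration:

require 'open-uri'
require 'thread' # Queue lives here on 1.9/2.0

queue = Queue.new
%w[http://www.google.com/ http://www.bing.com/ http://www.yahoo.com/].each { |url| queue << url }

num_threads = 3
delay_per_request = 1 # in seconds

until queue.empty?
  threads = Array.new(num_threads) do
    Thread.new do
      # pop(true) is non-blocking and raises ThreadError when the queue is empty
      url = (queue.pop(true) rescue nil)
      Thread.exit unless url
      puts "#{url} is #{open(url).read.length} bytes long"
    end
  end
  start = Time.now
  threads.each { |t| t.join }
  time_to_sleep = num_threads * delay_per_request - (Time.now - start)
  sleep time_to_sleep if time_to_sleep > 0
end

The behaviour is the same; it just stays correct if you move off MRI.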
If you found this useful, let me know.
Labels: api, asynchronous, concurrent, download, limit, rate, request, ruby, scrape