Showing posts with label css. Show all posts

Saturday, November 3, 2012

Choosing a PHP HTML parser

Today we'll be comparing some HTML parsing libraries for PHP and picking a winner. The ideal candidate will support CSS3 selectors, be DOM-based, and be easy to use. I've rounded up three candidates:
  • Simple HTML DOM - A favorite with most PHP users from what I've seen.
  • phpQuery - An interesting project, claiming to be a jQuery port.
  • Ganon - A new challenger.

The Setup


Mitt Romney
Barack Obama

The Test

# simple html dom
require('simple_html_dom.php');
$doc = str_get_html($html);
echo $doc->find('label ~ span[id$=2]', 0)->innerText();

# phpquery
require('phpQuery.php');
$doc = phpQuery::newDocumentHTML($html);
phpQuery::selectDocument($doc);
echo pq('label ~ span[id$=2]')->text();

# ganon
require('ganon.php');
$doc = str_get_dom($html);
echo $doc('label ~ span[id$=2]', 0)->getInnerText();

The results

Simple HTML DOM was the only one to fail the CSS3 test. This is clearly a deal breaker: without support for basic CSS such as the general sibling selector (~), there's just no way I could justify looking any further at this library.

phpQuery passed the CSS3 test. Its selection syntax feels the cleanest, and it's DOM-based, unlike the other two. I also like that it's PEAR installable.

Ganon also passed the CSS3 test, and it actually outperformed phpQuery over 10,000 iterations (7 seconds to phpQuery's 8). Definitely a strong contender.

The winner

phpQuery - Full CSS3 support, a DOM-based design, PEAR installation, and a clean syntax put it ahead of Ganon for the top spot this time.

Friday, November 2, 2012

Do it the right way with CSS

Looking around at other people's scraping projects, I still see a lot of people doing it the wrong way. And I'm not talking about parsing HTML with regex this time. I'm talking about using XPath expressions instead of CSS selectors. Let's take a look:
Barack Obama

There are two ways for me to get at the data I want:
# using xpath
doc.at('//label[@for="name"]/following-sibling::span').text
# using css
doc.at('label[for=name] + span').text

So which one is better? Unless you're a machine, the answer is always CSS. Because CSS is a human-friendly way to select the data you want, your code will be easier to maintain than the hot mess created with XPath expressions. There's a good reason why web designers have been using it for so long.