Saturday, November 3, 2012

Choosing a Php HTML parser

Today we'll be comparing some HTML parsing libraries for php and picking a winner. The ideal candidate will support css3 selectors, be DOM based easy to use. I've rounded up 3 candidates:
  • Simple HTML Dom - A favorite with most php users from what I've seen.
  • phpQuery - An interesting project, claiming to be a jQuery port.
  • Ganon - A new challenger.

The Setup

Mitt Romney
Barack Obama

The Test

# simple html dom
$doc = str_get_html($html);
echo $doc->find('label ~ span[id$=2]', 0)->innerText();

# phpquery
$doc = phpQuery::newDocumentHTML($html);
echo pq('label ~ span[id$=2]')->text();

# ganon
$doc = str_get_dom($html);
echo $doc('label ~ span[id$=2]', 0)->getInnerText();

The results

Simple html dom was the only one to fail the css3 test. This is clearly a deal breaker, without support for simple css like sibling selector (~), there's just no way I could justify looking any further at this library.

phpQuery passed the css3 test, its selection syntax feels the cleanest and it's Dom based, unlike the other two. I also like that it's PEAR installable.

Ganon also passed the css3 test, and it actually outperformed phpQuery for 10K iterations (7 seconds to phpQuery's 8). Definitely a strong contender.

The winner

phpQuery - Full css3 support, DOM based, PEAR installable, and a clean syntax edges out Ganon for the top spot this time.

1 comment: