PHP: Parsing HTML files with DOMDocument and DOMXpath
The DOMDocument PHP class allows us to take an HTML file or HTML text input and convert it into an object that can be easily traversed and queried similar to the way things are done in JavaScript.
Sample input
For the following examples we're working with a text input imported using the loadHTML() method, but you can just as easily import a local or remote HTML file using loadHTMLFile() instead.
The HTML is as follows. We're aiming to extract the links and text just from the H2 elements inside the .blogArticle sections of the page (the article title links below) and ignore all other links:
<?PHP
$htmlinput = <<<EOT
<a href="#content">skip to content</a>
<div id="content">
<h1>H1 Heading</h1>
<p>Introductory text <a href="intro-link1.html">link1</a> and <a href="intro-link2.html">link2</a>.</p>
<div class="blogArticle">
<h2><a href="article1.html">Article #1 Title</a></h2>
<p>Introductory text ... <a href="article1.html">more »</a></p>
</div>
<a href="#top">Top</a>
<div class="blogArticle">
<h2><a href="article2.html">Article #2 Title</a></h2>
<p>Introductory text ... <a href="article2.html">more »</a></p>
</div>
<a href="#top">Top</a>
<div class="blogArticle">
<h2><a href="article3.html">Article #3 Title</a></h2>
<p>Introductory text ... <a href="article3.html">more »</a></p>
</div>
<a href="#top">Top</a>
<div class="blogArticle">
<h2><a href="article4.html">Article #4 Title</a></h2>
<p>Introductory text ... <a href="article4.html">more »</a></p>
</div>
<a href="#top">Top</a>
<p>Footer text <a href="footer-link.html">link</a>.</p>
</div>
<p><a href="copyright.html">Copyright © 2014</a></p>
EOT;
?>
This task would be trivial using regular expressions, but in more complicated situations the DOM approach has certain advantages.
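As a minimal sketch of the file-based route mentioned earlier, the following writes some hypothetical markup to a temporary file and reads it back with loadHTMLFile(). The $path and $hrefs names and the sample markup are assumptions for illustration. libxml_use_internal_errors() is used here only so that any parser warnings are collected rather than printed:

```php
<?php
// Hypothetical sample markup, written to a temporary file for the demo.
$html = '<html><body><div id="content">'
      . '<h2><a href="article1.html">Article #1 Title</a></h2>'
      . '</div></body></html>';
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents($path, $html);

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new \DOMDocument();
$loaded = $doc->loadHTMLFile($path); // returns false if the file cannot be loaded

// Discard collected warnings, restore the previous setting, remove the file.
libxml_clear_errors();
libxml_use_internal_errors($previous);
unlink($path);

$hrefs = [];
if ($loaded) {
    foreach ($doc->getElementsByTagName("a") as $item) {
        $hrefs[] = $item->getAttribute("href");
    }
}
print_r($hrefs);
```

Once the document is loaded, the traversal code is the same whichever loading method was used.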
Finding all links in the document
To find and extract all links from an HTML document we use the getElementsByTagName method which we're familiar with from JavaScript:
<?PHP
$doc = new \DOMDocument();
$doc->loadHTML($htmlinput);
$links = [];
// all links in document
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}
?>
In this case all 17 links in the HTML are returned.
You'll notice that we've prefixed DOMDocument, and later DOMXpath, with a backslash (\). This keeps the code working inside a PHP namespace, where an unqualified class name would otherwise be resolved relative to the current namespace. The alternative is to import the classes with a use statement.
A slight improvement is to identify a containing element, in this case #content, and restrict the search to it using the getElementById method, which works much like its JavaScript counterpart:
<?PHP
$doc = new \DOMDocument();
$doc->loadHTML($htmlinput);
$links = [];
// all links in #content
$container = $doc->getElementById("content");
$arr = $container->getElementsByTagName("a");
foreach($arr as $item) {
    $href = $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}
?>
This now excludes any links outside of the #content container, leaving us with 15 links.
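One thing to watch for: getElementById returns null when no element has the requested id, so calling getElementsByTagName on the result would fail. The sketch below guards against that. The hrefsInContainer() helper and the sample markup are hypothetical, introduced only for illustration:

```php
<?php
// Hypothetical sample markup: one link inside #content, one outside.
$html = '<div id="content"><a href="a.html">A</a></div>'
      . '<p><a href="b.html">B</a></p>';

$doc = new \DOMDocument();
$doc->loadHTML($html);

// Hypothetical helper: returns the hrefs inside the element with the given id,
// or an empty array when no such element exists.
function hrefsInContainer(\DOMDocument $doc, string $id): array
{
    $container = $doc->getElementById($id);
    if ($container === null) {
        return [];
    }
    $hrefs = [];
    foreach ($container->getElementsByTagName("a") as $item) {
        $hrefs[] = $item->getAttribute("href");
    }
    return $hrefs;
}

$found   = hrefsInContainer($doc, "content"); // the id exists
$missing = hrefsInContainer($doc, "sidebar"); // no such id in the markup
```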
getElementsByClassName equivalent
There is no actual getElementsByClassName (yet) in DOMDocument, but the same results can be produced using DOMXpath as follows:
<?PHP
$doc = new \DOMDocument();
$doc->loadHTML($htmlinput);
$xpath = new \DOMXpath($doc);
$articles = $xpath->query('//div[@class="blogArticle"]');
$links = [];
// all links in .blogArticle
foreach($articles as $container) {
    $arr = $container->getElementsByTagName("a");
    foreach($arr as $item) {
        $href = $item->getAttribute("href");
        $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
        $links[] = [
            'href' => $href,
            'text' => $text
        ];
    }
}
?>
Whereas in the previous example we searched for links in #content - a single element - we're now searching for links within multiple .blogArticle sections of the page.
The most complicated element here is the DOMXpath query //div[@class="blogArticle"], which targets all DIV elements whose class attribute is exactly blogArticle. If an element carries several classes (for example class="blogArticle featured") the exact comparison will not match it, so in those cases the query will need refining.
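One common way to refine the class test, sketched here on hypothetical markup, is to pad the normalised class attribute with spaces and search for the class name surrounded by spaces. This matches blogArticle as a whole word among several classes, while ignoring look-alikes such as blogArticleList:

```php
<?php
// Hypothetical sample markup with multiple and similar class names.
$html = '<div class="blogArticle featured"><h2><a href="a.html">A</a></h2></div>'
      . '<div class="blogArticleList"><a href="b.html">B</a></div>'
      . '<div class="blogArticle"><h2><a href="c.html">C</a></h2></div>';

$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);

// Exact comparison: only matches class="blogArticle" on its own.
$exact = $xpath->query('//div[@class="blogArticle"]')->length;

// Whole-word comparison: " blogArticle " must appear in " <classes> ".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " blogArticle ")]';

$classes = [];
foreach ($xpath->query($query) as $container) {
    $classes[] = $container->getAttribute("class");
}
print_r($classes);
```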
When making DOMXpath queries within another element, start the query string with .// and pass the container node as the second argument. For example:
$xpath->query('.//div[@class="post-details"]', $container);
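To illustrate why the leading dot matters, here is a small sketch on hypothetical markup. With .// the search is limited to the container node. With a plain // the query starts from the document root again, even though a container is passed:

```php
<?php
// Hypothetical sample markup: two .post-details inside articles, one outside.
$html = '<div class="blogArticle"><div class="post-details">By Alice</div></div>'
      . '<div class="post-details">Outside any article</div>'
      . '<div class="blogArticle"><div class="post-details">By Bob</div></div>';

$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);

$details = [];
$unanchored = 0;
foreach ($xpath->query('//div[@class="blogArticle"]') as $container) {
    // The leading "." anchors the search at $container.
    foreach ($xpath->query('.//div[@class="post-details"]', $container) as $node) {
        $details[] = trim($node->nodeValue);
    }
    // Without the dot the whole document is searched, container or not.
    $unanchored = $xpath->query('//div[@class="post-details"]', $container)->length;
}
print_r($details);
```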
The final step
Now we need to single out just the links having an H2 as their parent:
<?PHP
$doc = new \DOMDocument();
$doc->loadHTML($htmlinput);
$xpath = new \DOMXpath($doc);
$articles = $xpath->query('//div[@class="blogArticle"]');
$links = [];
// all links in h2's in .blogArticle
foreach($articles as $container) {
    $arr = $container->getElementsByTagName("a");
    foreach($arr as $item) {
        if($item->parentNode->tagName == "h2") {
            $href = $item->getAttribute("href");
            $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
            $links[] = [
                'href' => $href,
                'text' => $text
            ];
        }
    }
}
?>
Finally, the result we're after. The $links array now contains just four links, matching the four article headings in the input HTML:
Array
(
    [0] => Array
        (
            [href] => article1.html
            [text] => Article #1 Title
        )

    [1] => Array
        (
            [href] => article2.html
            [text] => Article #2 Title
        )

    [2] => Array
        (
            [href] => article3.html
            [text] => Article #3 Title
        )

    [3] => Array
        (
            [href] => article4.html
            [text] => Article #4 Title
        )

)
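As an alternative sketch, not the method used above, the parent check can be folded into the XPath expression itself, so that a single query returns only the anchors sitting directly inside an H2 within a .blogArticle section. The shortened sample markup here is an assumption for illustration:

```php
<?php
// Shortened, hypothetical version of the sample input.
$html = '<div id="content">'
      . '<p><a href="intro-link1.html">link1</a></p>'
      . '<div class="blogArticle"><h2><a href="article1.html">Article #1 Title</a></h2>'
      . '<p>Introductory text ... <a href="article1.html">more</a></p></div>'
      . '<div class="blogArticle"><h2><a href="article2.html">Article #2 Title</a></h2>'
      . '<p>Introductory text ... <a href="article2.html">more</a></p></div>'
      . '</div>';

$doc = new \DOMDocument();
$doc->loadHTML($html);
$xpath = new \DOMXpath($doc);

$links = [];
// "h2/a" selects only anchors that are direct children of an H2.
foreach ($xpath->query('//div[@class="blogArticle"]//h2/a') as $item) {
    $links[] = [
        'href' => $item->getAttribute("href"),
        'text' => trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue)),
    ];
}
print_r($links);
```

Whether to filter in PHP or in the query is largely a matter of taste. The loop version is easier to step through, while the single query is more compact.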
An identical approach can be used to find images in HTML - searching for the IMG tag name and using getAttribute to extract the SRC and other attributes.
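A minimal sketch of the image case, using hypothetical markup and an $images array named for illustration:

```php
<?php
// Hypothetical sample markup containing two images.
$html = '<div id="content"><img src="photo1.jpg" alt="First photo">'
      . '<p>Text <img src="photo2.png" alt=""></p></div>';

$doc = new \DOMDocument();
$doc->loadHTML($html);

$images = [];
// all images in document
foreach ($doc->getElementsByTagName("img") as $item) {
    $images[] = [
        'src' => $item->getAttribute("src"),
        'alt' => $item->getAttribute("alt"), // empty string if missing or empty
    ];
}
print_r($images);
```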
If you're planning to use this code to spider websites, you should also read our related article on reading and obeying robots.txt.
References
- PHP.net: The DOMDocument class
- PHP.net: The DOMXPath class
JS 17 October, 2016
You should edit your tutorial because right now none of your examples are working
The examples are working, which you can see from the output displayed as it's generated in real-time. If you're getting errors you should check that you're inputting valid HTML.
Andre 23 January, 2015
HI,
Doesnt work anymore: PHP Fatal error: Call to undefined method DOMAttr::getAttribute()
You may be missing a PHP package such as php-xml containing the DOM function library.