How to scrape an HTML page with DOM and XPath


I decided to release this code snippet, just a PHP class, for extracting data from an HTML page.

The code is at https://github.com/danielecr/verygrabber and the Packagist package can be installed with

composer require smartango/verygrabber

Reference: https://packagist.org/packages/smartango/verygrabber

This is an old snippet that defines a grabber by XPath, exploiting the recursive nature of the DOM (Document Object Model) exposed by the PHP class DOMDocument.
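
As a rough illustration of the underlying mechanism (this is not verygrabber's own code), the sketch below uses only PHP's built-in DOMDocument and DOMXPath classes. The HTML string and the variable names are made up for the example. The point it tries to show is that an XPath expression can be evaluated relative to a context node, which is what makes nested extraction possible.

```php
<?php
// Illustrative only: plain DOMDocument + DOMXPath, not verygrabber's API.
// The sample HTML below is invented for this example.
$html = '<html><body><table id="t">'
      . '<tr><td class="name">Alice</td><td class="age">30</td></tr>'
      . '<tr><td class="name">Bob</td><td class="age">25</td></tr>'
      . '</table></body></html>';

$doc = new DOMDocument();
// Collect parser warnings instead of printing them; real pages are rarely valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$rows = [];
// An absolute XPath selects the repeated elements...
foreach ($xpath->query('//table[@id="t"]//tr') as $tr) {
    // ...and relative XPaths are evaluated with each element as context node.
    $rows[] = [
        'name' => trim($xpath->evaluate('string(td[@class="name"])', $tr)),
        'age'  => (int) $xpath->evaluate('string(td[@class="age"])', $tr),
    ];
}

print_r($rows);
```

Each row becomes an associative array, so the result can be passed straight to json_encode if needed.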

There is an undocumented feature: a global variable, $do_debug, that switches to interactive debugging when an error occurs on an object of type "array".

The tests directory contains an old HTML file for testing, and the associated JSON structure used to parse it.

This project dates from around 2016, and the code has not been updated since.

The schema.json format (see https://github.com/danielecr/verygrabber/blob/main/tests/data/schema.json) is very simple:

1. define a root object with attribute "type": "array" to point to the container of a table, a set of divs, or any kind of repeated elements
2. specify "repeatOn": "//xpath/to/containing/element" on the root object
3. define the grabbing and extracting directives in the "elementDef" attribute, possibly nesting "type": "array" if there are repeated elements inside the repeated element (as you guessed, this is where the recursive nature of the DOM is exploited)
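
The three steps above can be sketched as follows. Only "type", "repeatOn", and "elementDef" come from the description above. The shape of the entries inside "elementDef" (a field name mapped to an "xpath" key), the grab() function, and the sample HTML are assumptions made for this example, and they may differ from the real schema.json and class in the repository.

```php
<?php
// Hypothetical sketch: NOT the verygrabber implementation.
// "type", "repeatOn" and "elementDef" are named in the post; the inner
// layout of "elementDef" (field => {"xpath": ...}) is assumed here.
$schemaJson = <<<'SCHEMA'
{
  "type": "array",
  "repeatOn": "//div[@class='order']",
  "elementDef": {
    "customer": { "xpath": "string(h3)" },
    "items": {
      "type": "array",
      "repeatOn": ".//li",
      "elementDef": {
        "label": { "xpath": "string(.)" }
      }
    }
  }
}
SCHEMA;

// Invented sample page with a repeated element nested in a repeated element.
$html = <<<'PAGE'
<html><body>
<div class="order"><h3>Alice</h3><ul><li>pen</li><li>ink</li></ul></div>
<div class="order"><h3>Bob</h3><ul><li>paper</li></ul></div>
</body></html>
PAGE;

/**
 * Walk one "type": "array" definition, recursing into nested arrays.
 * $context is the node that relative XPaths are evaluated against.
 * Leaf "xpath" values are expected to be string() expressions.
 */
function grab(DOMXPath $xpath, array $def, ?DOMNode $context = null): array
{
    $result = [];
    foreach ($xpath->query($def['repeatOn'], $context) as $node) {
        $item = [];
        foreach ($def['elementDef'] as $field => $fieldDef) {
            if (($fieldDef['type'] ?? null) === 'array') {
                // Repeated elements inside a repeated element: recurse.
                $item[$field] = grab($xpath, $fieldDef, $node);
            } else {
                $item[$field] = trim($xpath->evaluate($fieldDef['xpath'], $node));
            }
        }
        $result[] = $item;
    }
    return $result;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$schema = json_decode($schemaJson, true);
$data = grab(new DOMXPath($doc), $schema);

echo json_encode($data, JSON_PRETTY_PRINT), "\n";
```

The recursion mirrors the DOM itself: every nested "type": "array" definition is resolved against the node matched by its parent, so the depth of the schema follows the depth of the markup.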

Although it is an old project, it has some interesting ideas, and some things that could be done better.

Improvements

… see the wiki in the GitHub project: https://github.com/danielecr/verygrabber/wiki/Verygrabber

If you are interested, fork the project and send a PR.

If you need assistance, adaptation, or anything else, contact me at daniele@smartango.com, or see the about page (https://smartango.com/about-2/) if you prefer other channels.
