I decided to release this code snippet, just a PHP class, for extracting data from an HTML page.
The code is at https://github.com/danielecr/verygrabber, and the package can be installed from Packagist with
composer require smartango/verygrabber
ref. https://packagist.org/packages/smartango/verygrabber
This is an old snippet that defines a grabber through XPath, exploiting the recursive nature of the DOM (Document Object Model) as exposed by the PHP class DOMDocument.
There is an undocumented feature, “$do_debug”: a global variable that switches to interactive debugging when an error occurs on the specific object type “array”.
The tests directory contains an old HTML file for testing, along with the associated JSON structure used to parse it.
This project dates from around 2016, and the code has not been updated since.
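To make the idea concrete, here is a minimal sketch (not the code from verygrabber) of how DOMXPath can evaluate a query relative to a context node, which is what makes nested, recursive extraction natural. The HTML fragment and the variable names are made up for illustration.

```php
<?php
// Illustrative HTML, not taken from the project's tests directory.
$html = '<table><tr><td>Ada</td><td>36</td></tr><tr><td>Alan</td><td>41</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
// Outer query: select every repeated element (here, each table row).
foreach ($xpath->query('//table//tr') as $row) {
    // Inner queries are relative ("./...") and evaluated against $row,
    // so the same logic can be reapplied at any depth of the tree.
    $rows[] = [
        'name' => $xpath->query('./td[1]', $row)->item(0)->textContent,
        'age'  => $xpath->query('./td[2]', $row)->item(0)->textContent,
    ];
}

print_r($rows);
```

The second argument of DOMXPath::query() is the context node; passing the current row keeps each inner lookup scoped to that row only.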
The schema.json format is very simple:
- define a root object with the attribute
"type": "array"
to point to the container of a table, a set of divs, or any kind of repeated elements
- specify
"repeatOn": "//xpath/to/containing/element"
on the root object
- define in the “elementDef” attribute the grabbing and extracting directives, possibly nesting type:array if there are repeated elements inside the repeated element (as you guessed, this is where the recursive nature of the DOM is exploited)
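The three rules above can be sketched as a small recursive interpreter. This is only an illustration under stated assumptions, not the actual verygrabber class: the keys "type", "repeatOn" and "elementDef" come from the description above, while the per-field "xpath" key, the grab() function, and the sample HTML are hypothetical. Check the JSON file in the tests directory for the real directive format.

```php
<?php
// Hypothetical schema: "type", "repeatOn" and "elementDef" follow the text;
// the leaf-level "xpath" key is an assumption made for this sketch.
$schemaJson = <<<'JSON'
{
  "type": "array",
  "repeatOn": "//ul[@id='authors']/li",
  "elementDef": {
    "name": { "xpath": "./span[@class='name']" },
    "books": {
      "type": "array",
      "repeatOn": "./ul/li",
      "elementDef": {
        "title": { "xpath": "." }
      }
    }
  }
}
JSON;

// Made-up HTML with repeated elements nested inside repeated elements.
$html = <<<'HTML'
<ul id="authors">
  <li><span class="name">Ada</span><ul><li>Notes</li><li>Letters</li></ul></li>
  <li><span class="name">Alan</span><ul><li>Papers</li></ul></li>
</ul>
HTML;

/** Recursively apply a schema node to a context node (sketch only). */
function grab(DOMXPath $xpath, array $def, ?DOMNode $context = null)
{
    if (($def['type'] ?? null) === 'array') {
        $items = [];
        // Each node matched by "repeatOn" becomes the context for its fields.
        foreach ($xpath->query($def['repeatOn'], $context) as $node) {
            $item = [];
            foreach ($def['elementDef'] as $field => $fieldDef) {
                $item[$field] = grab($xpath, $fieldDef, $node);
            }
            $items[] = $item;
        }
        return $items;
    }
    // Leaf: text of the first node matched relative to the context, or null.
    $match = $xpath->query($def['xpath'], $context)->item(0);
    return $match ? trim($match->textContent) : null;
}

$doc = new DOMDocument();
$doc->loadHTML($html);
$result = grab(new DOMXPath($doc), json_decode($schemaJson, true));
echo json_encode($result, JSON_PRETTY_PRINT), PHP_EOL;
```

A nested "type": "array" simply triggers another call to grab() with the current element as context, so the schema can mirror the depth of the page.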
Although it is an old project, it contains some interesting ideas, as well as things that could be done better.
Improvements
… see the wiki in the GitHub project
If you are interested, fork the project and send a PR.
If you need assistance, adaptation, or anything else, contact me at daniele@smartango.com, or refer to the about page if you prefer other channels.