Readability / HTML Content / Article Extractor & Web Scraping library written in PHP
Updated Apr 29, 2021 - PHP
Configuration for attribute selection uses JMESPath. Better documentation is needed to understand and use this configuration.
Additionally, provide some examples for different configurations.
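To give a rough idea of what path-based attribute selection looks like, here is a minimal standard-library sketch that imitates only the simplest JMESPath form: dotted identifiers with an optional list index. It is not the project's configuration format and not the real JMESPath implementation. The `select` helper and the `page` data are hypothetical names made up for illustration.

```python
import re

# One path step: an identifier optionally followed by a list index,
# e.g. "authors[0]". This is a tiny subset of JMESPath syntax.
_STEP = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)(?:\[(-?\d+)\])?$")


def select(data, path):
    """Resolve a dotted path such as 'article.authors[0].name'.

    Like JMESPath for this subset, a missing key or an out-of-range
    index yields None instead of raising an exception.
    """
    current = data
    for step in path.split("."):
        match = _STEP.match(step)
        if match is None:
            raise ValueError(f"unsupported path step: {step!r}")
        key, index = match.group(1), match.group(2)
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
        if index is not None:
            i = int(index)
            if not isinstance(current, list) or not -len(current) <= i < len(current):
                return None
            current = current[i]
    return current


# Hypothetical extracted document, used only for this illustration.
page = {
    "article": {
        "title": "Hello",
        "authors": [{"name": "Ann"}, {"name": "Bob"}],
    }
}

title = select(page, "article.title")
second_author = select(page, "article.authors[1].name")
missing = select(page, "article.summary")
```

Full JMESPath also supports projections, filters, and functions, so a real configuration can express much more than this subset.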
I have mostly tested trafilatura on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far. Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in [xpaths.py](https://github.com