Fast Python port of arc90's readability tool, updated to match the latest readability.js.

python-readability

Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of arc90's readability project.

Installation

Installation is easy with pip. Just run:

$ pip install readability-lxml

Usage

>>> import requests
>>> from readability import Document

>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

>>> doc.summary()
"""<html><body><div><body id="readabilityBody">\n<div>\n    <h1>Example Domain</h1>\n
<p>This domain is established to be used for illustrative examples in documents. You may
use this\n    domain in examples without prior coordination or asking for permission.</p>
\n    <p><a href="http://www.iana.org/domains/example">More information...</a></p>\n</div>
\n</body>\n</div></body></html>"""
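
Since summary() returns HTML rather than plain text, a common follow-up step is to strip the markup. The sketch below shows one possible way to do that with only the standard library's html.parser module. The names summary_to_text and _TextCollector are hypothetical helpers made up for this illustration and are not part of readability; the sample string is the summary() output shown above.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Collapse runs of whitespace (including newlines) into single spaces.
        text = " ".join(data.split())
        if text:
            self.chunks.append(text)


def summary_to_text(summary_html):
    """Hypothetical helper: reduce summary() HTML to newline-separated text."""
    collector = _TextCollector()
    collector.feed(summary_html)
    collector.close()
    return "\n".join(collector.chunks)


# The summary() output from the usage example, written as one string.
sample = (
    '<html><body><div><body id="readabilityBody">\n<div>\n'
    "    <h1>Example Domain</h1>\n"
    "<p>This domain is established to be used for illustrative examples in "
    "documents. You may use this\n    domain in examples without prior "
    "coordination or asking for permission.</p>\n"
    '    <p><a href="http://www.iana.org/domains/example">'
    "More information...</a></p>\n"
    "</div>\n</body>\n</div></body></html>"
)

print(summary_to_text(sample))
```

In real use, the argument would be the return value of doc.summary() instead of the hard-coded sample. Each text node ends up on its own line, so the heading, the paragraph, and the link text are printed separately.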

Change Log

  • 0.8beta Replaced XHTML output with HTML5 output in summary() call.
  • 0.7.1 Support for Python 3.7. Fixed a slowdown when processing documents with lots of spaces.
  • 0.7 Improved HTML5 tags handling. Fixed stripping unwanted HTML nodes (only first matching node was removed before).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, and 3.3 to 3.6
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.4 Added video loading and allowed more images per paragraph
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords

Licensing

This code is licensed under the Apache License 2.0.

Thanks to
