ExtMainText -- Extract the main text from an HTML document

1. Introduction
2. Download
3. Change Log
4. Usage
5. Reference

1. Introduction

ExtMainText is a Python library that extracts the main text from an HTML document. In other words, it filters out ads, menus, and other parts of the page that are not main text. With it, you can get the bare content of articles from sites, and spiders can focus on the most valuable part of each web page.

The implementation is based on text density, which version 0.2 defines as "character length of plain text / character length of text and tags combined". ExtMainText picks the largest HTML fragment whose density meets a given threshold and treats that fragment as the main text. A detailed description of the algorithm can be found at https://ai-depot.net/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/ .

Choose a threshold that suits each site, because different sites use different themes. In general, a threshold of 0.5 may be fine for English pages.
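The density rule and threshold described above can be sketched with the standard library alone. This is a minimal illustration, not the code shipped in ExtMainText (which uses lxml since version 0.2): the names `Node`, `DensityParser`, and `pick_main_fragment` are hypothetical, fragment size is assumed to mean text plus tag characters, and script/style content is assumed to count as markup.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element, with character counts for its own text and markup."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this element
        self.tag_len = 0   # characters of this element's own tags

    def totals(self):
        """Return (text chars, tag chars) for this element and its descendants."""
        text, tags = self.text_len, self.tag_len
        for child in self.children:
            child_text, child_tags = child.totals()
            text += child_text
            tags += child_tags
        return text, tags

    def density(self):
        """Plain-text characters divided by text-plus-tag characters."""
        text, tags = self.totals()
        return text / (text + tags) if text + tags else 0.0

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


class DensityParser(HTMLParser):
    """Builds a light element tree while counting text and tag characters."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        node.tag_len = len(self.get_starttag_text())
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            node.tag_len += len(tag) + 3  # "</" + tag + ">"
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in ("script", "style"):
            self.current.tag_len += len(data)  # assumption: code is not text
        else:
            self.current.text_len += len(data.strip())


def pick_main_fragment(html, threshold=0.5):
    """Return the largest element whose density reaches the threshold, or None."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    candidates = [node for node in parser.root.walk()
                  if node is not parser.root and node.density() >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda node: sum(node.totals()))
```

Calling `pick_main_fragment(page, 0.5)` on a page with a link-heavy menu and a text-heavy article block should return the article's element, because the menu drags the density of the enclosing `body` below the threshold. On pages with little markup around the article, a higher threshold may be needed to keep the whole `body` from qualifying.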

ExtMainText outputs an HTML fragment. If you want plain text, you can pass the result through html2text.

Chinese documentation: see MyProject.ExtMainText

2. Download

Version 0.2a: Attach:MyProject/ExtMainText_0.2a.zip
Version 0.2: Attach:MyProject/ExtMainText_0.2.zip
Version 0.1a: Attach:MyProject/ExtMainText_0.1a.zip
Version 0.1: Attach:MyProject/ExtMainText_0.1.zip

3. Change Log

2010-01-26: Added a workaround for possible encoding errors in HTML documents.
2009-12-05: Switched to the lxml API, updated the density equation, and added a new filter mode.
2008-10-21: Added html2text code to the "__main__" part, which improves the output of direct execution. Published as version 0.1a.
2008-10-19: Initial implementation. Published as version 0.1.

4. Usage

See the docstring and the " if __name__ == '__main__' " part of the source code. You can also run the script directly as follows:

python ExtMainText.py HtmlFileName

The script then extracts the main text of the input file using the default threshold of 0.5 and prints it as plain text.

Pay attention to the encoding of the input HTML string. Since version 0.2a, ExtMainText accepts only unicode strings as input; that is, you have to decode HTML documents before calling ExtMainText.
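One way to do that decoding step is sketched below. The helper name `decode_html` is hypothetical and not part of ExtMainText; it assumes the page declares its charset in a `<meta>` tag near the top and falls back to UTF-8 otherwise. The sketch is written for Python 3, where a unicode string is simply `str`.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def decode_html(raw, default="utf-8"):
    """Turn raw HTML bytes into a unicode string."""
    match = META_CHARSET_RE.search(raw[:4096])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back to the default and
        # replace any bytes that cannot be decoded.
        return raw.decode(default, errors="replace")
```

The returned string is what would then be handed to ExtMainText. Real pages may also declare their encoding in HTTP headers, which this sketch does not look at.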

5. Reference

Introducing artificial intelligence algorithms into this approach can lead to more accurate results. See The Easy Way to Extract Useful Text from Arbitrary HTML. Thanks to lanphaday for mentioning such a good article!

Elias' Personal Web Site
