Boilerplate Removal and Fulltext Extraction from HTML pages

janih/boilerpipe

boilerpipe

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library ships with ready-made strategies for common tasks (for example, news article extraction) and can be extended for individual problem settings.

Extraction is very fast (on the order of milliseconds), needs only the input document (no global or site-level information), and is usually quite accurate.
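To make the idea of document-local boilerplate detection concrete, here is a minimal, self-contained sketch. It is not boilerpipe's API or its actual algorithm. It is a toy filter that assumes two shallow per-block features (word count and link density) are enough to separate navigation clutter from body text. The class name `ToyBoilerplateFilter`, the method names, and both thresholds are hypothetical and chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: NOT part of boilerpipe and not its real classifier.
public class ToyBoilerplateFilter {
    // Hypothetical thresholds, picked by hand for this example.
    public static final int MIN_WORDS = 10;
    public static final double MAX_LINK_DENSITY = 0.3;

    // Tags treated as block boundaries (a simplifying assumption).
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br|section|article|nav|header|footer)\\b[^>]*>");
    // Anchor elements, used to measure how much of a block is link text.
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    // Any remaining tag, stripped before counting words.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Counts whitespace-separated words after removing markup.
    public static int countWords(String markup) {
        String text = TAG.matcher(markup).replaceAll(" ").trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    // Keeps blocks that are long enough and not dominated by link text.
    // Only the input document is inspected; no site-level data is used.
    public static String extractMainText(String html) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            int words = countWords(block);
            if (words < MIN_WORDS) {
                continue; // short blocks are treated as clutter
            }
            int linkedWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkedWords += countWords(m.group(1));
            }
            double linkDensity = (double) linkedWords / words;
            if (linkDensity <= MAX_LINK_DENSITY) {
                String text = TAG.matcher(block).replaceAll(" ").trim();
                kept.add(text.replaceAll("\\s+", " "));
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html =
            "<div><a href=\"/\">Home</a> <a href=\"/news\">News</a></div>"
            + "<p>The boilerpipe library detects and removes the clutter around "
            + "the main textual content of a web page.</p>"
            + "<div><a href=\"/terms\">Terms</a></div>";
        System.out.println(extractMainText(html));
    }
}
```

In this sketch the two link-only `div` blocks are dropped because they are short, while the paragraph is kept. A regex-based splitter like this is fragile on real-world HTML. For actual use, the library's own extractors are the appropriate tool.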

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

Automatically exported from code.google.com/p/boilerpipe
