PHP Goose - Article Extractor

Note

This repository has been archived as of 2023-09-05.

Intro

PHP Goose is a port of Goose originally developed in Java and converted to Scala by GravityLabs. Portions have also been ported from the Python port python-goose. Its mission is to take any news article or article type web page and not only extract what is the main body of the article but also all metadata and most probable image candidate.

The extraction goal is to try and get the purest extraction from the beginning of the article for servicing flipboard/pulse type applications that need to show the first snippet of a web article along with an image.

Goose will try to extract the following information:

Main text of an article
Main image of article
Any YouTube/Vimeo movies embedded in article
Meta Description
Meta tags
Publish Date

The PHP version was rewritten by:

Andrew Scott

Requirement

PHP 7.1 or later
PSR-4 compatible autoloader

The older 0.x versions with PHP 5.5+ support are still available under releases.

Install

This library is designed to be installed via Composer.

Add the dependency into your projects composer.json.

{
  "require": {
    "scotteh/php-goose": "^1.0"
  }
}

Download the composer.phar

curl -sS https://getcomposer.org/installer | php

Install the library.

php composer.phar install

Autoloading

This library requires an autoloader, if you aren't already using one you can include Composers autoloader.

require('vendor/autoload.php');

Usage

use \Goose\Client as GooseClient;

$goose = new GooseClient();
$article = $goose->extractContent('http://url.to/article');

$title = $article->getTitle();
$metaDescription = $article->getMetaDescription();
$metaKeywords = $article->getMetaKeywords();
$canonicalLink = $article->getCanonicalLink();
$domain = $article->getDomain();
$tags = $article->getTags();
$links = $article->getLinks();
$videos = $article->getVideos();
$articleText = $article->getCleanedArticleText();
$entities = $article->getPopularWords();
$image = $article->getTopImage();
$allImages = $article->getAllImages();

Configuration

All config options are not required and are optional. Default (fallback) values have been used below.

use \Goose\Client as GooseClient;

$goose = new GooseClient([
    // Language - Selects common word dictionary
    //   Supported languages (ISO 639-1):
    //     ar, cs, da, de, en, es, fi, fr, hu, id, it, ja,
    //     ko, nb, nl, no, pl, pt, ru, sv, vi, zh
    'language' => 'en',
    // Minimum image size (bytes)
    'image_min_bytes' => 4500,
    // Maximum image size (bytes)
    'image_max_bytes' => 5242880,
    // Minimum image size (pixels)
    'image_min_width' => 120,
    // Maximum image size (pixels)
    'image_min_height' => 120,
    // Fetch best image
    'image_fetch_best' => true,
    // Fetch all images
    'image_fetch_all' => false,
    // Guzzle configuration - All values are passed directly to Guzzle
    //   See: http://guzzle.readthedocs.io/en/stable/request-options.html
    'browser' => [
        'timeout' => 60,
        'connect_timeout' => 30
    ]
]);

Licensing

PHP Goose is licensed by Gravity.com under the Apache 2.0 license, see the LICENSE file for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 238 Commits
resources		resources
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
composer.json		composer.json
phpunit.xml.dist		phpunit.xml.dist

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PHP Goose - Article Extractor

Note

Intro

Requirement

Install

Autoloading

Usage

Configuration

Licensing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 24

Uh oh!

Languages

License

scotteh/php-goose

Folders and files

Latest commit

History

Repository files navigation

PHP Goose - Article Extractor

Note

Intro

Requirement

Install

Autoloading

Usage

Configuration

Licensing

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 24

Uh oh!

Languages

Packages