Mercury Parser - Extracting content from chaos

Mercury Parser extracts the bits that humans care about from any URL you give it. That includes article content, titles, authors, published dates, excerpts, lead images, and more.

Mercury Parser also lets you create custom parsers using simple JavaScript and CSS selectors, so you can proactively handle parsing and migration edge cases. Many examples are available, along with documentation.

How? Like this.

Installation

# If you're using yarn
yarn add @jocmp/mercury-parser

# If you're using npm
npm install @jocmp/mercury-parser

Usage

import Parser from '@jocmp/mercury-parser';

Parser.parse(url).then(result => console.log(result));

// NOTE: When used in the browser, you can omit the URL argument
// and simply run `Parser.parse()` to parse the current page.

The result looks like this:

{
  "title": "Thunder (mascot)",
  "content": "... <p><b>Thunder</b> is the <a href=\"https://en.wikipedia.org/wiki/Stage_name\">stage name</a> for the...",
  "author": "Wikipedia Contributors",
  "date_published": "2016-09-16T20:56:00.000Z",
  "lead_image_url": null,
  "dek": null,
  "next_page_url": null,
  "url": "https://en.wikipedia.org/wiki/Thunder_(mascot)",
  "domain": "en.wikipedia.org",
  "excerpt": "Thunder Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos",
  "word_count": 4677,
  "direction": "ltr",
  "total_pages": 1,
  "rendered_pages": 1
}

If Parser is unable to find a field, that field will be null in the result.
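Because any field may come back as null, code that displays a result usually needs fallbacks. The following is a minimal sketch of one way to do that, not part of Mercury Parser itself: `withFallbacks` and the placeholder strings are hypothetical, and the sample object is a trimmed copy of the result shown above.

```javascript
// Hypothetical helper (not part of Mercury Parser): replace null fields
// with display-friendly fallbacks before rendering a result.
function withFallbacks(result) {
  return {
    ...result,
    title: result.title ?? '(untitled)',
    author: result.author ?? 'Unknown author',
    // When there is no dek, fall back to the excerpt, then to an empty string.
    dek: result.dek ?? result.excerpt ?? '',
  };
}

// A trimmed-down result shaped like the example above.
const sample = {
  title: 'Thunder (mascot)',
  author: 'Wikipedia Contributors',
  lead_image_url: null,
  dek: null,
  excerpt: 'Thunder Thunder is the stage name for the horse...',
};

const article = withFallbacks(sample);
console.log(article.dek);
```

Fields without a fallback, such as `lead_image_url` here, pass through unchanged, so callers still need to check them for null.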

`parse()` Options

Content Formats

By default, Mercury Parser returns the content field as HTML. You can override this behavior by passing options to the parse function. The options specify whether or not to scrape all pages of an article and what type of output to return (valid values are 'html', 'markdown', and 'text'). For example:

Parser.parse(url, { contentType: 'markdown' }).then(result =>
  console.log(result)
);

This returns the page's content as GitHub-flavored Markdown:

"content": "...**Thunder** is the [stage name](https://en.wikipedia.org/wiki/Stage_name) for the..."
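If the format comes from user input or configuration, it can help to validate it before calling the parser. This is a small sketch under that assumption: `contentTypeOptions` is a hypothetical helper, and the list of allowed values is taken from the three listed above.

```javascript
// The three values listed above as valid for `contentType`.
const CONTENT_TYPES = ['html', 'markdown', 'text'];

// Hypothetical helper (not part of Mercury Parser): build the options
// object and reject unknown formats before any request is made.
function contentTypeOptions(contentType = 'html') {
  if (!CONTENT_TYPES.includes(contentType)) {
    throw new RangeError(
      `Unsupported contentType "${contentType}"; expected one of ${CONTENT_TYPES.join(', ')}`
    );
  }
  return { contentType };
}

const options = contentTypeOptions('markdown');
console.log(options);
// The result can then be passed along: Parser.parse(url, options)
```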

Custom Request Headers

You can include custom headers in requests by passing name-value pairs to the parse function as follows:

Parser.parse(url, {
  headers: {
    Cookie: 'name=value; name2=value2; name3=value3',
    'User-Agent':
      'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1',
  },
}).then(result => console.log(result));
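When cookies are held as an object rather than a ready-made string, the Cookie header value can be assembled first. The sketch below assumes the names and values are already cookie-safe (it does no encoding); `cookieHeader` is a hypothetical helper, not something the parser provides.

```javascript
// Hypothetical helper (not part of Mercury Parser): serialize an object
// of cookie names and values into a single Cookie header value.
// Assumes names and values need no escaping.
function cookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');
}

const headers = {
  Cookie: cookieHeader({ name: 'value', name2: 'value2', name3: 'value3' }),
};

console.log(headers.Cookie);
// The headers object can then be passed along: Parser.parse(url, { headers })
```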

Pre-fetched HTML

You can use Mercury Parser to parse custom or pre-fetched HTML by passing an HTML string to the parse function as follows:

Parser.parse(url, {
  html:
    '<html><body><article><h1>Thunder (mascot)</h1><p>Thunder is the stage name for the horse who is the official live animal mascot for the Denver Broncos</p></article></body></html>',
}).then(result => console.log(result));

Note that the URL argument is still supplied. It identifies the website and selects its custom parser, if it has one, but it is not used for fetching content.
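One way to keep the two roles separate in calling code is a thin wrapper that takes the URL and the HTML as distinct arguments. This is only a sketch: `parsePrefetched` is hypothetical, and `stubParser` is a stand-in used for illustration so the example runs without network access. In real use you would pass the imported `Parser` instead.

```javascript
// Hypothetical wrapper (not part of Mercury Parser). `parser` is any object
// with a parse(url, options) method, such as the Parser import shown under Usage.
function parsePrefetched(parser, url, html) {
  if (typeof html !== 'string' || html.trim() === '') {
    return Promise.reject(new TypeError('html must be a non-empty string'));
  }
  // The URL is still passed so a site-specific parser can be selected;
  // the html option supplies the content.
  return parser.parse(url, { html });
}

// Stand-in parser for illustration only: it echoes its arguments
// instead of extracting anything.
const stubParser = {
  parse: (url, options) => Promise.resolve({ url, receivedHtml: options.html }),
};

parsePrefetched(
  stubParser,
  'https://example.com/post',
  '<html><body><article><p>Hello</p></article></body></html>'
).then(result => console.log(result.url));
```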

The command-line parser

Mercury Parser also ships with a CLI, meaning you can use it from your command line like so:

# Install Mercury Parser globally
yarn global add @jocmp/mercury-parser
#   or
npm -g install @jocmp/mercury-parser

# Then
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source

# Pass optional --format argument to set content type (html|markdown|text)
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --format=markdown

# Pass optional --header.name=value arguments to include custom headers in the request
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --header.Cookie="name=value; name2=value2; name3=value3" --header.User-Agent="Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"

# Pass optional --extend argument to add a custom type to the response
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend credit="p:last-child em"

# Pass optional --extend-list argument to add a custom type with multiple matches
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list categories=".meta__tags-list a"

# Get the value of attributes by adding a pipe to --extend or --extend-list
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --extend-list links=".body a|href"

# Pass optional --add-extractor argument to add a custom extractor at runtime.
postlight-parser https://postlight.com/trackchanges/mercury-goes-open-source --add-extractor ./src/extractors/fixtures/postlight.com/index.js

License

Licensed under either of the below, at your preference:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

Contributing

For details on how to contribute to Mercury Parser, including how to write a custom content extractor for any site, see CONTRIBUTING.md.

Unless it is explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual licensed as above without any additional terms or conditions.
