Extract

Extract just the content from a web page.

Extract is a wrapper that turns the Mercury Parser into a web service.

Why?

Mercury already offers an API component, meant to be deployed to AWS Lambda. Extract exists as an alternative for a few reasons:

Deploy elsewhere. Extract is a vanilla Node.js app that is meant to run in a VM and has no platform-specific dependencies.
Built-in authorization system.
Performance. In my experience, running it on a VM has been faster than the Lambda version.

Here's a graph where you can see average response time drop around February 17. This is when Feedbin switched from the Lambda-hosted version to extract running on a VPS.

Installation

Install Node.js and npm.

Clone extract.
```
git clone https://github.com/feedbin/extract.git
```

Install the dependencies.
```
cd extract
npm install
```
Run the server.
```
node app/server.js
```
Alternatively, extract includes an ecosystem.config.js file for use with pm2, which is one way to run it in production.
```
npm install --global pm2
pm2 start ecosystem.config.js
```

Usage

Extract has a simple, file-based system for creating users and secret keys. This allows users to be added or removed while the system is running. In the ./users directory, each filename is a username and the file's contents are that user's secret key. To make a new user, run the following:

```
cd extract
mkdir users

# use your own secret key and username
echo "SECRET_KEY" > users/USERNAME
```
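
If you would rather generate the secret key than invent one, the same step can be scripted. This is a minimal sketch, not part of extract: `create_extract_user` and its `users_dir` parameter are hypothetical names, and it assumes a 64-character random hex string is an acceptable key. It writes a trailing newline so the file matches what the `echo` command above produces.

```ruby
require "fileutils"
require "securerandom"

# Hypothetical helper (not part of extract): writes users/<username>
# containing a freshly generated secret key and returns that key.
def create_extract_user(username, users_dir: "users")
  FileUtils.mkdir_p(users_dir)

  # Assumption: any sufficiently random string works as a secret key.
  secret = SecureRandom.hex(32) # 32 random bytes as 64 hex characters

  path = File.join(users_dir, username)
  File.write(path, secret + "\n") # trailing newline, same as `echo`
  File.chmod(0o600, path)         # keep the key readable by the owner only
  secret
end
```

Calling `create_extract_user("USERNAME")` from the extract directory returns the key to configure in your client.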

Once a username and secret key have been created, you can make a request.

An example request looks like:

```
http://localhost:3000/parser/:username/:signature?base64_url=:base64_url
```

The parts that you need are:

username: your username

signature: the hexadecimal HMAC-SHA1 signature of the URL you want to parse, keyed with your secret key

base64_url: the base64-encoded version of the URL you want to parse

The URL is base64-encoded to avoid any issues in the way different systems encode URLs. It must use the RFC 4648 URL-safe variant with no newlines.

If your platform does not offer a URL-safe base64 option, you can replicate it. First create the standard base64-encoded string. Then replace the following characters:

```
+ => -
/ => _
\n => ""
```
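
The replacements above can be sketched in Ruby as follows. `manual_urlsafe_base64` is a hypothetical helper name used only for illustration. It starts from `Base64.encode64`, which uses the standard alphabet and inserts newlines, and the last line checks the result against Ruby's built-in `Base64.urlsafe_encode64`.

```ruby
require "base64"

# Hypothetical helper: derives URL-safe base64 from the standard
# encoding by applying the three replacements listed above.
def manual_urlsafe_base64(url)
  Base64.encode64(url) # standard alphabet, with line breaks
    .tr("+/", "-_")    # + => - and / => _
    .delete("\n")      # strip the newlines
end

url = "https://feedbin.com/blog/2018/09/11/private-by-default/"
puts manual_urlsafe_base64(url)
puts manual_urlsafe_base64(url) == Base64.urlsafe_encode64(url)
```

The padding characters (`=`) are kept, which matches the sample output in this README.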

Here's a sample implementation in Ruby. You can use it as a reference when checking your own implementation.

```
require "uri"
require "openssl"
require "base64"

username = "username"
secret = "secret"
host = "localhost"
port = 3000
url = "https://feedbin.com/blog/2018/09/11/private-by-default/"

# Sign the raw URL (not the base64 version) with the user's secret key
digest = OpenSSL::Digest.new("sha1")
signature = OpenSSL::HMAC.hexdigest(digest, secret, url)

# URL-safe base64 with no newlines
base64_url = Base64.urlsafe_encode64(url).gsub("\n", "")

# Assemble the request URL
URI::HTTPS.build({
  host: host,
  port: port,
  path: "/parser/#{username}/#{signature}",
  query: "base64_url=#{base64_url}"
}).to_s
```

The above example would produce:

```
https://localhost:3000/parser/username/e4696f8630bb68c21d77a9629ce8d063d8e5f81c?base64_url=aHR0cHM6Ly9mZWVkYmluLmNvbS9ibG9nLzIwMTgvMDkvMTEvcHJpdmF0ZS1ieS1kZWZhdWx0Lw==
```

Requesting that URL returns a JSON response like:

```
{
    "title": "Private by Default",
    "author": null,
    "date_published": "2018-09-11T00:00:00.000Z",
    "dek": null,
    "lead_image_url": "https://assets.feedbin.com/assets-site/blog/2018-09-11/embed-3f43088538ae5ed7e585c00013adc13a915fd35de31990b3081a085b963ed7dd.png",
    "content": "<div>content</div>",
    "next_page_url": null,
    "url": "https://feedbin.com/blog/2018/09/11/private-by-default/",
    "domain": "feedbin.com",
    "excerpt": "September 11, 2018 by Ben Ubois I want Feedbin to be the opposite of Big Social. I think people should have the right not to be tracked on the Internet and Feedbin can help facilitate that. Since&hellip;",
    "word_count": 787,
    "direction": "ltr",
    "total_pages": 1,
    "rendered_pages": 1
}
```
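
To go from building the URL to reading the response, a client might look roughly like the sketch below. `extract_uri` and `fetch_article` are hypothetical helper names, not part of extract. The sketch assumes the server is reachable over plain HTTP on localhost:3000, as in the example request pattern above; swap in `URI::HTTPS` if your deployment sits behind TLS.

```ruby
require "base64"
require "json"
require "net/http"
require "openssl"
require "uri"

# Hypothetical helper: builds the signed request URI for a page URL.
def extract_uri(url, username:, secret:, host: "localhost", port: 3000)
  # The signature is the hex HMAC-SHA1 of the raw URL, keyed with the secret.
  signature = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("sha1"), secret, url)

  URI::HTTP.build(
    host: host,
    port: port,
    path: "/parser/#{username}/#{signature}",
    query: "base64_url=#{Base64.urlsafe_encode64(url)}"
  )
end

# Hypothetical helper: requests the page and returns the parsed JSON as a Hash.
def fetch_article(url, **options)
  response = Net::HTTP.get_response(extract_uri(url, **options))
  unless response.is_a?(Net::HTTPSuccess)
    raise "extract returned HTTP #{response.code}"
  end

  JSON.parse(response.body)
end
```

With a server running and a matching file in ./users, `fetch_article(url, username: "username", secret: "secret")` returns a Hash whose keys mirror the JSON above, such as `"title"` and `"content"`.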
