feat: allow passing htmlDom #711

Merged (4 commits) on Jun 24, 2024
6 changes: 6 additions & 0 deletions README.md
@@ -264,6 +264,12 @@ Type: `String`

The HTML markup for extracting the content.

+ ##### htmlDom
+
+ Type: `object`
+
+ The DOM representation of the HTML markup. When it's not provided, it's built from the `html` parameter.

#### rules

Type: `Array`
3 changes: 2 additions & 1 deletion packages/metascraper/src/index.js
@@ -14,6 +14,7 @@ module.exports = rules => {
return async ({
url,
html = '',
+ htmlDom,
rules: inlineRules,
validateUrl = true,
...props
@@ -27,7 +28,7 @@ module.exports = rules => {

return getData({
url,
- htmlDom: load(html, { baseURI: url }),
+ htmlDom: htmlDom ?? load(html, { baseURI: url }),
rules: mergeRules(inlineRules, loadedRules),
...props
})
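
The fallback introduced in this hunk can be illustrated in isolation. The following is a minimal sketch, not the library's code: `load` here is a hypothetical stub standing in for cheerio's `load`, and `resolveHtmlDom` is an illustrative helper name that does not exist in metascraper. It only shows how `??` picks the caller's `htmlDom` when one is supplied and otherwise parses `html`.

```javascript
// Hypothetical stand-in for cheerio's `load`, so the sketch runs without dependencies.
const load = (html, { baseURI } = {}) => ({ parsedFrom: html, baseURI })

// Mirrors the changed line: prefer a caller-supplied DOM, otherwise parse `html`.
// `resolveHtmlDom` is an illustrative name, not part of metascraper.
const resolveHtmlDom = ({ url, html = '', htmlDom }) =>
  htmlDom ?? load(html, { baseURI: url })

const url = 'https://microlink.io'
const prebuilt = { parsedFrom: '<title>htmlDom</title>' }

// A supplied `htmlDom` wins even when `html` is also passed.
const fromDom = resolveHtmlDom({ url, html: '<title>Original HTML</title>', htmlDom: prebuilt })

// Without `htmlDom` (undefined or null), `html` is parsed as before.
const fromHtml = resolveHtmlDom({ url, html: '<title>Original HTML</title>' })
const fromNull = resolveHtmlDom({ url, html: '<title>Original HTML</title>', htmlDom: null })

console.log(fromDom === prebuilt) // true
console.log(fromHtml.parsedFrom)  // <title>Original HTML</title>
console.log(fromNull.baseURI)     // https://microlink.io
```

Because `??` falls back only on `null` or `undefined`, callers that omit `htmlDom` keep the previous behavior, while callers that already hold a parsed document can skip a second parse.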
1,300 changes: 202 additions & 1,098 deletions packages/metascraper/test/integration/bfi/input.html

Large diffs are not rendered by default.

@@ -9,13 +9,13 @@ Generated by [AVA](https://avajs.dev).
> Snapshot 1

{
- audio: 'https://podcast-stream.wbez.org/recast/the-pie/20210113155621-ThePie-E04.mp3',
+ audio: null,
author: null,
- date: '2021-01-14T03:00:06.000Z',
+ date: '2021-02-12T16:09:42.000Z',
description: 'The expanding market influence of tech companies has sparked new fear of an old economic problem – monopoly power. In this episode, Eric Posner and Chad Syverson discuss whether these […]',
image: 'https://bfi.uchicago.edu/wp-content/uploads/2018/11/pie-web-banner_6.png',
lang: 'en',
- logo: 'https://bfi.uchicago.edu/wp-content/uploads/2019/03/favicon-228.png',
+ logo: 'https://bfi.uchicago.edu/wp-content/uploads/2024/03/BFI-Core-Logo-RGB.svg',
publisher: 'BFI',
title: 'The Big Tech Threat? | BFI',
url: 'https://bfi.uchicago.edu/podcast/the-big-tech-threat/',
Binary file not shown.
1,515 changes: 1,053 additions & 462 deletions packages/metascraper/test/integration/los-angeles-times/input.html

Large diffs are not rendered by default.

@@ -10,14 +10,14 @@ Generated by [AVA](https://avajs.dev).

{
audio: null,
- author: 'Los Angeles Times',
+ author: 'Tracey Lien',
date: '2016-05-02T10:03:18.000Z',
- description: 'Tech start-up Appthority’s office has plush conference rooms, soundproof phone booths, an enormous kitchen and a view of San Francisco Bay. It has ping-pong and foosball tables, beer on tap and 11 types of tea.',
- image: 'http://www.trbimg.com/img-572421a4/turbine/la-fi-tn-tech-downturn-20160429',
+ description: 'Tech start-up Appthority’s office has plush conference rooms, soundproof phone booths, an enormous kitchen and a view of San Francisco Bay.',
+ image: 'https://ca-times.brightspotcdn.com/dims4/default/78d090f/2147483647/strip/true/crop/2048x1075+0+146/resize/1200x630!/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F07%2F12%2F36025c1a8b6f34ec37234f02980c%2Fla-la-fi-adv-start-up-funding007-jpg-20160429',
lang: 'en',
- logo: 'https://www.latimes.com/favicon.ico',
- publisher: 'latimes.com',
+ logo: 'https://www.latimes.com/apple-touch-icon.png',
+ publisher: 'Los Angeles Times',
title: 'As venture capital dries up, tech start-ups discover frugality',
- url: 'http://www.latimes.com/business/technology/la-fi-tn-tech-downturn-20160429-story.html',
+ url: 'https://www.latimes.com/business/technology/la-fi-tn-tech-downturn-20160429-story.html',
video: null,
}
Binary file not shown.
16 changes: 13 additions & 3 deletions packages/metascraper/test/unit/interface.js
@@ -4,6 +4,7 @@ const test = require('ava')

const createMetascraper = require('../..')
const titleRules = require('metascraper-title')()
+ const { load } = require('cheerio')

test('`url` is required', async t => {
t.plan(9)
@@ -31,7 +32,7 @@ test('`url` is required', async t => {
}
})

- test('Disable URL validation using `validateUrl`', async t => {
+ test('passing `{ validateUrl: false }`', async t => {
const metascraper = createMetascraper([titleRules])

const html = `
@@ -66,7 +67,7 @@ test('Disable URL validation using `validateUrl`', async t => {
t.is(metadata.title, 'Document')
})

- test('load extra `rules`', async t => {
+ test('passing `rules`', async t => {
const url = 'https://microlink.io'

const html = `
@@ -104,7 +105,7 @@ test('load extra `rules`', async t => {
t.is(metadata.foo, 'bar')
})

- test('associate test function with rules', async t => {
+ test('skip `rules` via `test` function', async t => {
const url = 'https://microlink.io'

const html = `
@@ -148,3 +149,12 @@ test('associate test function with rules', async t => {
t.is(metadata.foo, null)
t.true(isCalled)
})

+ test('passing `htmlDom`', async t => {
+ const url = 'https://microlink.io'
+ const htmlDom = load('<title>htmlDom</title>')
+ const html = '<title>Original HTML</title>'
+ const metascraper = createMetascraper([titleRules])
+ const metadata = await metascraper({ url, htmlDom, html })
+ t.is(metadata.title, 'htmlDom')
+ })