feat: pagesix parser #97

janetleekim · 2016-12-20T17:26:30Z

Tests are passing but no image appeared on the preview. Not sure if
that’s an issue.


kev5873 · 2017-02-03T19:41:03Z

src/extractors/custom/pagesix.com/index.js

+    // Is there anything in the content you selected that needs transformed
+    // before it's consumable content? E.g., unusual lazy loaded images
+    transforms: {
+      '#featured-image-wrapper': 'figure',


Same issue as the other PR where I'd like @adampash to verify.

adampash · 2017-02-07

Somehow I missed this. What did you want me to verify, @kev5873?

kev5873 · 2017-02-03T19:46:02Z

src/extractors/custom/pagesix.com/index.js

+  domain: 'pagesix.com',
+
+  supportedDomains: [
+    'nypost.com',


It also appears that nypost.com has a very similar layout, and appears to be compatible with this one, since they are in the same network of sites.
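To illustrate what listing nypost.com under supportedDomains is meant to achieve, here is a minimal sketch of a hostname lookup. The helpers buildRegistry and extractorFor are hypothetical names made up for this example, not functions from the parser itself, and the lookup logic is an assumption about how such a registry could work.

```javascript
// Trimmed-down extractor object; only the two fields from the diff above.
const pagesixExtractor = {
  domain: 'pagesix.com',
  supportedDomains: ['nypost.com'],
};

// Hypothetical helper: index each extractor under its primary domain
// and under every additional supported domain.
function buildRegistry(extractors) {
  const registry = new Map();
  for (const extractor of extractors) {
    const hosts = [extractor.domain, ...(extractor.supportedDomains || [])];
    for (const host of hosts) registry.set(host, extractor);
  }
  return registry;
}

// Hypothetical helper: resolve a URL to an extractor, ignoring a leading "www.".
function extractorFor(registry, url) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  return registry.get(host) || null;
}

const registry = buildRegistry([pagesixExtractor]);
```

Under these assumptions, a nypost.com URL and a pagesix.com URL resolve to the same extractor object, while an unrelated host resolves to null.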

adampash

A few small optimizations.

adampash · 2017-02-06T22:11:05Z

src/extractors/custom/pagesix.com/index.js

+  domain: 'pagesix.com',
+
+  supportedDomains: [
+    'nypost.com',


adampash · 2017-02-06T22:16:32Z

src/extractors/custom/pagesix.com/index.js

+      '.modal-trigger',
+      '.wp-caption-text',
+    ],
+  },


Here's how I'd re-write the content extractor:

    content: {
      selectors: [
        ['#featured-image-wrapper', '.entry-content'],
        '.entry-content',
      ],

      // Is there anything in the content you selected that needs transformed
      // before it's consumable content? E.g., unusual lazy loaded images
      transforms: {
        '#featured-image-wrapper': 'figure',
        '.wp-caption-text': 'figcaption',
      },

      // Is there anything that is in the result that shouldn't be?
      // The clean selectors will remove anything that matches from
      // the result
      clean: [
        '.modal-trigger',
      ],
    },

So, bit by bit:

    selectors: [
      ['#featured-image-wrapper', '.entry-content'],
      '.entry-content',
    ],

First, by focusing on .entry-content instead of .article-header, you don't have to clean nearly as much from the page (see the much smaller clean section). You can still get the lead image by using a multi-element selector for the content: this says get the #featured-image-wrapper, then get the .entry-content, and put them together. If the page has no #featured-image-wrapper, the plain .entry-content selector that follows acts as the fallback.
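The fallback behaviour of a multi-element selector can be sketched roughly as below. This is a simplified model, not the parser's actual implementation: the page is faked as a plain object mapping selectors to HTML fragments, and selectContent is a hypothetical helper written only for this example.

```javascript
// Toy stand-in for a parsed page: selector -> HTML fragments it matches.
// Purely illustrative; a real extractor would query a DOM.
const pageWithImage = {
  '#featured-image-wrapper': ['<div id="featured-image-wrapper"><img src="lead.jpg"></div>'],
  '.entry-content': ['<div class="entry-content"><p>Story text</p></div>'],
};
const pageWithoutImage = {
  '.entry-content': ['<div class="entry-content"><p>Story text</p></div>'],
};

// Hypothetical helper: try each entry in order. An array entry matches only
// if every selector in it matches; its fragments are then concatenated.
function selectContent(page, selectors) {
  for (const entry of selectors) {
    const parts = Array.isArray(entry) ? entry : [entry];
    if (parts.every(sel => (page[sel] || []).length > 0)) {
      return parts.map(sel => page[sel].join('')).join('');
    }
  }
  return null; // nothing matched
}

const selectors = [
  ['#featured-image-wrapper', '.entry-content'],
  '.entry-content',
];

const withImage = selectContent(pageWithImage, selectors);
const withoutImage = selectContent(pageWithoutImage, selectors);
```

In this model, withImage contains both the lead image and the article body, while withoutImage falls through to the second entry and contains the body alone.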

    transforms: {
      '#featured-image-wrapper': 'figure',
      '.wp-caption-text': 'figcaption',
    },

This transform just changes div#featured-image-wrapper to a figure, which will then work correctly. We can also easily transform div.wp-caption-text into a figcaption, so the caption no longer needs to be cleaned out.
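As a rough picture of what a selector-to-tag transform does, here is a small sketch on a toy node tree. The names matches and applyTransforms are hypothetical, the matcher only understands '#id' and '.class' selectors, and none of this is the parser's real transform code.

```javascript
// Toy node tree; a real extractor would operate on a parsed DOM instead.
const tree = {
  tag: 'div', id: 'featured-image-wrapper', classes: [],
  children: [
    { tag: 'img', id: '', classes: [], children: [] },
    { tag: 'div', id: '', classes: ['wp-caption-text'], children: [] },
  ],
};

// Hypothetical matcher that understands only '#id' and '.class' selectors
// (anything else is compared against the tag name).
function matches(node, selector) {
  if (selector.startsWith('#')) return node.id === selector.slice(1);
  if (selector.startsWith('.')) return node.classes.includes(selector.slice(1));
  return node.tag === selector;
}

// Return a copy of the tree in which every node matching a transform key
// has its tag replaced by the mapped tag name. The input is not mutated.
function applyTransforms(node, transforms) {
  const hit = Object.keys(transforms).find(sel => matches(node, sel));
  return {
    ...node,
    tag: hit ? transforms[hit] : node.tag,
    children: node.children.map(child => applyTransforms(child, transforms)),
  };
}

const result = applyTransforms(tree, {
  '#featured-image-wrapper': 'figure',
  '.wp-caption-text': 'figcaption',
});
```

Here the wrapper div becomes a figure and the caption div becomes a figcaption, while the img node is left untouched.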


dviramontes

lgtm!

janetleekim and others added 2 commits December 20, 2016 12:26

feat: pagesix parser

dce0ec6


Merge branch 'master' into feat-pagesix-extractor

3ed26fd

spiffytoy added custom parser needs-review and removed custom parser needs-review labels Jan 31, 2017

kev5873 self-assigned this Feb 3, 2017

kev5873 added 2 commits February 3, 2017 13:49

Merge branch 'master' into feat-pagesix-extractor

49f8104

fix: adds lead image in, adds support for nypost.com

ff9c92e

kev5873 requested review from adampash and dviramontes February 3, 2017 19:40

kev5873 reviewed Feb 3, 2017


kev5873 assigned adampash Feb 6, 2017

fix: update transform

82d9638

kev5873 force-pushed the feat-pagesix-extractor branch from 7798ac4 to 82d9638 Compare February 6, 2017 20:00

Merge branch 'master' into feat-pagesix-extractor

c23af7f

adampash suggested changes Feb 6, 2017


kev5873 added 3 commits February 6, 2017 17:46

fix: use multi-selectors

fd0c775

Merge branch 'master' into feat-pagesix-extractor

fc7e5da

Merge branch 'feat-pagesix-extractor' of github.com:postlight/mercury-parser into feat-pagesix-extractor

85e62bf

dviramontes approved these changes Feb 7, 2017


kev5873 added 4 commits February 7, 2017 13:30

Merge branch 'master' into feat-pagesix-extractor

d667dce

Merge branch 'master' into feat-pagesix-extractor

7e5314e

Merge branch 'master' into feat-pagesix-extractor

f93a1ab

Merge branch 'master' into feat-pagesix-extractor

e3f4174

kev5873 merged commit beb0b89 into master Feb 7, 2017

kev5873 deleted the feat-pagesix-extractor branch February 7, 2017 22:38

ftrain unassigned kev5873 May 3, 2017


janetleekim commented Dec 20, 2016

kev5873 Feb 3, 2017

adampash Feb 7, 2017

kev5873 Feb 3, 2017

adampash Feb 6, 2017

adampash left a comment

adampash Feb 6, 2017

adampash Feb 6, 2017

dviramontes left a comment
