single vs double quotes: keep original #355

Closed
fancydev18 opened this issue Jun 5, 2019 · 10 comments

@fancydev18

Configuration

version: [result of npm ls website-scraper --depth 0 command]
website-scraper@4.0.1

options: [provide your full options object]

const options = {
    urls: ['https://htmlstream.com/preview/front-v2.7.0/html/pages/pricing.html'],
    directory: 'scrap-1/',
    subdirectories: [
        { directory: 'img', extensions: ['.jpg', '.jpeg', '.png', '.svg', '.gif'] },
        { directory: 'js', extensions: ['.js'] },
        { directory: 'css', extensions: ['.css'] }
    ]
};

Description


Some HTML attributes are wrapped in single quotes in the original page, e.g.:

<div class="js-slick-carousel u-slick u-slick--gutters-2 z-index-2"
                 data-slides-show="4"
                 data-adaptive-height="true"
                 data-slides-scroll="1"
                 data-pagi-classes="d-lg-none text-center u-slick__pagination mt-7 mb-0"
                 data-responsive='[{
                   "breakpoint": 1200,
                   "settings": {
                     "slidesToShow": 3
                   }
                 }, ...]'>

When saving pages, the scraper rewrites the attribute with double quotes. The value then ends at the first inner double quote, which breaks the JSON:


<div class="js-slick-carousel u-slick u-slick--gutters-2 z-index-2" data-slides-show="4" data-adaptive-height="true" data-slides-scroll="1" data-pagi-classes="d-lg-none text-center u-slick__pagination mt-7 mb-0" data-responsive="[{
                   "breakpoint": 1200,
                   "settings": {
                     "slidesToShow": 3
                   }
                 }, ...

Expected behavior: [What you expect to happen]

Preserve the original quoting style. The page authors chose it for a reason.

Actual behavior: [What actually happens]

Single quotes are changed to double quotes
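
To illustrate why the quote swap matters, here is a minimal sketch of quote-aware attribute serialization. It is not website-scraper's or cheerio's code; serializeAttr is a hypothetical helper, and it only handles quoting (the value is treated as raw attribute text, so other characters are left untouched).

```javascript
// Hypothetical helper, not part of website-scraper or cheerio.
// Picks a quoting style that keeps the attribute value intact when re-parsed.
function serializeAttr(name, value) {
  if (!value.includes('"')) {
    // No double quotes inside: the usual double-quoted form is safe.
    return `${name}="${value}"`;
  }
  if (!value.includes("'")) {
    // Double quotes only (e.g. inline JSON): wrap in single quotes.
    return `${name}='${value}'`;
  }
  // Both kinds present: wrap in double quotes and escape the inner ones.
  return `${name}="${value.replace(/"/g, '&quot;')}"`;
}

// JSON similar to the data-responsive value from the report.
const json = JSON.stringify([{ breakpoint: 1200, settings: { slidesToShow: 3 } }]);
const attr = serializeAttr('data-responsive', json);
console.log(attr);

// The value between the single quotes is still valid JSON.
const roundTripped = JSON.parse(attr.slice(attr.indexOf("'") + 1, -1));
console.log(roundTripped[0].settings.slidesToShow);
```

Wrapping the same JSON in plain double quotes without escaping would end the attribute at the first inner quote, which is the breakage described above.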

@s0ph1e
Member

s0ph1e commented Jun 6, 2019

Hey @fancydev18

Thanks for reporting the issue.

website-scraper uses cheerio to manipulate DOM elements, and cheeriojs/cheerio#1006 causes this behavior.

@s0ph1e s0ph1e added the bug label Jun 6, 2019
@pavelloz

pavelloz commented Apr 3, 2020

I don't know how much manipulation is going on, but I recently found out that jsdom is extremely good at that.

I was only updating attributes (form, src, href, etc.), but maybe this issue doesn't exist there.

@s0ph1e
Member

s0ph1e commented Apr 10, 2020

Hi @pavelloz
Thank you for the suggestion.

I'm not sure it's possible to implement similar functionality with jsdom instead of cheerio.

We're using cheerio because it has flexible selectors that we can easily set in the config.
For example:

{ selector: 'style' },
{ selector: '[style]', attr: 'style' },
{ selector: 'img', attr: 'src' },
{ selector: 'img', attr: 'srcset' },
{ selector: 'input', attr: 'src' },
{ selector: 'object', attr: 'data' },
{ selector: 'embed', attr: 'src' },
{ selector: 'param[name="movie"]', attr: 'value' },
{ selector: 'script', attr: 'src' },
{ selector: 'link[rel="stylesheet"]', attr: 'href' },
{ selector: 'link[rel*="icon"]', attr: 'href' },
{ selector: 'svg *[xlink\\:href]', attr: 'xlink:href' },
{ selector: 'svg *[href]', attr: 'href' },
{ selector: 'picture source', attr: 'srcset' },
{ selector: 'meta[property="og\\:image"]', attr: 'content' },
{ selector: 'meta[property="og\\:image\\:url"]', attr: 'content' },
{ selector: 'meta[property="og\\:image\\:secure_url"]', attr: 'content' },
{ selector: 'meta[property="og\\:audio"]', attr: 'content' },
{ selector: 'meta[property="og\\:audio\\:url"]', attr: 'content' },
{ selector: 'meta[property="og\\:audio\\:secure_url"]', attr: 'content' },
{ selector: 'meta[property="og\\:video"]', attr: 'content' },
{ selector: 'meta[property="og\\:video\\:url"]', attr: 'content' },
{ selector: 'meta[property="og\\:video\\:secure_url"]', attr: 'content' },
{ selector: 'video', attr: 'src' },
{ selector: 'video source', attr: 'src' },
{ selector: 'video track', attr: 'src' },
{ selector: 'audio', attr: 'src' },
{ selector: 'audio source', attr: 'src' },
{ selector: 'audio track', attr: 'src' },
{ selector: 'frame', attr: 'src' },
{ selector: 'iframe', attr: 'src' }

I didn't find a way to do the same with jsdom. I will take a closer look.

@pavelloz

pavelloz commented Apr 10, 2020

Hmm. jsdom actually parses the HTML and creates a virtual DOM, so you can operate on it like you do in the browser.

Example:

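// isEligible and assetify are helpers that are not shown in this snippet.
module.exports = (document) => {
  const css = document.querySelectorAll('link[rel="stylesheet"]');

  css.forEach((el) => {
    // Leave links that should not be rewritten untouched.
    if (!isEligible(el.href)) return;

    el.href = assetify(el.href);
  });
};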

Where document is an HTML string (i.e. the response body) parsed by jsdom.

More context:
https://github.com/mdyd-dev/posify/blob/master/src/lib/replace-urls/index.js
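
As a rough sketch of how the { selector, attr } rules quoted earlier could be driven through the standard DOM API: the helper names applyRules and makeStubElement are hypothetical, and a tiny stub stands in for the document so the snippet runs without any dependency. This is an assumption about one possible approach, not how website-scraper or posify is implemented.

```javascript
// A few of the { selector, attr } rules quoted earlier in the thread.
const rules = [
  { selector: 'img', attr: 'src' },
  { selector: 'link[rel="stylesheet"]', attr: 'href' },
  { selector: 'script', attr: 'src' },
];

// Hypothetical helper: passes every matching attribute value through `rewrite`.
// `document` is any object that exposes the standard querySelectorAll method.
function applyRules(document, rules, rewrite) {
  let changed = 0;
  for (const { selector, attr } of rules) {
    // Rules without attr (e.g. 'style') target element text; not covered here.
    if (!attr) continue;
    for (const el of document.querySelectorAll(selector)) {
      const oldValue = el.getAttribute(attr);
      if (oldValue === null) continue;
      const newValue = rewrite(oldValue);
      if (newValue !== oldValue) {
        el.setAttribute(attr, newValue);
        changed += 1;
      }
    }
  }
  return changed;
}

// Stub element and document, used only to keep this example self-contained.
function makeStubElement(attrs) {
  return {
    getAttribute: (name) => (name in attrs ? attrs[name] : null),
    setAttribute: (name, value) => { attrs[name] = value; },
  };
}

const img = makeStubElement({ src: '/a.png' });
const stubDocument = {
  querySelectorAll: (selector) => (selector === 'img' ? [img] : []),
};

const changed = applyRules(stubDocument, rules, (url) => `img${url}`);
console.log(changed, img.getAttribute('src'));
```

With jsdom, the stub could be replaced by new JSDOM(html).window.document, since that object provides the same querySelectorAll, getAttribute and setAttribute methods.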

@pavelloz

pavelloz commented Jun 20, 2020

I found out how to keep the original quotes in cheerio:

  const $ = cheerio.load(fileContent, { decodeEntities: false });

I guess the linked issue mentions it, so maybe I'm not finding anything exciting, but I had issues with quotes and this solved it.

PS.
Never mind about jsdom. I just had to switch to cheerio in my project because jsdom was so slow on large sites.

cheerio
    posify urls  7.52s user 0.67s system 150% cpu 5.431 total 
jsdom
    posify urls  55.13s user 1.97s system 129% cpu 44.181 total

~8x faster in wall-clock time (44.181s vs 5.431s) just by switching from jsdom to cheerio.

@aivus
Member

aivus commented Jun 20, 2020

Hi @pavelloz

decodeEntities: false is already used by the package:

function loadTextToCheerio (text) {
  return cheerio.load(text, {
    decodeEntities: false,
    lowerCaseAttributeNames: false,
  });
}

@swissspidy

swissspidy commented Dec 16, 2021

What's the current state of this issue? Is this perhaps going to be resolved in v5? If not, how can one help to fix this in v4?

Edit: looks like cheeriojs/cheerio#1006 was resolved last year. Time to upgrade cheerio?

@s0ph1e
Member

s0ph1e commented Dec 24, 2021

Hey @swissspidy 👋

Currently a cheerio update is not planned for v5.

I was waiting for cheerio version 1 (not a release candidate) to be released, but it looks like that will not happen soon.

There were some attempts to update cheerio to the latest version (e.g. #461), but it looks like it's not so easy because of breaking changes: some functionality stops working and tests fail after the update.

Everyone is welcome to contribute: feel free to update cheerio and make the tests pass again. I will be happy to help with code reviews and suggestions. Otherwise I will try to do it myself when I have time.

@s0ph1e
Member

s0ph1e commented Dec 24, 2021

I'm now trying to update cheerio again in #467.
So far the upgrade looks fine; hopefully it fixes this bug.

s0ph1e added a commit that referenced this issue Dec 24, 2021
@s0ph1e s0ph1e closed this as completed in 0d5e8a2 Dec 24, 2021
@s0ph1e
Member

s0ph1e commented Dec 24, 2021

The issue with single and double quotes will be fixed in the next release, v5. I hope to publish it next week.

@s0ph1e s0ph1e self-assigned this Dec 24, 2021