single vs double quotes: keep original #355
Comments
Hey @fancydev18, thanks for reporting the issue.
I don't know how much manipulation is going on, but I recently found out that jsdom is extremely good at that. I was updating only attributes (form, src, href etc.), so maybe this issue doesn't exist there.
Hi @pavelloz, I'm not sure that it's possible to implement similar functionality with jsdom instead of cheerio. We're using cheerio because it has flexible selectors that we can easily configure in the config. node-website-scraper/lib/config/defaults.js Lines 5 to 35 in 9c9985b
I haven't found a way to do the same with jsdom. I will take a closer look.
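For readers who cannot see the embedded defaults.js excerpt, here is a rough sketch of what selector-based configuration looks like. It assumes the selector/attr rule shape of website-scraper's sources option. The three rules and the describeSources helper are illustrative only and are not the project's actual defaults list.

```javascript
// Illustrative rules in the { selector, attr } shape used by the `sources` option.
// Assumption: the real defaults in lib/config/defaults.js are longer than this.
const sources = [
  { selector: 'img', attr: 'src' },
  { selector: 'link[rel="stylesheet"]', attr: 'href' },
  { selector: 'script', attr: 'src' },
];

// Hypothetical helper: summarize each rule as "selector -> attribute".
function describeSources(rules) {
  return rules.map(({ selector, attr }) => `${selector} -> ${attr}`);
}

console.log(describeSources(sources).join('\n'));
```

Each rule pairs a CSS selector with the attribute that holds the resource URL, which is why a library with a flexible selector engine is convenient here.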
Hmm. jsdom actually parses the HTML and creates a virtual DOM, so you can operate on it like you do in the browser. Example:
module.exports = (document) => {
  // Select every stylesheet link in the parsed document.
  const css = document.querySelectorAll('link[rel="stylesheet"]');
  css.forEach((el) => {
    // isEligible and assetify are helper functions that are not shown here.
    if (!isEligible(el.href)) return;
    el.href = assetify(el.href);
  });
};
Here document is the HTML string (i.e. the response body) after it has been parsed by jsdom. More context:
I found out how to keep original quotes in cheerio:
I guess the linked issue mentions it, so maybe I'm not finding anything exciting, but I had issues with quotes and this solved it.
PS. It got ~9x faster just by switching from jsdom to cheerio.
Hi @pavelloz
node-website-scraper/lib/resource-handler/html/index.js Lines 82 to 87 in 6988b86
What's the current state of this issue? Is this perhaps going to be resolved in v5? If not, how can one help to fix this in v4? Edit: looks like cheeriojs/cheerio#1006 was resolved last year. Time to upgrade cheerio?
Hey @swissspidy 👋 A cheerio update is currently not planned for v5. I was waiting for cheerio version 1 (not a release candidate) to be released, but it looks like that will not happen soon. There were some attempts to update cheerio to the latest version (e.g. #461), but it is not so easy because of breaking changes: some functionality stops working and tests fail after the update. Everyone is welcome to contribute, so feel free to update cheerio and make the tests pass again. I will be happy to help with code reviews and suggestions. Otherwise I will try to do it myself when I have time.
I'm now trying to update cheerio again in #467.
The issue with single and double quotes will be fixed in the next release, v5. I hope to publish it next week.
Configuration
version: [result of npm ls website-scraper --depth 0 command] website-scraper@4.0.1
options: [provide your full options object]
Description
[Description of the issue]
Some HTML attributes have single quotes in the original page, e.g.:
When saving pages, scraper changes that to double quotes, which breaks the JSON:
Expected behavior: [What you expect to happen]
Preserve the original quoting style. The page authors know why they chose it.
Actual behavior: [What actually happens]
Single quotes are changed to double quotes
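
To make the failure mode concrete, here is a minimal sketch that uses no HTML library at all. It is not the scraper's or cheerio's code. The attribute name data-config and both function names are hypothetical, chosen only to show why blindly re-wrapping a value in double quotes breaks embedded JSON, and how a serializer could avoid that.

```javascript
// Naive serialization: always wrap in double quotes and escape nothing.
// This reproduces the breakage for values that contain double quotes.
function naiveAttr(name, value) {
  return `${name}="${value}"`;
}

// Safer sketch: use single quotes when the value contains only double quotes,
// otherwise use double quotes and escape embedded double quotes as &quot;.
function safeAttr(name, value) {
  const escaped = value.replace(/&/g, '&amp;');
  if (value.includes('"') && !value.includes("'")) {
    return `${name}='${escaped}'`;
  }
  return `${name}="${escaped.replace(/"/g, '&quot;')}"`;
}

const json = JSON.stringify({ id: 1 }); // {"id":1}
console.log(naiveAttr('data-config', json)); // data-config="{"id":1}"  (attribute ends after "{")
console.log(safeAttr('data-config', json));  // data-config='{"id":1}'
```

Both outputs of safeAttr parse back to the same attribute value in a browser. Keeping the author's original quote character is one way to get there, and escaping the inner quotes is another.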