Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to latest version of cheerio to avoid vulnerability in nth-check library #461

Closed
wants to merge 1 commit into from

Conversation

brianlukoff
Copy link

This addresses a security vulnerability with the nth-check library that the earlier version of cheerio depended on.

@s0ph1e
Copy link
Member

s0ph1e commented Oct 15, 2021

Hi @brianlukoff
Thank you for your contribution

As I can see you're updating major version of the library which most probably contains breaking changed. Did you try to run tests npm test locally? Do they pass?

I see that tests on windows (appveyor) fail

@brianlukoff
Copy link
Author

They are all passing for me:

vader:node-website-scraper brian$ npm test

> website-scraper@4.2.3 test /private/tmp/node-website-scraper
> c8 --all --reporter=text --reporter=lcov mocha --recursive --timeout 7000 ./test/unit/ ./test/functional && npm run eslint



  FilenameGenerator: bySiteStructure
    ✓ should return the normalized relative path of the resource url
    ✓ should remove . and .. from path
    ✓ should replace the colon, for url with port number
    ✓ should replace not allowed characters from filename
    ✓ should replace not allowed characters from path
    ✓ should add the defaultFilename to the path, for html resources without extension
    ✓ should add the defaultFilename to the path, for html resources with wrong extension
    ✓ should normalize to safe relative paths, without ..
    ✓ should shorten filename
    ✓ should shorten filename if resource is html without ext and default name is too long
    ✓ should return decoded filepath
    ✓ should keep query strings

  FilenameGenerator: byType
    ✓ should return resource filename
    ✓ should return url-based filename if resource has no filename
    ✓ should return url-based filename if resource has empty filename
    ✓ should add missed extensions for html resources
    ✓ should add missed extensions for css resources
    ✓ should add missed extensions for js resources
    ✓ should not add missed extensions for other resources
    ✓ should return filename with correct subdirectory
    ✓ should return filename with correct subdirectory when string cases are different
    ✓ should return different filename if desired filename is occupied
    ✓ should return different filename if desired filename is occupied N times
    ✓ should shorten filename
    ✓ should return different short filename if first short filename is occupied
    ✓ should return decoded url-based filename
    ✓ should remove not allowed characters from filename

  request
    ✓ should call request with correct params
    ✓ should add referer header if referer param was passed
    ✓ should call afterResponse with correct params
    ✓ should return object with url, body, mimeType properties
    ✓ should return mimeType = null if content-type header was not found in response
    transformResult from afterResponse
      ✓ should return object with body and metadata properties
      ✓ should return with metadata == null if metadata is not defined
      ✓ should transform string result
      ✓ should be rejected if wrong result (no string nor object) returned

  ResourceHandler: Css
    ✓ should call downloadChildrenResources and set returned text to resource

  ResourceHandler: Html
    ✓ should not encode text to html entities
    ✓ should not update attribute names to lowercase
    ✓ should call downloadChildrenResources for each source
    ✓ should not call downloadChildrenResources if source attr is empty
    ✓ should use correct path containers based on tag
    ✓ should remove SRI check for loaded resources
    ✓ should use html entities for updated attributes
    constructor
      sources
        ✓ should initialize sources if updateMissingSources was not passed
        ✓ should initialize sources if updateMissingSources = false
        ✓ should initialize sources if updateMissingSources = true
        ✓ should initialize sources if updateMissingSources is array of sources
        ✓ should initialize sources without duplicates if updateMissingSources is array of sources
    <base> tag
      ✓ should remove base tag from text and update resource url for absolute href
      ✓ should remove base tag from text and update resource url for relative href
      ✓ should not remove base tag if it doesn't have href attribute

  ResourceHandler
    constructor
      ✓ should pick supported options
      ✓ should set requestResource
      ✓ should init specific resource handlers
    #getResourceHandler
      ✓ should return css handler if file has css type
      ✓ should return html handler if file has html type
      ✓ should return null if file has other type
    #handleResource
      ✓ should call getResourceHandler and execute specific resource handler
      ✓ should call getResourceHandler and return resolved promise if no specific handler found
    #downloadChildrenResources
      ✓ should not call requestResource if no paths in text
      ✓ should call requestResource once with correct params
      ✓ should call requestResource for each found source with correct params
      ✓ should update paths in text with local files returned by requestResource
      ✓ should not update paths in text, for which requestResource returned null
      ✓ should wait for all children promises fulfilled and then return updated text
      hash in urls
        ✓ should keep hash in urls
      prettifyUrls
        ✓ should not prettifyUrls by default
        ✓ should prettifyUrls if specified

  PathsContainer: CssText
    constructor
      ✓ should set text to empty string if nothing passed
    #getPaths
      ✓ should return paths from url()
      ✓ should return paths from @import
      ✓ should return paths from @import url()
    #updateText
      ✓ should update paths in text
      ✓ should update all duplicated paths in text
      ✓ should update only completely equal paths (should not update partially matched)
      ✓ should update only specified paths

  PathsContainer: HtmlCommonTag
    constructor
      ✓ should set text to empty string if nothing passed
    #getPaths
      ✓ should return paths
      ✓ should not return path with same-page id
      ✓ should return path with other-page id
      ✓ should not return path is uri schema is not supported (mailto: skype: etc)
    #updateText
      ✓ should update paths in text

  PathsContainer: HtmlImgSrcSetTag
    constructor
      ✓ should set text to empty string if nothing passed
    #getPaths
      ✓ should return paths from srcset
    #updateText
      ✓ should update paths in text
      ✓ should update all duplicated paths in text
      ✓ should update only specified paths

  Resource
    #createChild
      ✓ should return Resource
      ✓ should set correct url and filename
      ✓ should set parent
      ✓ should set depth

  Scraper initialization
    defaultFilename
      ✓ should use default defaultFilename if no defaultFilename were passed
      ✓ should use defaultFilename sources if defaultFilename were passed
    sources
      ✓ should use default sources if no sources were passed
      ✓ should use passed sources if sources were passed
      ✓ should extend sources if recursive flag is set
    subdirectories
      ✓ should use default subdirectories if no subdirectories were passed
      ✓ should convert extensions to lower case
      ✓ should use passed subdirectories if subdirectories were passed
      ✓ should use null if null was passed
    request
      ✓ should use default request if no request were passed
      ✓ should merge default and passed objects if request were passed
      ✓ should override existing properties if request were passed
    resourceHandler
      ✓ should create resourceHandler with correct params
    urls
      ✓ should create an Array of urls if string was passed
    resources
      ✓ should create Resource object for each url
      ✓ should use urls filename
      ✓ should use default filename if no url filename was provided
    actions
      ✓ should add empty actions when no plugins
      ✓ should add actions when plugin set
      ✓ should add actions when multiple plugins set
      ✓ should throw error if plugin has wrong action
    mandatory actions
      ✓ should add mandatory actions - saveResource and generateFilename

  Scraper
    #loadResource
      ✓ should add different resource to the map
      ✓ should not add the same resource twice
    #saveResource
      ✓ should call handleError on error
    #requestResource
      ✓ should call handleError on error
      ✓ should update resource data with data returned from request
      url filtering
        ✓ should request the resource if the urlFilter returns true
        ✓ should return promise resolved with null if the urlFilter returns false
      depth filtering
        ✓ should request the resource if the maxDepth option is not set
        ✓ should request the resource if maxDepth is set and resource depth is less than maxDept
        ✓ should request the resource if maxDepth is set and resource depth is equal to maxDept
        ✓ should return null if maxDepth is set and resource depth is greater than maxDepth
    #handleError
      ✓ should ignore error and return resolved promise if ignoreErrors option is true
      ✓ should return rejected promise if ignoreErrors option is false
    #scrape
      ✓ should call load
      ✓ should call errorCleanup on error
      ✓ should return array of objects with url, filename and children
    #runActions
      ✓ should run all actions with correct params and save result from last
      ✓ should fail if one of actions fails
      ✓ should return passed params as result if no actions to run
    export defaults
      ✓ should export defaults
    export plugins
      ✓ should export default plugins
    default saveResource plugin
      ✓ should create directory on scrape
      ✓ should save resource to FS
      ✓ should remove directory on error
      ✓ should return error if no directory
      ✓ should return error if empty directory passed
      ✓ should return error if incorrect directory passed
      ✓ should return error if existing directory passed
    default generateFilename plugins
      ✓ should use byType plugin if filenameGenerator option is set
      ✓ should use bySiteStructure plugin if filenameGenerator option is set
      ✓ should ignore filenameGenerator option if function passed

  NormalizedUrlMap
    #get
      ✓ should find nothing if no value with same url was set
      ✓ should return value with same url

  Utils
    #isUrl(url)
      ✓ should return true if url starts with "http[s]://"
      ✓ should return true if url starts with "//"
      ✓ should return false if url starts neither with "http[s]://" nor "//"
    #getUrl(url, path)
      ✓ should return url + path if path is not url
      ✓ should return path if it is url
      ✓ should use the protocol from the url, if the path is a protocol-less url
    #getUnixPath(path)
      ✓ should convert to unix format for windows
      ✓ should return unconverted path for unix
    #getFilenameFromUrl(url)
      ✓ should return last path item as filename & trim all after first ? or #
      ✓ should return unconverted filename if there are no ?,#
      ✓ should decode escaped chars
    #getFilepathFromUrl
      ✓ should return empty sting if url has no pathname
      ✓ should return path if url has pathname
      ✓ should return path including filename if url has pathname
      ✓ should not contain trailing slash
      ✓ should normalize slashes
      ✓ should decode escaped chars
      ✓ should return path as is if url is malformed
    #getHashFromUrl
      ✓ should return hash from url
      ✓ should return empty string if url doesn't contain hash
    #getRelativePath
      ✓ should return relative path
      ✓ should escape path components with encodeURIComponent
      ✓ should also escape ['()]
    #shortenFilename
      ✓ should leave file with length < 255 as is
      ✓ should shorten file with length = 255
      ✓ should shorten file with length > 255
      ✓ should shorten file with length = 255 and keep extension
      ✓ should shorten file with length > 255 and keep extension
      ✓ should shorten file with length > 255 to have basename length 20 chars
    #prettifyFilename
      ✓ should delete default filename if filename === defaultFilename
      ✓ should delete default filename if filename ends with defaultFilename
      ✓ should not prettify if defaultFilename is part of filename
    #isUriSchemaSupported
      ✓ should return false for mailto:
      ✓ should return false for javascript:
      ✓ should return false for skype:
      ✓ should return true for http:
      ✓ should return true for https:
      ✓ should return true for relative paths
    #decodeHtmlEntities
      ✓ should return empty string if not string passed
      ✓ should return decoded text if string passed
    #urlsEqual
      ✓ should return false for /path and /path/
    #normalizeUrl
      ✓ should return original url if it is malformed

  Functional: base
    ✓ should load multiple urls to single directory with all specified sources (77ms)
    ✓ should load multiple urls to single directory with all specified sources with bySiteStructureFilenameGenerator (69ms)

  Functional: check it works
    ✓ should work with promise

  Functional: onResourceSaved and onResourceError callbacks in plugin
    ✓ should call onResourceSaved callback and onResourceError callback if ignoreErrors = true
    ✓ should call onResourceError callback if ignoreErrors = false

  Functional circular dependencies
    ✓ should correctly load files with circular dependency

  Functional: css handling
    ✓ should correctly handle css files, style tags and style attributes and ignore css-like text inside common html tags (38ms)

  Functional: data urls handling
    ✓ should correctly handle html files with data urls in attributes

  Functional error handling
    FS Error
      ✓ should remove directory and immediately reject on fs error if ignoreErrors is false (617ms)
      ✓ should ignore fs error if ignoreErrors is true (619ms)
    Resource Handler Error
      ✓ should remove directory and immediately reject on resource handler error if ignoreErrors is false (622ms)
      ✓ should ignore resource handler error if ignoreErrors is true (623ms)

  Functional: html entities
    ✓ should decode all html-entities found in html files and not encode entities from css file (54ms)

  Functional html id href
    ✓ should ignore same-file paths and update other-file paths

  Functional: maxDepth and maxRecursiveDepth
    ✓ should filter out all resources by depth > maxDepth (41ms)
    ✓ should filter out only anchors by depth > maxRecursiveDepth (42ms)
    ✓ should correctly save same resource with different depth and maxRecursiveDepth

  Functional recursive downloading
    ✓ should follow anchors if recursive flag is set
    ✓ should follow anchors with depth <= maxDepth if recursive flag and maxDepth are set
    ✓ should not follow anchors if recursive flag is not set

  Functional redirects
    ✓ should follow redirects and save resource once if it has different urls
    ✓ should correctly handle relative source in redirected page

  Functional concurrent requests
    ✓ should have maximum concurrent requests == requestConcurrency option

  Functional: afterResponse action in plugin
    ✓ should skip downloading resource if afterResponse returns null

  Functional: customize request options with plugin
    ✓ should use options from request property if no beforeRequest actions
    ✓ should use options returned by beforeRequest action

  Functional: plugin for saving resources
    ✓ should use passed resourceSaver when saving resource
    ✓ should use passed resourceSaver on error

  Functional resources without extensions
    ✓ should load resources without extensions with correct type and wrap with extensions (38ms)

  Functional: update missing sources
    ✓ should not update missing sources by default
    ✓ should update missing sources if missing resource plugin added
    ✓ should update missing sources when source was rejected by urlFilter
    ✓ should update missing sources when source was rejected by maxRecursiveDepth
    ✓ should update missing sources if one of pathContainers path was failed


  223 passing (5s)

---------------------------------------|---------|----------|---------|---------|-------------------
File                                   | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
---------------------------------------|---------|----------|---------|---------|-------------------
All files                              |   99.85 |    99.06 |     100 |   99.85 |
 node-website-scraper                  |     100 |      100 |     100 |     100 |
  index.mjs                            |     100 |      100 |     100 |     100 |
 node-website-scraper/lib              |     100 |    99.15 |     100 |     100 |
  logger.js                            |     100 |      100 |     100 |     100 |
  request.js                           |     100 |      100 |     100 |     100 |
  resource.js                          |     100 |      100 |     100 |     100 |
  scraper.js                           |     100 |    98.75 |     100 |     100 | 279
 node-website-scraper/lib/config       |     100 |      100 |     100 |     100 |
  defaults.js                          |     100 |      100 |     100 |     100 |
  recursive-sources.js                 |     100 |      100 |     100 |     100 |
  resource-ext-by-type.js              |     100 |      100 |     100 |     100 |
  resource-type-by-ext.js              |     100 |      100 |     100 |     100 |
  resource-type-by-mime.js             |     100 |      100 |     100 |     100 |
  resource-types.js                    |     100 |      100 |     100 |     100 |
 ...ite-scraper/lib/filename-generator |     100 |      100 |     100 |     100 |
  by-site-structure.js                 |     100 |      100 |     100 |     100 |
  by-type.js                           |     100 |      100 |     100 |     100 |
 node-website-scraper/lib/plugins      |     100 |      100 |     100 |     100 |
  ...enamy-by-site-structure-plugin.js |     100 |      100 |     100 |     100 |
  generate-filenamy-by-type-plugin.js  |     100 |      100 |     100 |     100 |
  ...relative-path-reference-plugin.js |     100 |      100 |     100 |     100 |
  index.js                             |     100 |      100 |     100 |     100 |
  save-resource-to-fs-plugin.js        |     100 |      100 |     100 |     100 |
 ...bsite-scraper/lib/resource-handler |     100 |      100 |     100 |     100 |
  index.js                             |     100 |      100 |     100 |     100 |
 ...e-scraper/lib/resource-handler/css |     100 |      100 |     100 |     100 |
  index.js                             |     100 |      100 |     100 |     100 |
 ...-scraper/lib/resource-handler/html |     100 |      100 |     100 |     100 |
  html-source-element.js               |     100 |      100 |     100 |     100 |
  index.js                             |     100 |      100 |     100 |     100 |
 ...b/resource-handler/path-containers |     100 |      100 |     100 |     100 |
  css-text.js                          |     100 |      100 |     100 |     100 |
  html-common-tag.js                   |     100 |      100 |     100 |     100 |
  html-img-srcset-tag.js               |     100 |      100 |     100 |     100 |
 node-website-scraper/lib/utils        |   99.06 |       96 |     100 |   99.06 |
  index.js                             |   98.98 |    95.65 |     100 |   98.98 | 115-116
  normalized-url-map.js                |     100 |      100 |     100 |     100 |
---------------------------------------|---------|----------|---------|---------|-------------------

> website-scraper@4.2.3 eslint /private/tmp/node-website-scraper
> eslint lib/** index.mjs

@s0ph1e
Copy link
Member

s0ph1e commented Oct 16, 2021

Ok, looks like I need to fix tests for unix and run tests locally. I'll get back soon

@s0ph1e
Copy link
Member

s0ph1e commented Oct 18, 2021

I ran tests locally and I have same 12 failing tests as on appveyor (windows) CI
I fixed unix CI for branch 4.x in #462 , I'll reopen the PR to trigger tests run on Github Actions

@s0ph1e s0ph1e closed this Oct 18, 2021
@s0ph1e s0ph1e reopened this Oct 18, 2021
@s0ph1e
Copy link
Member

s0ph1e commented Oct 18, 2021

Hmm, I see same 12 failing tests on CI, looks like they are related to breaking changes in cheerio library.

@brianlukoff are you sure that tests for you pass with cheerio@1.0.0-rc.10? Could you please double check the output of npm ls cheerio command to be sure what exact version you have?

@brianlukoff
Copy link
Author

Yikes - you are right - I was running tests on master instead of 4.x, which was the branch that actually changed.

@s0ph1e
Copy link
Member

s0ph1e commented Oct 18, 2021

@brianlukoff I this it doesn't matter on which branch you try to update cheerio - it will fail on both because of breaking changes. I've tried to update it before several times - and it always failed. I think this update requires understanding what are the breaking changes and making appropriate updates to the code of website-scraper

@brianlukoff
Copy link
Author

Understood -- I did not make any changes to master which is why the tests were still passing there. :)

@s0ph1e
Copy link
Member

s0ph1e commented Dec 24, 2021

Cheerio update will be done in #467, closing this PR

@s0ph1e s0ph1e closed this Dec 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants