Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crawler uses broken regexes #1918

Closed
Rich-Harris opened this issue Jul 15, 2021 · 3 comments · Fixed by #1984
Closed

Crawler uses broken regexes #1918

Rich-Harris opened this issue Jul 15, 2021 · 3 comments · Fixed by #1984
Labels
bug Something isn't working
Milestone

Comments

@Rich-Harris
Copy link
Member

Describe the bug

The prerenderer uses these regexes for crawling pages:

/** @param {string} attrs */
function get_href(attrs) {
const match = /([\s'"]|^)href\s*=\s*(?:"(.*?)"|'(.*?)'|([^\s>]*))/.exec(attrs);
return match && (match[1] || match[2] || match[3]);
}
/** @param {string} attrs */
function get_src(attrs) {
const match = /([\s'"]|^)src\s*=\s*(?:"(.*?)"|'(.*?)'|([^\s>]*))/.exec(attrs);
return match && (match[1] || match[2] || match[3]);
}
/** @param {string} attrs */
function get_srcset_urls(attrs) {
const results = [];
// Note that the srcset allows any ASCII whitespace, including newlines.
const match = /([\s'"]|^)srcset\s*=\s*(?:"(.*?)"|'(.*?)'|([^\s>]*))/s.exec(attrs);
if (match) {
const attr_content = match[1] || match[2] || match[3];
// Parse the content of the srcset attribute.
// The regexp is modelled after the srcset specs (https://html.spec.whatwg.org/multipage/images.html#srcset-attribute)
// and should cover most reasonable cases.
const regex = /\s*([^\s,]\S+[^\s,])\s*((?:\d+w)|(?:-?\d+(?:\.\d+)?(?:[eE]-?\d+)?x))?/gm;
let sub_matches;
while ((sub_matches = regex.exec(attr_content))) {
results.push(sub_matches[1]);
}
}
return results;
}

But this code...

function get_href(attrs) {
  const match = /([\s'"]|^)href\s*=\s*(?:"(.*?)"|'(.*?)'|([^\s>]*))/.exec(attrs);
  return match && (match[1] || match[2] || match[3]);
}

...will return ' ' if called with a string like class="foo" href="/blah", rather than '/blah'. If the attributes are reversed, it works correctly.

Changing each regex like so, so that the match indices are correct, seems to fix it:

-const match = /([\s'"]|^)href\s*=\s*(?:"(.*?)"|'(.*?)'|([^\s>]*))/.exec(attrs);
+const match = /(?:[\s'"]|^)href\s*=\s*(?:"(.*?)"|'(.*?)'|([^\s>]*))/.exec(attrs);

Reproduction

If I get a chance to do this later I will... (under the cosh right now)

Logs

No response

System Info

System:
    OS: macOS 10.15.7
    CPU: (16) x64 Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    Memory: 1.09 GB / 64.00 GB
    Shell: 5.7.1 - /bin/zsh
  Binaries:
    Node: 12.21.0 - ~/.nvm/versions/node/v12.21.0/bin/node
    Yarn: 1.22.10 - ~/.nvm/versions/node/v12.21.0/bin/yarn
    npm: 6.14.11 - ~/.nvm/versions/node/v12.21.0/bin/npm
  Browsers:
    Chrome: 91.0.4472.164
    Firefox: 89.0.2
    Safari: 13.1.3
  npmPackages:
    @sveltejs/adapter-static: ^1.0.0-next.13 => 1.0.0-next.13 
    @sveltejs/kit: ^1.0.0-next.127 => 1.0.0-next.127 
    svelte: ^3.38.2 => 3.38.2

Severity

annoyance

Additional Information

I think we need to add a new option for the 'severity' dropdown — this isn't blocking, but it's more than an annoyance, it's a serious bug

@benmccann benmccann added the bug Something isn't working label Jul 18, 2021
@benmccann
Copy link
Member

We should add some tests for these functions. I think this is the third bug that's been present in them

@jsudelko
Copy link

A thought that may or may not have merit: Have you considered parsing these values out instead of using regexes (I feel tempted to write regices because vertex -> vertices).

  1. Parsing rendered HTML seems the most straight forward. But just how much this affects perf as page-count increases, I don't know.
  2. I don't know whether it's at all possible, or worth the implementation effort to surface these values somehow in Svelte's parsing/render mechanism. If it were somehow possible, I assume said mechanism might eventually prove itself useful for other usecases/tooling beyond just getting hrefs.

@benmccann
Copy link
Member

This bug was introduced in #1743

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants