
Support filtering / limiting scope of URLs #52

Closed
emersonthis opened this issue Oct 30, 2020 · 8 comments · Fixed by #65
Labels: enhancement (New feature or request)

emersonthis commented Oct 30, 2020

Andy D:

One thing I didn't see in the docs is whether it was possible to limit the depth or number of pages in the crawl - on some sites (retailers / publishers) I could see the crawl size getting pretty large

Others have asked about maybe some kind of flag to filter urls. All seem to be thinking about the same use-case: more efficiently analyzing chunks of a big site.

emersonthis added the "enhancement" label Oct 30, 2020

calebeby commented Nov 2, 2020

I like this idea! Should URLs passed via the flag be excluded both from crawling and lighthouse-ing? Or just from lighthouse-ing? i.e. if a URL is excluded, should the crawler discover pages that are linked from the excluded page? @emersonthis


emersonthis commented Nov 3, 2020

Great question. My gut is that we'd want to filter the reports but "crawl through" ineligible URLs. I passed this question along to two of Jason's performance colleagues who mentioned this idea. I'll update here whenever I hear their thoughts.

In theory we could also support either behavior. Maybe with two different flags? As a product designer, I usually discourage punting decisions like this to the user, but there might be two valid use cases here, and I suspect the resulting implementation wouldn't be meaningfully more complicated either way.


esbenam commented Dec 11, 2020

I just discovered this tool and tried the spreadsheet. It's so nice to have such a handy solution easily available, thank you!

I think the filtering would be a great enhancement!

One scenario where this could be useful is dealing with multiple translations/markets on a site that doesn't encode the translation/market in the domain. If the pages are the same across all the languages except for the text content, you might want to ignore all the translated pages: for instance, test shop.com/* but exclude shop.com/fr/, shop.com/de/, etc.

Ignoring languages could of course hide potential font performance problems in a specific language. Maybe that could be addressed by supporting patterns in the include/exclude paths, so you could still include and test the main entry page for shop.com/fr/ and shop.com/de/ but avoid crawling beneath them.
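The "audit the entry page but skip the rest of the subtree" idea could be sketched roughly like this. The function name and the path-prefix pattern format are hypothetical illustrations, not lighthouse-parade's actual API:

```javascript
/**
 * Decide whether a URL should be audited when path prefixes like "/fr/"
 * are excluded: the section's entry page itself is still audited, but
 * everything beneath it is skipped.
 */
function shouldAudit(url, excludedPrefixes) {
  const { pathname } = new URL(url);
  return excludedPrefixes.every((prefix) => {
    if (!pathname.startsWith(prefix)) return true; // outside the excluded tree
    return pathname === prefix; // the entry page itself is still allowed
  });
}
```

With this, `shouldAudit('https://shop.com/fr/', ['/fr/', '/de/'])` is true while `shouldAudit('https://shop.com/fr/shoes', ['/fr/', '/de/'])` is false.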


rickgregory commented Dec 12, 2020

I'd love some kind of filter. My off-the-cuff idea for doing this would be to support two things:

First, a depth option (--depth 1 would crawl example.com and example.com/*/ but no deeper).

Second, I'd also like to be able to crawl starting from a given directory, e.g. https://example.com/events/ would crawl all pages in /events, including any pages in subdirectories of /events. So it would crawl https://example.com/events/January.html and also all documents under https://example.com/events/January/.
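The two knobs above can be sketched as small helpers (names are illustrative, and counting depth as path segments is one assumption about what --depth would mean):

```javascript
/**
 * Count path segments: "https://example.com/" -> 0,
 * "https://example.com/events/Jan/a.html" -> 3.
 * A --depth 1 limit would then keep example.com/ and example.com/foo/
 * but drop example.com/foo/bar/.
 */
function pathDepth(url) {
  return new URL(url).pathname.split('/').filter(Boolean).length;
}

/**
 * Scope the crawl to a directory: only URLs at or below the
 * start URL's path are eligible.
 */
function inScope(url, startUrl) {
  return url.startsWith(startUrl);
}
```

So `inScope('https://example.com/events/January.html', 'https://example.com/events/')` is true, while a URL under /about/ is filtered out.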

royteeuwen commented:

Yup, I totally have the same use case! We have multiple brands under the same top-level domain, so it would be nice to be able to start the crawl from a given directory.


findorf commented Dec 14, 2020

Having 14 languages, I'd like to limit the crawler too. It could also be nice to be able to limit more than just depth, e.g. N pages per URL level.

emersonthis commented:

@calebeby Looks like simplecrawler already supports discoverRegex and maxDepth options, so supporting most of what's described above should be as simple as adding new option flags and passing them through to the crawler.
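As a rough sketch of that wiring: simplecrawler exposes `maxDepth` as a crawler property and `addFetchCondition()` for per-URL filtering, so a CLI layer would mostly translate flags into those settings. The flag names and glob handling below are assumptions for illustration, not what #65 actually implemented:

```javascript
// Convert a simple glob like "https://shop.com/fr/*" into a RegExp.
// Only "*" is supported here; real glob handling would need more care.
function globToRegExp(glob) {
  // Escape regex metacharacters except "*", then let "*" match anything.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

/**
 * Translate hypothetical CLI flags into the two simplecrawler hooks:
 * `maxDepth` (0 = unlimited) maps straight onto crawler.maxDepth, and the
 * exclude globs become a predicate you could register via
 * crawler.addFetchCondition((item) => shouldFetch(item.url)).
 */
function buildCrawlerOptions({ maxDepth = 0, excludeGlobs = [] } = {}) {
  const patterns = excludeGlobs.map(globToRegExp);
  return {
    maxDepth,
    shouldFetch: (url) => !patterns.some((re) => re.test(url)),
  };
}
```

For example, `buildCrawlerOptions({ maxDepth: 2, excludeGlobs: ['https://shop.com/fr/*'] })` yields options whose `shouldFetch` rejects URLs under /fr/ while allowing everything else.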

@calebeby calebeby self-assigned this Dec 14, 2020
@calebeby calebeby linked a pull request Dec 18, 2020 that will close this issue
calebeby commented:

This has been released in 1.1.0
