
Support filtering / limiting scope of URLs #52

Closed
emersonthis opened this issue Oct 30, 2020 · 8 comments · Fixed by #65
Labels: enhancement (New feature or request)

emersonthis commented Oct 30, 2020

Andy D:

One thing I didn't see in the docs is whether it was possible to limit the depth or number of pages in the crawl - on some sites (retailers / publishers) I could see the crawl size getting pretty large

Others have asked about maybe some kind of flag to filter urls. All seem to be thinking about the same use-case: more efficiently analyzing chunks of a big site.

emersonthis added the "enhancement" label Oct 30, 2020

calebeby commented Nov 2, 2020

I like this idea! Should URLs passed via the flag be excluded both from crawling and lighthouse-ing? Or just from lighthouse-ing? i.e. if a URL is excluded, should the crawler discover pages that are linked from the excluded page? @emersonthis


emersonthis commented Nov 3, 2020

Great question. My gut is that we'd want to filter the reports but "crawl through" ineligible URLs. I passed this question along to two of Jason's performance colleagues who mentioned this idea. I'll update here whenever I hear their thoughts.

In theory we could also support either behavior. Maybe with two different flags? As a product designer, I usually discourage punting decisions like this to the user, but there might be two valid use cases here, and I suspect the resulting implementation wouldn't be meaningfully more complicated either way.


esbenam commented Dec 11, 2020

I just discovered this tool and tried the spreadsheet. It's so nice to have such a handy solution easily available, thank you!

I think the filtering would be a great enhancement!

One scenario where this could be useful is dealing with multiple translations/markets on a site that doesn't encode the translation/market in the domain. If the pages are the same across all the languages except for the text content, you might want to ignore all the translated pages: for instance, test shop.com/* but exclude shop.com/fr/, shop.com/de/, etc.

Ignoring languages could of course hide potential font performance problems in a specific language. Maybe that could be addressed by supporting patterns in the include/exclude paths, so you could still include and test the main entry page for shop.com/fr/ and shop.com/de/ but avoid crawling beneath them.
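The "audit the entry page but skip the rest of the subtree" idea could be sketched roughly like this. The function name and the path-prefix pattern format are hypothetical illustrations, not lighthouse-parade's actual API:

```javascript
/**
 * Decide whether a URL should be audited when path prefixes like "/fr/"
 * are excluded: the section's entry page itself is still audited, but
 * everything beneath it is skipped.
 */
function shouldAudit(url, excludedPrefixes) {
  const { pathname } = new URL(url);
  return excludedPrefixes.every((prefix) => {
    if (!pathname.startsWith(prefix)) return true; // outside the excluded tree
    return pathname === prefix; // the entry page itself is still allowed
  });
}
```

With this, `shouldAudit('https://shop.com/fr/', ['/fr/', '/de/'])` is true while `shouldAudit('https://shop.com/fr/shoes', ['/fr/', '/de/'])` is false.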


rickgregory commented Dec 12, 2020

I'd love some kind of filter. My off-the-cuff idea for doing this would be to support two things:

First, a depth option (--depth 1 would crawl example.com and example.com/*/ but no deeper).

Second, I'd also like to be able to crawl starting from a given directory, e.g. https://example.com/events/ would crawl all pages in /events, including any pages in subdirectories of /events. So it would crawl https://example.com/events/January.html and also all documents under https://example.com/events/January/.
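The two knobs above can be sketched as small helpers (names are illustrative, and counting depth as path segments is one assumption about what --depth would mean):

```javascript
/**
 * Count path segments: "https://example.com/" -> 0,
 * "https://example.com/events/Jan/a.html" -> 3.
 * A --depth 1 limit would then keep example.com/ and example.com/foo/
 * but drop example.com/foo/bar/.
 */
function pathDepth(url) {
  return new URL(url).pathname.split('/').filter(Boolean).length;
}

/**
 * Scope the crawl to a directory: only URLs at or below the
 * start URL's path are eligible.
 */
function inScope(url, startUrl) {
  return url.startsWith(startUrl);
}
```

So `inScope('https://example.com/events/January.html', 'https://example.com/events/')` is true, while a URL under /about/ is filtered out.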

royteeuwen commented:

Yup, I totally have the same use case! We have multiple brands under the same top-level domain, so it would be nice to be able to start the crawl from a given directory.


findorf commented Dec 14, 2020

Having 14 languages, I'd like to limit the crawler too. It could also be nice to be able to limit more than just depth, e.g. N pages per URL level.

emersonthis commented:

@calebeby Looks like simplecrawler already supports discoverRegex and maxDepth options, so supporting most of what's described above should be as simple as adding new option flags and passing them through to the crawler.
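As a rough sketch of that wiring: simplecrawler exposes `maxDepth` as a crawler property and `addFetchCondition()` for per-URL filtering, so a CLI layer would mostly translate flags into those settings. The flag names and glob handling below are assumptions for illustration, not what #65 actually implemented:

```javascript
// Convert a simple glob like "https://shop.com/fr/*" into a RegExp.
// Only "*" is supported here; real glob handling would need more care.
function globToRegExp(glob) {
  // Escape regex metacharacters except "*", then let "*" match anything.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

/**
 * Translate hypothetical CLI flags into the two simplecrawler hooks:
 * `maxDepth` (0 = unlimited) maps straight onto crawler.maxDepth, and the
 * exclude globs become a predicate you could register via
 * crawler.addFetchCondition((item) => shouldFetch(item.url)).
 */
function buildCrawlerOptions({ maxDepth = 0, excludeGlobs = [] } = {}) {
  const patterns = excludeGlobs.map(globToRegExp);
  return {
    maxDepth,
    shouldFetch: (url) => !patterns.some((re) => re.test(url)),
  };
}
```

For example, `buildCrawlerOptions({ maxDepth: 2, excludeGlobs: ['https://shop.com/fr/*'] })` yields options whose `shouldFetch` rejects URLs under /fr/ while allowing everything else.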

@calebeby calebeby self-assigned this Dec 14, 2020
@calebeby calebeby linked a pull request Dec 18, 2020 that will close this issue
calebeby commented:

This has been released in 1.1.0
