Support filtering / limiting scope of URLs #52
Comments
I like this idea! Should URLs passed via the flag be excluded both from crawling and from lighthouse-ing, or just from lighthouse-ing? I.e. if a URL is excluded, should the crawler still discover pages that are linked from the excluded page? @emersonthis
Great question. My gut is that we'd want to filter the reports but "crawl through" ineligible URLs. I passed this question along to two of Jason's performance colleagues who mentioned this idea, and I'll update here once I hear their thoughts. In theory we could also support either behavior, maybe with two different flags? As a product designer I usually discourage punting decisions like this to the user, but there might be two valid use cases here, and I suspect the resulting implementation wouldn't be meaningfully more complicated either way.
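To illustrate the "filter reports but crawl through" behavior, here is a rough sketch. Note that `excludePatterns` and `runLighthouse` are made-up names for this example, not existing options in the tool; only the `fetchcomplete` event is real simplecrawler API:

```js
// Hypothetical sketch: skip Lighthouse for excluded URLs, but still crawl them.
// `excludePatterns` and `runLighthouse` are placeholders, not existing options.
const excludePatterns = [/\/fr\//, /\/de\//];

crawler.on('fetchcomplete', (queueItem) => {
  const url = queueItem.url;
  // The page was fetched, so the crawler still follows its links;
  // we only skip the Lighthouse report for matching URLs.
  const excluded = excludePatterns.some((re) => re.test(url));
  if (!excluded) {
    runLighthouse(url); // placeholder for the tool's Lighthouse step
  }
});
```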
Just discovered this tool and tried the spreadsheet, and it is so nice to have such a handy solution easily available, thank you! I think the filtering would be a great enhancement!

One scenario where this could be useful is when dealing with multiple translations/markets on a site that doesn't put the translation/market in the domain. If the pages are the same across all the languages except for the text content, you might want to ignore all the translated pages: for instance, test shop.com/* but exclude shop.com/fr/, shop.com/de/, etc. Ignoring languages could of course hide potential font performance problems in a specific language. Maybe that could be addressed by supporting patterns in the include/exclude paths, so you could still include and test the main entry points for shop.com/fr/ and shop.com/de/ but avoid crawling deeper into them.
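A rough illustration of that "test the entry point but don't crawl deeper" idea, with made-up names (none of this is in the tool today):

```js
// Rough illustration only: test the /fr/ and /de/ entry points but skip
// anything deeper in those sections. Names are made up for the example.
const excludedSections = ['/fr/', '/de/'];

function shouldCrawl(path) {
  // Allow the entry point itself (e.g. "/fr/"), skip anything below it.
  return !excludedSections.some(
    (section) => path.startsWith(section) && path.length > section.length
  );
}

console.log(shouldCrawl('/fr/'));          // true  – entry point still tested
console.log(shouldCrawl('/fr/produits'));  // false – not crawled
console.log(shouldCrawl('/en/products'));  // true
```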
I'd love some kind of filter. My off-the-cuff idea for doing this would be to support two things. First, a depth option (--depth 1 would crawl example.com and example.com/*/ but no deeper). Second, I'd also like to be able to crawl starting from a given directory, e.g. https://example.com/events/ would crawl all pages in /events, including any pages in subdirectories of /events. So it would crawl https://example.com/events/January.html and also all documents in https://example.com/events/January/.
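A quick sketch of how those two options might combine, assuming a hypothetical `--depth` flag and start directory (neither exists in the tool yet):

```js
// Sketch only: neither --depth nor a start-directory option exists yet.
const start = new URL('https://example.com/events/');
const maxDepth = 1; // hypothetical --depth value

function inScope(url) {
  if (!url.href.startsWith(start.href)) return false; // stay under /events/
  // Count path segments below the start directory to enforce the depth limit.
  const extra = url.pathname
    .slice(start.pathname.length)
    .split('/')
    .filter(Boolean).length;
  return extra <= maxDepth;
}

console.log(inScope(new URL('https://example.com/events/January.html')));      // true
console.log(inScope(new URL('https://example.com/events/January/talk.html'))); // false – deeper than --depth 1
console.log(inScope(new URL('https://example.com/about/')));                   // false – outside /events/
```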
Yup, totally have the same use case! We have multiple brands under the same top-level domain; it would be nice to be able to start the crawl from a given directory.
Having 14 languages, I'd like to limit the crawler too. It could also be nice to be able to limit more than just depth, e.g. N pages per URL level.
@calebeby Looks like simplecrawler already supports discoverRegex and maxDepth options, so supporting most of what's described above should be as simple as adding new option flags and passing them through to the crawler. |
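For reference, a minimal sketch of what passing such options through to simplecrawler could look like. The flag names and the regex are hypothetical; `maxDepth`, `addFetchCondition`, and `fetchcomplete` are existing simplecrawler API, but the exact wiring into this tool's CLI is guesswork:

```js
// Minimal sketch: wiring hypothetical CLI flags through to simplecrawler.
const Crawler = require('simplecrawler');

const crawler = new Crawler('https://example.com/');

// Would come from a hypothetical --depth flag.
crawler.maxDepth = 2;

// A fetch condition can also limit scope, e.g. from a hypothetical --exclude
// flag; returning false skips fetching (and therefore crawling) that URL.
crawler.addFetchCondition((queueItem) => {
  return !/\/(fr|de)\//.test(queueItem.path);
});

crawler.on('fetchcomplete', (queueItem) => {
  console.log('Crawled', queueItem.url);
});

crawler.start();
```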
This has been released in |
Andy D:
Others have asked about maybe some kind of flag to filter URLs. All seem to be thinking about the same use case: more efficiently analyzing chunks of a big site.