Support for extracting URLs in sitemaps #262

Merged (7 commits) on May 20, 2021

kris-sigur
Collaborator

This builds on work by @anjackson. I'm submitting it for discussion; it needs more testing before it can be considered for merging.

I've modified the two extractors he wrote, removing the BL-specific recrawl elements. I renamed them to match Heritrix extractor naming conventions and added a new hop type, M (Manifest), for the links they extract. Other than that, they are mostly as Andy wrote them, barring a few minor tweaks.

As the sitemap extractor requires a new dependency (crawler-commons), I added that to the 'modules' sub-project. This creates a dependency conflict, because crawler-commons requires a newer version of commons-io, so I upgraded commons-io (from 1.4 to 2.4) in the 'commons' sub-project. As far as I can tell, this doesn't cause any regressions.
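
For discussion purposes, here is a minimal sketch of the idea behind sitemap link extraction: pull every `<loc>` value out of a sitemap document and label it with the new `M` hop. It uses only the JDK's XML parser rather than crawler-commons, and the names `SitemapLinkSketch`, `Outlink`, and `MANIFEST_HOP` are hypothetical, not the classes in this PR or any Heritrix API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class SitemapLinkSketch {
    // Hypothetical constant mirroring the 'M' (Manifest) hop type described above.
    static final char MANIFEST_HOP = 'M';

    // Hypothetical holder for a discovered link and the hop it was found by.
    static final class Outlink {
        final String url;
        final char hop;

        Outlink(String url, char hop) {
            this.url = url;
            this.hop = hop;
        }
    }

    // Collects the text of every <loc> element, in document order.
    // This covers both <urlset> and <sitemapindex> documents, since both use <loc>.
    static List<Outlink> extract(String sitemapXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));

        // "*" matches <loc> in any namespace, including the sitemaps.org one.
        NodeList locs = doc.getElementsByTagNameNS("*", "loc");
        List<Outlink> links = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            String url = locs.item(i).getTextContent().trim();
            if (!url.isEmpty()) {
                links.add(new Outlink(url, MANIFEST_HOP));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.org/a</loc></url>"
                + "<url><loc>https://example.org/b</loc></url>"
                + "</urlset>";
        for (Outlink link : extract(xml)) {
            System.out.println(link.hop + " " + link.url);
        }
    }
}
```

A real extractor would also need to handle compressed sitemaps, size limits, and malformed XML, which is part of what crawler-commons provides.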

@anjackson
Collaborator

BTW, taking on crawler-commons as a dependency also means we could consider allowing use of (or switching to) their robots.txt parser. It's more powerful than the current H3 one, and so should be able to resolve #250.

@anjackson
Collaborator

Hm, looks like this needs some more work before we can go ahead, so I'll defer this for now.

@ato mentioned this pull request May 12, 2021
@anjackson marked this pull request as ready for review May 20, 2021 20:03
@anjackson
Collaborator

Okay, this looks like a sensible port of the UKWA code with the revisit logic taken out. I'll look at shifting our modules to be subclasses of these later on.

@anjackson merged commit d7869de into internetarchive:master May 20, 2021