Support for extracting URLs in sitemaps #262

kris-sigur · 2019-07-05T11:37:24Z

This builds on work by @anjackson. I'm submitting it for discussion; it needs more testing before we consider merging.

I've modified the two extractors he wrote, removing the BL-specific recrawl elements, and renamed them to match Heritrix extractor naming conventions. I also added a new hop type, M (Manifest), for the links they extract. Otherwise they are mostly as Andy wrote them, barring a few minor tweaks.
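For readers unfamiliar with the two steps these extractors cover, here is a minimal, standalone sketch of the idea: find `Sitemap:` lines in a robots.txt body, then collect the `<loc>` entries from a sitemap document. It is not the code in this PR, which relies on crawler-commons and the Heritrix extractor API. The class and method names (`SitemapSketch`, `sitemapUrlsFromRobots`, `locsFromSitemap`) are hypothetical, and it uses only the JDK. It ignores gzip-compressed sitemaps, size limits, and hop-type tagging.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Heritrix or crawler-commons.
final class SitemapSketch {

    /** Collects the values of "Sitemap:" lines from a robots.txt body. */
    static List<String> sitemapUrlsFromRobots(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            // The directive name is matched case-insensitively.
            if (trimmed.regionMatches(true, 0, "sitemap:", 0, 8)) {
                String url = trimmed.substring(8).trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }

    /** Collects the text of every <loc> element in a sitemap or sitemap index. */
    static List<String> locsFromSitemap(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Sitemaps come from untrusted sites, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        // "*" matches <loc> regardless of which namespace the document declares.
        NodeList nodes = doc.getElementsByTagNameNS("*", "loc");
        List<String> locs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            locs.add(nodes.item(i).getTextContent().trim());
        }
        return locs;
    }

    public static void main(String[] args) throws Exception {
        String robots = "User-agent: *\nDisallow: /private\n"
                + "Sitemap: https://example.org/sitemap.xml\n";
        System.out.println(sitemapUrlsFromRobots(robots));

        String sitemap = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.org/a</loc></url>"
                + "</urlset>";
        System.out.println(locsFromSitemap(sitemap));
    }
}
```

In the actual extractors, each discovered URL would be added as an outlink carrying the new M hop rather than returned in a list.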

As the sitemap extractor requires a new dependency (crawler-commons), I added it to the 'modules' sub-project. This creates a dependency conflict, because crawler-commons requires a newer version of commons-io, so I upgraded commons-io (from 1.4 to 2.4) in the 'commons' sub-project. As far as I can tell, this doesn't cause any regressions.


anjackson · 2019-07-05T13:58:58Z

BTW, taking on crawler-commons as a dependency also means we could consider allowing use of (or switching to) their robots.txt parser. It's more powerful than the current H3 one, and so should be able to resolve #250.

modules/src/main/java/org/archive/modules/extractor/ExtractorRobotsTxt.java

anjackson · 2020-03-04T13:44:24Z

Hm, looks like this needs some more work before we can go ahead, so I'll defer this for now.

anjackson · 2021-05-20T20:05:09Z

Okay, this looks like a sensible port of the UKWA code with the revisit logic taken out. I'll look at shifting our modules to be subclasses of these later on.

kristinn added 5 commits July 5, 2019 11:27

Add crawler-commons dependency.

8ddaf41

Bump commons-io dependency version to match crawler-commons.

Add support for new hopptype, (M)anifest

0f7675a

Add extractor to get sitemap url from robots.txt

5941d59

Add extractor that handles sitemaps

ba8f669

Add sitemap extraction to default profile

308cee8

anjackson reviewed Jul 13, 2019

modules/src/main/java/org/archive/modules/extractor/ExtractorRobotsTxt.java

Only copy source tag if not null.

52093eb

ato mentioned this pull request May 12, 2021

Question on robots.txt #371

Closed

anjackson marked this pull request as ready for review May 20, 2021 20:03

Merge branch 'master' into sitemaps

396467c

anjackson merged commit d7869de into internetarchive:master May 20, 2021
