Heritrix treats inline images as relative URLs #214

Closed
csrster opened this issue Oct 11, 2018 · 8 comments · Fixed by #288

@csrster
Contributor

csrster commented Oct 11, 2018

For example, the page http://haggmark.dk/solgt/oversigt contains this element:

<div class="ejendom rammebaggrund"><a href="http://haggmark.dk/sag/13009"><div class="solgtlabel"><img src="http://haggmark.dk/foto/SolgtLabel&#xA; " style="width:70px;height:70px; border: none;"></div><img class="foto" src=" ...
Heritrix constructs an enormous URL by concatenating the base64-encoded data onto the page URL, as if the data were a relative path.
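
One way to avoid this kind of concatenation is to recognise inline data before any relative resolution happens. The following is a minimal sketch only, not the fix that was eventually merged: `InlineDataUriGuard` and `isInlineDataUri` are hypothetical names, and skipping leading whitespace and quote characters is an assumption about what might hide the `data:` scheme from the extractor.

```java
final class InlineDataUriGuard {
    private static final String DATA_SCHEME = "data:";

    /**
     * Returns true if the raw attribute value looks like an inline data URI.
     * Leading whitespace and quote characters are skipped first; that
     * tolerance is an assumption, not confirmed Heritrix behaviour.
     */
    static boolean isInlineDataUri(String rawAttributeValue) {
        int start = 0;
        while (start < rawAttributeValue.length()
                && isIgnorablePrefixChar(rawAttributeValue.charAt(start))) {
            start++;
        }
        // URI schemes are case-insensitive, so compare ignoring case.
        return rawAttributeValue.regionMatches(
                true, start, DATA_SCHEME, 0, DATA_SCHEME.length());
    }

    private static boolean isIgnorablePrefixChar(char c) {
        return Character.isWhitespace(c) || c == '"' || c == '\'';
    }

    public static void main(String[] args) {
        System.out.println(isInlineDataUri("data:image/png;base64,iVBORw0KGgo="));
        System.out.println(isInlineDataUri(" \"DATA:image/gif;base64,R0lGOD"));
        System.out.println(isInlineDataUri("http://haggmark.dk/foto/SolgtLabel"));
    }
}
```

An extractor could call such a check on each `src` value and skip the ones that match, so that the payload never reaches the code that resolves relative paths.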

@anjackson
Collaborator

I've been attempting to create an ExtractorHTML test case for this. Although it does extract the data URI, it doesn't seem to treat it as a relative path and construct an HTTP URL from it. Are you using a different extractor, or am I missing something?

@anjackson
Collaborator

Bump. @csrster, are any more details available?

@csrster
Contributor Author

csrster commented Mar 15, 2019 via email

@ato
Collaborator

ato commented Mar 16, 2019

Adding the following to ExtractorHtmlTest:

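    // Ad-hoc reproduction: fetches the archived page over the network
    // and prints every link the extractor finds.
    public void test() throws IOException {
        String url = "http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt";
        CrawlURI curi = new CrawlURI(UURIFactory.getInstance(url));
        String content = IOUtils.toString(new URL(url).openStream());
        // Run the HTML extractor; discovered links are stored on the CrawlURI.
        getExtractor().extract(curi, content);

        // Sort the outlinks so the printed output is stable between runs.
        CrawlURI[] links = curi.getOutLinks().toArray(new CrawlURI[0]);
        Arrays.sort(links);
        for (CrawlURI link : links) {
            System.out.println(link.getURI());
        }
    }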

Yields a lot of log errors like this one:

Mar 16, 2019 4:31:26 PM org.archive.modules.extractor.UnitTestUriLoggerModule logUriError
INFO: http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt
org.apache.commons.httpclient.URIException: Created (escaped) uuri > 2083: http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/%22
	at org.archive.url.UsableURIFactory.validityCheck(UsableURIFactory.java:327)
	at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:310)
	at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
	at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
	at org.archive.modules.extractor.ExtractorHTML.addLinkFromString(ExtractorHTML.java:663)
	at org.archive.modules.extractor.ExtractorHTML.processEmbed(ExtractorHTML.java:695)
	at org.archive.modules.extractor.ExtractorHTML.processGeneralTag(ExtractorHTML.java:459)
	at org.archive.modules.extractor.ExtractorHTML.extract(ExtractorHTML.java:855)

Because of the exception, though, these URLs are not returned as extracted links.
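
The behaviour described above (one logged error per overlong URL, and no outlink for it) can be illustrated with a small sketch. This is not the `UsableURIFactory` code: `LengthCheckedLinkCollector`, `validate` and `collect` are hypothetical names, and the only detail taken from the log is the 2083-character limit in the exception message.

```java
import java.util.ArrayList;
import java.util.List;

final class LengthCheckedLinkCollector {
    // Limit quoted in the exception message in the log above.
    static final int MAX_URL_LENGTH = 2083;

    /** Hypothetical stand-in for a validity check that rejects overlong URLs. */
    static String validate(String candidate) {
        if (candidate.length() > MAX_URL_LENGTH) {
            throw new IllegalArgumentException(
                    "Created (escaped) uuri > " + MAX_URL_LENGTH);
        }
        return candidate;
    }

    /**
     * Keeps the candidates that pass validation. Each rejected candidate
     * produces one entry in errors and is left out of the returned links.
     */
    static List<String> collect(List<String> candidates, List<String> errors) {
        List<String> links = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                links.add(validate(candidate));
            } catch (IllegalArgumentException e) {
                errors.add(e.getMessage());
            }
        }
        return links;
    }
}
```

Under this model every inline image on a page costs one error-log entry, even though none of the overlong URLs is ever queued.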

@ato added the bug label Mar 17, 2019
@csrster
Contributor Author

csrster commented Dec 10, 2019

Hi again,
It seems like we agree that there's a bug here. IIRC, our problem wasn't so much with Heritrix queueing these URLs as with the Heritrix error logs becoming enormous. So we would still be interested in seeing our pull request accepted.
cheers!
Colin

@ato
Collaborator

ato commented Dec 10, 2019

Yep. Do you mean there's already a pull request for this? I couldn't find it. Could you link it?

@csrster
Contributor Author

csrster commented Dec 10, 2019

Digging through our issue history in our private Jira I found this comment:

2018-10-10 07:06:00.376 INFO thread-62 org.archive.modules.deciderules.MatchesListRegexDecideRule.evaluate() Timeout matching regex '.*[a-zA-Z0-9\W-]+\.dk.*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\1(?=\/).*\2(?=\/).*\3(?=\/)|\1(?=\/).*\3(?=\/).*\2(?=\/)|\2(?=\/).*\1(?=\/).*\3(?=\/)|\2(?=\/).*\3(?=\/).*\1(?=\/)|\3(?=\/).*\2(?=\/).*\1(?=\/)|\3(?=\/).*\1(?=\/).*\2(?=\/)).*' to url 'http://ryd-lortet.dk/%22 ...

So what happened here is that the giant URL was constructed from the inline data. We have actually modified MatchesListRegexDecideRule to include a timeout on the regex matching, and logging from the modified rule shows that matching the giant URL against our hideous regex was giving us extra problems on top of the error-log inflation. I think that must mean that at least some of these inline URLs get past the validityCheck.
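
For readers unfamiliar with regex timeouts in Java: `java.util.regex` has no built-in timeout, so a common workaround is to wrap the input in a `CharSequence` that throws once a deadline has passed. The sketch below shows that general technique only; it is not the modification described above, and `TimedRegexMatcher`, `matchesWithin` and `RegexTimeoutException` are hypothetical names. Treating a timeout as "no match" is also an assumption.

```java
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

final class TimedRegexMatcher {

    /** Thrown from charAt() when the matching deadline has passed. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException() {
            super("regex matching exceeded its deadline");
        }
    }

    /** CharSequence view that checks the deadline on every character read. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // Subtraction keeps the comparison valid if nanoTime() wraps around.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException();
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns false if the input does not match or the timeout expires first. */
    static boolean matchesWithin(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
        } catch (RegexTimeoutException e) {
            return false; // assumption: a timed-out match counts as no match
        }
    }
}
```

The deadline is checked on the matching thread itself, so no extra thread or interrupt handling is needed. The cost is one clock read per character access.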

We'll be coming with a separate pull-request for the timeout on the decide rule real soon now.

cheers again!
Colin

@csrster
Contributor Author

csrster commented Dec 10, 2019

You must be right Alex - I thought we'd actually made a pull request, but now I see it was only a bug report. Give me a minute!
Colin
