Heritrix treats inline images as relative URLs #214

csrster · 2018-10-11T12:59:51Z

For example, on this page http://haggmark.dk/solgt/oversigt there is an element:

<div class="ejendom rammebaggrund"><a href="http://haggmark.dk/sag/13009"><div class="solgtlabel"><img src="http://haggmark.dk/foto/SolgtLabel
 " style="width:70px;height:70px; border: none;"></div><img class="foto" src="data:image/png;base64,/9j/4AAQSkZJRgABAQEAYABg ...
Heritrix constructs an enormous URL by appending the base64-encoded data to the page URL as if it were a relative path.


anjackson · 2019-02-04T13:09:37Z

I've been attempting to create an ExtractorHTML test case for this, and although it does extract the data URI, it doesn't seem to treat it as a relative path and construct an HTTP URL from it. Are you using a different extractor? Or perhaps I'm missing something?

anjackson · 2019-03-14T15:53:13Z

Bump @csrster any more details available?

csrster · 2019-03-15T09:36:34Z

Sorry Andy, I'm not sure we can find time to dig it up and try to reproduce it right now. cheers, Colin


-- Colin Rosenthal PhD Senior IT Consultant Royal Danish Library (Aarhus)


ato · 2019-03-16T07:39:23Z

Adding the following to ExtractorHtmlTest:

    // Fetches an archived copy of the page and prints every extracted outlink.
    public void test() throws IOException {
        String url = "http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt";
        CrawlURI curi = new CrawlURI(UURIFactory.getInstance(url));
        String content = IOUtils.toString(new URL(url).openStream());
        getExtractor().extract(curi, content);

        // Sort the outlinks so the printed order is stable between runs.
        CrawlURI[] links = curi.getOutLinks().toArray(new CrawlURI[0]);
        Arrays.sort(links);
        for (CrawlURI link: links) {
            System.out.println(link.getURI());
        }
    }

Yields a lot of log errors like this one:

Mar 16, 2019 4:31:26 PM org.archive.modules.extractor.UnitTestUriLoggerModule logUriError
INFO: http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt
org.apache.commons.httpclient.URIException: Created (escaped) uuri > 2083: http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/%22data:image/png;base64,/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAARCACCAMMDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD6L+Pnxon8IrD4Y0zUry68YeUt7KtmZd1rEpz5kiJjKPggrnpk46V7PpUs19aQzR3EzRzIrqd55BGa+Tfjb+0a3w18ZeG/C1946YLa2rWmp6xbWYmvwZJQsNwZdoVQ0TBN52qzE4IzX0p+zV4i0PxH8I7K+0nxL/wkWmwwB/tUswllVBkFmwzHHB7kDB+g/V8n4mpYnNK9GMrqKSs7aNdL8zcnq1J90fi2Oy90sLTny73111u/RW2Vl2Z2FvHcZ/10/wD32atq00QJa4m+UbiN54HrioNO8Z6TP4sk0j7TAs0Vmt95rSr5ckTdCrZ545+ntXF/FP4zWdz8JLXxFod+slrdyMtjeWn7yFpVYrtZseoP3SO/XFermOfUMPRlWuny328v+HRxYXA1Ks1G1r2/H/hmeiQTyusZW6mZZhujIkO1h14PTpV2Nbrb/r5v+/hr4uufi7rXiKaSDTdej2xbpruCQvHslGFeNCCQGOCcn+Lrwa9m/Zp/avX4s6/o+mvdLMssKWwDokM0kgiUl3BOST8xwo4OK+Tyzj7B4yo6bi4vS1+tz1MZw7Xox54tM9wWO5z/AMfE/wCDmrEK3A/5bTn/AIGa1ILSGe6khjZZJIfvqvOz6+lWo9H4r6r63B7WPB9m09TJjN0DxPOP+2hqzAt0f+Xi4/7+GtOLSgO1TPapbRlpGWNF6s52gVz1MVFK+hpGPRGWY7oD/j4n/wC/ho8y6X/l4uP++zWXpvxNsdV1rU47d42sdJhaWWYHPm4HReeDnPXGeMda2vDuv2Pi2LdZt5m1FZuMYJHT1OPXpXDRzTD1fgad9vM7KmBqQ+JEObkj/j5uP+/hprC6zxcT/Xea2ZNLx2pBYbK61WiYezMN3ugP+Pif/vs1E73RP/HxP7fOa33sNx+7ULadVxrxF7MwnNx/z3n/AO+zVeU3H/PxP/38NdDJYVVk07muiNdC5Tnphc/895v++zVWX7Sv/LxP/wB9mujk06
q0umn+7XVDEIXKc3cG4P8Ay2m/77NUbn7SP+W83/fZrp59NyelU7nS8dq7KeIiTydjl7g3H/Peb/vs1QuTcHrNN/32a6m40zJ6c1QudOx1Wu+liIkuJ554h+0DVpP9IuOi/wAZ/uiitjxFpn/E4l47L/6CKK1+smkVofj74x8b+DdDvdX1fUpNDa/8SWRbR3srmae1u
	at org.archive.url.UsableURIFactory.validityCheck(UsableURIFactory.java:327)
	at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:310)
	at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
	at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
	at org.archive.modules.extractor.ExtractorHTML.addLinkFromString(ExtractorHTML.java:663)
	at org.archive.modules.extractor.ExtractorHTML.processEmbed(ExtractorHTML.java:695)
	at org.archive.modules.extractor.ExtractorHTML.processGeneralTag(ExtractorHTML.java:459)
	at org.archive.modules.extractor.ExtractorHTML.extract(ExtractorHTML.java:855)

It doesn't return them as extracted links because of the exception though.
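
One way to avoid both the oversized URI and the log noise would be to reject inline data before it ever reaches addRelativeToBase. The following is only a sketch under assumptions: the class and method names (DataUriFilter, looksLikeDataUri, shouldResolve) are hypothetical and not part of Heritrix, and this is not the change made in any pull request. The `%22` in the logged URI suggests the extracted value began with a literal double quote, so the check skips leading quotes and whitespace before looking for the `data:` scheme. The 2083 limit is taken from the "uuri > 2083" message in the log above.

```java
final class DataUriFilter {
    // Limit copied from the "Created (escaped) uuri > 2083" log message (assumption:
    // applying it to the raw value is a conservative pre-check, not the real rule).
    static final int MAX_URI_LENGTH = 2083;

    private DataUriFilter() {}

    /** Returns true if the raw attribute value looks like an inline data: URI. */
    static boolean looksLikeDataUri(String raw) {
        if (raw == null) {
            return false;
        }
        int i = 0;
        // Skip leading whitespace and stray quote characters left by the extractor.
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (Character.isWhitespace(c) || c == '"' || c == '\'') {
                i++;
            } else {
                break;
            }
        }
        // Case-insensitive comparison of the next five characters with "data:".
        return raw.regionMatches(true, i, "data:", 0, 5);
    }

    /** Returns true if the value is worth resolving against the base URI. */
    static boolean shouldResolve(String raw) {
        return raw != null
                && !looksLikeDataUri(raw)
                && raw.length() <= MAX_URI_LENGTH;
    }
}
```

A caller would test `DataUriFilter.shouldResolve(value)` before resolving the value against the base URI and silently skip it otherwise, so nothing is logged for inline images.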

csrster · 2019-12-10T09:05:58Z

Hi again,
It seems like we agree that there's a bug here. IIRC, our problem wasn't so much Heritrix queueing these URLs as the Heritrix error logs becoming enormous. So we would still be interested in seeing our pull request accepted.
cheers!
Colin

ato · 2019-12-10T09:18:20Z

Yep. Do you mean there's already a pull request for this? I couldn't find it. Could you link it?

csrster · 2019-12-10T09:21:29Z

Digging through our issue history in our private Jira I found this comment:

2018-10-10 07:06:00.376 INFO thread-62 org.archive.modules.deciderules.MatchesListRegexDecideRule.evaluate() Timeout matching regex '.*[a-zA-Z0-9\W-]+\.dk.*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\/[a-zA-Z0-9\W-]{3,})(?=\/).*(\1(?=\/).*\2(?=\/).*\3(?=\/)|\1(?=\/).*\3(?=\/).*\2(?=\/)|\2(?=\/).*\1(?=\/).*\3(?=\/)|\2(?=\/).*\3(?=\/).*\1(?=\/)|\3(?=\/).*\2(?=\/).*\1(?=\/)|\3(?=\/).*\1(?=\/).*\2(?=\/)).*' to url 'http://ryd-lortet.dk/%22data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAEnCAYAAACHcBUB ...

So what happened here is that the giant URL was constructed from the inline data. We have actually modified MatchesListRegexDecideRule to include a timeout on the regex matching, and logging from the modified rule shows that matching the giant URL against our hideous regex was giving us extra problems on top of the error-log inflation. I think that must mean that at least some of these inline URLs get past the validityCheck.

We'll be submitting a separate pull request for the timeout on the decide rule real soon now.

cheers again!
Colin
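
For readers wondering how a timeout on regex matching can be done in Java at all, here is a minimal sketch of one common technique: wrap the input in a CharSequence whose charAt checks a deadline, so a runaway match aborts with an exception. This is an illustration only, not the modification described above; the names TimedRegex, DeadlineCharSequence, RegexTimeoutException and matchesWithin are all hypothetical.

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

final class TimedRegex {
    private TimedRegex() {}

    /** Thrown from charAt once the deadline has passed. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException() {
            super("regex match exceeded its time budget");
        }
    }

    /** CharSequence view that refuses to hand out characters after the deadline. */
    private static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The matcher calls charAt for every character it inspects, so
            // checking here bounds the time spent backtracking.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new RegexTimeoutException();
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /**
     * Returns Optional.of(result) if the whole-input match finished in time,
     * or Optional.empty() if the time budget ran out first.
     */
    static Optional<Boolean> matchesWithin(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            return Optional.of(pattern.matcher(new DeadlineCharSequence(input, deadline)).matches());
        } catch (RegexTimeoutException e) {
            return Optional.empty();
        }
    }
}
```

A decide rule built on this idea could treat an empty result as "no match" and log the pattern and a truncated URL, rather than blocking a worker thread on a multi-kilobyte input.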

csrster · 2019-12-10T09:23:24Z

You must be right, Alex - I thought we'd actually made a pull request, but now I see it was only a bug report. Give me a minute!
Colin

ato added the bug label Mar 17, 2019

csrster mentioned this issue Dec 10, 2019

Attempt to filter out embedded images. #288

Merged

ato closed this as completed Nov 24, 2024
