Heritrix treats inline images as relative URLs #214
Comments
I've been attempting to create an |
Bump @csrster any more details available? |
Sorry Andy, I'm not sure we can find time to dig it up and try to reproduce it right now.
cheers,
Colin
--
Colin Rosenthal PhD
Senior IT Consultant
Royal Danish Library (Aarhus)
On Thursday, March 14, 2019 4:53 PM, Andy Jackson wrote:
> Bump @csrster any more details available?
|
Adding the following to ExtractorHtmlTest:

    public void test() throws IOException {
        // Archived copy of the page that contains the inline data: images
        String url = "http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt";
        CrawlURI curi = new CrawlURI(UURIFactory.getInstance(url));
        // Fetch the page body and run the HTML extractor over it
        String content = IOUtils.toString(new URL(url).openStream());
        getExtractor().extract(curi, content);
        // Print the extracted outlinks in sorted order
        CrawlURI[] links = curi.getOutLinks().toArray(new CrawlURI[0]);
        Arrays.sort(links);
        for (CrawlURI link : links) {
            System.out.println(link.getURI());
        }
    }

Yields a lot of log errors like this one:
It doesn't return them as extracted links because of the exception though. |
Hi again, |
Yep. Do you mean there's already a pull request for this? I couldn't find it. Could you link it? |
Digging through our issue history in our private Jira I found this comment:
So what happened here is that the giant URL was constructed from the inline data. We have actually modified MatchesListRegexDecideRule to include a timeout on the regex matching, and logging from the modified rule shows that matching the giant URL against our hideous regex was giving us extra problems on top of the error-log inflation. I think that must mean that at least some of these inline URLs get past the validityCheck. We'll be submitting a separate pull request for the timeout on the decide rule real soon now. cheers again! |
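For readers wondering what a timeout on regex matching can look like in plain Java, here is a minimal sketch of one common technique: wrapping the input in a CharSequence that checks a deadline on every character access. It is not the modified MatchesListRegexDecideRule mentioned above; the class names, the `matchesWithin` helper, and the millisecond parameter are all hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix.
class TimedRegex {
    // Thrown by the wrapper when the deadline has passed.
    static final class MatchTimeoutException extends RuntimeException {
        MatchTimeoutException() { super("regex match timed out"); }
    }

    // CharSequence that aborts matching once the deadline is exceeded.
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override public char charAt(int index) {
            // The regex engine reads characters through charAt, so this
            // check runs repeatedly while it backtracks.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new MatchTimeoutException();
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override public String toString() { return inner.toString(); }
    }

    // Returns true only if the whole input matches before the timeout;
    // a timed-out match is reported as "no match" (an assumed policy).
    static boolean matchesWithin(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            return pattern.matcher(new DeadlineCharSequence(input, deadline)).matches();
        } catch (MatchTimeoutException timedOut) {
            return false;
        }
    }
}
```

Treating a timeout as a non-match is only one possible choice; a decide rule might equally want to log the URL or reject it outright.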
You must be right Alex - I thought we'd actually made a pull request, but now I see it was only a bug report. Give me a minute! |
For example, on the page http://haggmark.dk/solgt/oversigt there is an element:
<div class="ejendom rammebaggrund"><a href="http://haggmark.dk/sag/13009"><div class="solgtlabel"><img src="http://haggmark.dk/foto/SolgtLabel
 " style="width:70px;height:70px; border: none;"></div><img class="foto" src="data:image/png;base64,/9j/4AAQSkZJRgABAQEAYABg ...
Heritrix constructs an enormous URL by concatenating the base64 encoded data as if it were a relative path.
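To illustrate the distinction, here is a minimal sketch of a guard that skips `data:` attribute values before any relative resolution is attempted. It uses only `java.net.URI`; the class and method names are hypothetical, and this is not Heritrix's extractor code or its eventual fix.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Heritrix.
class InlineDataCheck {
    // True if the attribute value is an inline data: URI (scheme compared
    // case-insensitively, ignoring surrounding whitespace).
    static boolean isInlineData(String attributeValue) {
        String v = attributeValue.trim();
        return v.regionMatches(true, 0, "data:", 0, 5);
    }

    // Resolves a link against the page URL, or returns null for inline
    // data, which carries its content in the attribute and is not a link.
    static String resolveOrSkip(String base, String attributeValue)
            throws URISyntaxException {
        if (isInlineData(attributeValue)) {
            return null;
        }
        return new URI(base).resolve(attributeValue.trim()).toString();
    }
}
```

With the page above as base, `resolveOrSkip` would return null for the `img` element's `src` value, while the `href` of the enclosing anchor would resolve as usual.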