SurtPrefixedDecideRule doesn't seem to work as expected #530
-
Heritrix version: 3.4.0-20210803. I have added my full configuration file here: https://gist.github.com/naveen17797/b119d62ef3c0a20e656cae5ea56d5821
For example, I have added reddit.com to the seeds list, but I find URLs from other domains in the crawl log, but since ...
Replies: 3 comments
-
The decide rules work such that the last rule that matches wins. The ACCEPT TransclusionDecideRule overrides the acceptSurts rule in order to fetch embedded resources, such as images, stylesheets, and JavaScript, that are needed to render the page. To disable this behaviour and follow acceptSurts strictly, even for embedded content, delete or comment out the TransclusionDecideRule:
<!-- ...but ACCEPT those more than a configured link-hop-count from start... -->
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <!-- <property name="maxTransHops" value="2" /> -->
  <!-- <property name="maxSpeculativeHops" value="1" /> -->
</bean>
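If you go the "comment out" route rather than deleting the bean, one thing to watch: XML comments cannot be nested, so wrapping the bean as-is (with its inner `<!-- ... -->` lines) would leave the configuration file no longer well-formed. A minimal sketch, assuming your bean looks like the one above, is to strip the inner comment markers first and then wrap the whole bean. The property values shown are simply the ones from that snippet, and the first comment line is only an illustrative note:

```xml
<!-- TransclusionDecideRule disabled so that acceptSurts is followed strictly.
<bean class="org.archive.modules.deciderules.TransclusionDecideRule">
  <property name="maxTransHops" value="2" />
  <property name="maxSpeculativeHops" value="1" />
</bean>
-->
```

With this rule disabled, embedded resources hosted on other domains (images, stylesheets, scripts) are likely to be left out of the crawl unless another rule accepts them, so captured pages may not render completely.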
-
More information about how crawl scope works and the individual rules is at: https://heritrix.readthedocs.io/en/stable/configuring-jobs.html#crawl-scope
-
Thanks @ato, this fixed the issue. I was under the impression that ...