First, I added gson-2.8.6.jar and heritrix-contrib-3.4.0-20200304.jar to the lib/ directory and added them to my CLASSPATH.
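For anyone wanting to double-check that step, here is a minimal sketch that reports whether both jars are present in a given directory. The function name `check_jars` is made up for illustration, and the directory passed to it is an assumption (the `lib/` directory of wherever Heritrix is installed).

```shell
# check_jars DIR: report whether the two jars named in this post exist in DIR.
# Hypothetical helper; returns 0 if both are present, 1 otherwise.
check_jars() {
  dir="$1"
  missing=0
  for jar in gson-2.8.6.jar heritrix-contrib-3.4.0-20200304.jar; do
    if [ -f "$dir/$jar" ]; then
      echo "found: $jar"
    else
      echo "missing: $jar"
      missing=1
    fi
  done
  return "$missing"
}

# Example (path is an assumption; adjust to the actual install location):
# check_jars ./lib
```

This only confirms the files exist; it does not prove that the running JVM actually loaded them from the CLASSPATH.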
Here is my configuration:
Youtube-dl Package:
[root@archive:/zarchive/heritrix/jobs/test.tkrn.io/latest]#apt list | grep youtube-dl
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
youtube-dl/stable,now 2019.01.17-1.1 all [installed]
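Having the package installed does not by itself show that the binary is visible to the process that needs it. A small sketch for checking that, assuming (this is not confirmed here) that the extractor finds `youtube-dl` through the PATH of the user running Heritrix; `check_tool` is a hypothetical helper name.

```shell
# check_tool NAME: succeed if NAME resolves on the current PATH.
# Hypothetical helper for illustration only.
check_tool() {
  if path=$(command -v "$1"); then
    echo "$1 -> $path"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# Example: run as the same user that starts Heritrix
# check_tool youtube-dl && youtube-dl --version
```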
Top Level Bean and Fetch Chain:
.....
<bean id="extractorCss" class="org.archive.modules.extractor.ExtractorCSS">
</bean>
<bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
</bean>
<bean id="extractorSwf" class="org.archive.modules.extractor.ExtractorSWF">
</bean>
<bean id="extractorYoutubeDL" class="org.archive.modules.extractor.ExtractorYoutubeDL">
</bean>
<!-- now, processors are assembled into ordered FetchChain bean -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
<property name="processors">
<list>
<!-- re-check scope, if so enabled... -->
<ref bean="preselector"/>
<!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... -->
<ref bean="preconditions"/>
<!-- ...fetch if DNS URI... -->
<ref bean="fetchDns"/>
<!-- <ref bean="fetchWhois"/> -->
<!-- ...fetch if HTTP URI... -->
<ref bean="fetchHttp"/>
<!-- ...extract outlinks from HTTP headers... -->
<ref bean="extractorHttp"/>
<!-- ...extract outlinks from HTML content... -->
<ref bean="extractorHtml"/>
<!-- ...extract outlinks from CSS content... -->
<ref bean="extractorCss"/>
<!-- ...extract outlinks from Javascript content... -->
<ref bean="extractorJs"/>
<!-- ...extract outlinks from Flash content... -->
<ref bean="extractorSwf"/>
<!-- ...extract outlinks from YouTube content... -->
<ref bean="extractorYoutubeDL"/>
</list>
</property>
</bean>
WARC Writer Chain:
<!-- now, processors are assembled into ordered DispositionChain bean -->
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
<property name="processors">
<list>
<!-- write to aggregate archival files... -->
<ref bean="warcWriter"/>
<ref bean="extractorYoutubeDL"/>
<!-- ...send each outlink candidate URI to CandidateChain,
and enqueue those ACCEPTed to the frontier... -->
<ref bean="candidates"/>
<!-- ...then update stats, shared-structures, frontier decisions -->
<ref bean="disposition"/>
<!-- <ref bean="rescheduler" /> -->
</list>
</property>
</bean>
This discussion was converted from issue #324 on September 30, 2022 00:45.
The extractYoutubeDL.log file is empty.
Am I missing anything? When I try to replay the WARC, the embedded YouTube video is not captured.