Configuring ZipNumCluster
MohammedElsayyed edited this page Dec 24, 2014 · 7 revisions
OpenWayback Advanced configuration
For information on the ZipNum format, see http://aaron.blog.archive.org/2013/05/28/zipnum-and-cdx-cluster-merging/
Enable and edit CDXCollection.xml as follows:
<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
    <property name="canonicalizer" ref="waybackCanonicalizer" />
    <property name="source">
      <bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
        <property name="cluster">
          <bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
            <property name="summaryFile" value="/<PATH-TO-SUMMARYFILE>" />
            <property name="locFile" value="/<PATH-TO-LOCFILE>" />
          </bean>
        </property>
        <property name="params">
          <bean class="org.archive.format.gzip.zipnum.ZipNumParams" />
        </property>
      </bean>
    </property>
    <property name="maxRecords" value="100000" />
    <property name="dedupeRecords" value="true" />
  </bean>
</property>
Summary file format
The summary file consists of four tab-separated columns, in this order:
- The first line of each chunk
- Chunk name (or shard name)
- Offset: the starting byte-offset of the chunk
- Length: the length of the chunk
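The four columns above can be read with a few lines of code. The following is a minimal sketch, not part of OpenWayback: the `SummaryEntry` type, the `parse_summary_line` function, and the sample line are all hypothetical and only illustrate the column layout described above.

```python
from typing import NamedTuple


class SummaryEntry(NamedTuple):
    """One row of a summary file (hypothetical helper type)."""
    first_line: str   # first line of the chunk
    chunk_name: str   # chunk (shard) name
    offset: int       # starting byte offset of the chunk
    length: int       # length of the chunk in bytes


def parse_summary_line(line: str) -> SummaryEntry:
    # Split from the right so that the last three tab-separated fields are
    # always chunk name, offset, and length, even if the first column
    # itself happens to contain a tab.
    fields = line.rstrip("\r\n").rsplit("\t", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}")
    first_line, chunk_name, offset, length = fields
    return SummaryEntry(first_line, chunk_name, int(offset), int(length))


# Made-up sample row, for illustration only.
sample = "org,example)/ 20140101000000\tpart-00000\t0\t1024\n"
entry = parse_summary_line(sample)
print(entry.chunk_name, entry.offset, entry.length)
```

Converting the offset and length to integers early makes malformed rows fail at parse time rather than later, when the values are used to locate a chunk.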
Loc file format
The loc file consists of two tab-separated columns, in this order:
- Chunk name (or shard name)
- Chunk URL: e.g. hdfs://url or http://url
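As an illustration of the two-column layout, here is a minimal sketch that loads loc file lines into a mapping from chunk name to chunk URL. The `parse_loc_lines` function and the sample hosts and paths are hypothetical; they are not taken from OpenWayback and assume exactly one URL per chunk, as described above.

```python
from typing import Dict, Iterable


def parse_loc_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each chunk (shard) name to its URL (hypothetical helper)."""
    locations: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {len(fields)}")
        chunk_name, chunk_url = fields
        locations[chunk_name] = chunk_url
    return locations


# Made-up sample rows, for illustration only.
sample = [
    "part-00000\thttp://example.org/cdx/part-00000.gz\n",
    "part-00001\thdfs://namenode.example.org/cdx/part-00001.gz\n",
]
locations = parse_loc_lines(sample)
print(locations["part-00000"])
```

The chunk name is the join key between the two files: a row in the summary file names a chunk, and the loc file says where that chunk can be fetched.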
For more information on how to generate the summary file using Hadoop, see the link at the top of this page.
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git