A new WatchedCDXSource. #184

Closed
wants to merge 3 commits into from

anjackson
Member

This new CDX Source watches for updates.

It requires Java 7.

Created by @PsypherPunk

@kris-sigur
Member

As it requires Java 7 it is a candidate for 2.1.0 and not 2.0.1. @johnerikhalse wasn't there going to be a separate branch for 2.1.0 development?

@anjackson anjackson added this to the 2.1.0 Release milestone Nov 5, 2014
@PsypherPunk
Contributor

I haven't quite finished with this; I still need to add deletion functionality and appropriate tests.

@ibnesayeed
Contributor

Does it also account for multiple collections with different access point configurations? How will it decide which access point will take care of newly added CDXes? I asked about a way to automatically map the CDX files with collections and their respective access points on the mailing list a while ago, but that seems a manual process with bulky config files.

@PsypherPunk
Contributor

As far as I'm aware, a CDX doesn't decide to which AccessPoint it belongs, the AccessPoint defines its collection.

The WatchedCDXSource is just a more dynamic CompositeSearchResultSource tied to a specific folder. There's no reason, however, you can't define several collections, each watching a different folder, each referenced by a specific AccessPoint.

@ibnesayeed
Contributor

@PsypherPunk you are right: a CDX doesn't decide which access point it belongs to, and a CDX can be used by more than one access point. But as far as I know, CDX files are of no use unless an access point explicitly includes them, and as far as I am aware, access points don't accept wildcards. Are you suggesting that an access point can be set to account for all the CDX files in a directory without explicitly naming each of them? If so, I am going to make some serious changes to my deployment, and I would love to see the relevant documentation if there is any.

@PsypherPunk
Contributor

That's the point of this new WatchedCDXSource: you give it the name of a directory and it automatically adds/removes CDXs as they appear/disappear. It also includes already-existing files at startup.

You're quite right though, there's no documentation; most of it came from a discussion on the mailing list which spawned this.

Presuming you've built with the WatchedCDXSource included, modify your CDXCollection.xml, replacing the existing resourceIndex's source as follows:

<property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
        <property name="canonicalizer" ref="waybackCanonicalizer" />
        <property name="source">
            <bean class="org.archive.wayback.resourceindex.WatchedCDXSource">
                <property name="path" value="${wayback.basedir}/cdx-index/" />
            </bean>
        </property>
        <property name="maxRecords" value="10000" />
        <property name="dedupeRecords" value="true" />    
    </bean>
</property>

The path is just a directory; there's no explicit reference to any CDX files, just their parent directory.
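
For readers wondering how a source like this can track a folder: the thread does not show the implementation, but the Java 7 requirement suggests it builds on java.nio.file.WatchService, which was introduced in Java 7. The following is a minimal sketch under that assumption. The class DirectoryWatchSketch and its methods are hypothetical names made up for illustration and are not part of OpenWayback; it only tracks file names rather than opening them as CDX indexes.

```java
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
import static java.nio.file.StandardWatchEventKinds.OVERFLOW;

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; NOT the actual WatchedCDXSource implementation. */
final class DirectoryWatchSketch implements AutoCloseable {
    private final Path dir;
    private final WatchService watcher;
    // Names of the files currently treated as index sources.
    private final Set<String> sources = new ConcurrentSkipListSet<String>();

    DirectoryWatchSketch(Path dir) throws IOException {
        this.dir = dir;
        this.watcher = dir.getFileSystem().newWatchService();
        // Register before scanning so files created during the scan are not missed.
        dir.register(watcher, ENTRY_CREATE, ENTRY_DELETE);
        // Pick up files that already exist at startup.
        try (DirectoryStream<Path> existing = Files.newDirectoryStream(dir)) {
            for (Path p : existing) {
                if (Files.isRegularFile(p)) {
                    sources.add(p.getFileName().toString());
                }
            }
        }
    }

    /** Waits up to timeoutMillis for one batch of events; returns true if one was applied. */
    boolean pollOnce(long timeoutMillis) throws InterruptedException {
        WatchKey key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        if (key == null) {
            return false;
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            WatchEvent.Kind<?> kind = event.kind();
            if (kind == OVERFLOW) {
                continue; // events were lost; a real source would rescan the directory
            }
            String name = event.context().toString(); // path relative to the watched directory
            if (kind == ENTRY_CREATE && Files.isRegularFile(dir.resolve(name))) {
                sources.add(name);
            } else if (kind == ENTRY_DELETE) {
                sources.remove(name);
            }
        }
        key.reset(); // re-arm the key so further events are delivered
        return true;
    }

    Set<String> sources() {
        return Collections.unmodifiableSet(sources);
    }

    @Override
    public void close() throws IOException {
        watcher.close();
    }
}
```

A caller would typically run pollOnce in a loop on a background thread. Like the behaviour described in this comment, the sketch accepts every regular file in the directory without filtering.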

@kris-sigur
Member

Question, does this blindly assume that anything under path is a CDX? Or does it employ some kind of filter?

@PsypherPunk
Contributor

At the moment, yes, intentionally so. It uses the existing CDXIndex class which doesn't have any such filtering in place.

However, a configurable filter could be added if that's a requirement...?

@ibnesayeed
Contributor

Thanks @PsypherPunk, I am hoping to see it released sooner rather than later. This will partially solve the problem discussed on the mailing list at https://groups.google.com/forum/#!topic/openwayback-dev/oGp2fWWUauQ

Does it watch the directory recursively? If so, it would be very easy to have a root directory containing collection-specific directories, each holding the CDX files of the relevant collection. An access point set to watch the root directory could then serve as /all, while collection-specific access points watch only their own directories.
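
On the recursion question, one relevant detail of the standard Java API: a java.nio.file.WatchService registration covers a single directory only, so watching a tree portably means registering every sub-directory separately. Assuming the source is built on WatchService (the thread does not confirm this), a minimal sketch of that registration could look like the following. RecursiveRegistrationSketch and registerTree are hypothetical names, not OpenWayback code.

```java
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of registering a whole directory tree with one WatchService. */
final class RecursiveRegistrationSketch {

    /**
     * Registers root and every directory below it.
     * Returns a map from each WatchKey to the directory it watches, which is
     * needed later to turn an event's relative path into a full path.
     */
    static Map<WatchKey, Path> registerTree(Path root, final WatchService watcher)
            throws IOException {
        final Map<WatchKey, Path> keys = new HashMap<WatchKey, Path>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                keys.put(dir.register(watcher, ENTRY_CREATE, ENTRY_DELETE), dir);
                return FileVisitResult.CONTINUE;
            }
        });
        return keys;
    }
}
```

With a layout such as root/collA and root/collB, an /all access point would register the whole tree while a collection-specific one would register only its own directory. Directories created after startup would also need to be registered when their ENTRY_CREATE event arrives, which this sketch leaves out.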

@kris-sigur
Member

Thanks @PsypherPunk.

Personally, I'd like to see the following configuration options added:

  • Filter - Probably a regex defining which filenames to consider as CDXs. Default to something like .*\.cdx$
  • Recursive - Boolean, whether or not to traverse sub-directories. This feature makes the filter, potentially, even more useful. Default to false.
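
To make the two proposed options concrete, here is a rough sketch of how a startup scan might honour them. Neither option exists in this pull request; CdxFileFilterSketch, findCdxFiles and DEFAULT_FILTER are hypothetical names chosen for illustration, and the default pattern is the one suggested above.

```java
import java.io.IOException;
import java.nio.file.FileVisitOption;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the proposed "filter" and "recursive" options. */
final class CdxFileFilterSketch {
    /** Default suggested in the thread: file names ending in .cdx. */
    static final String DEFAULT_FILTER = ".*\\.cdx$";

    /** Lists regular files under root whose file name matches filterRegex. */
    static List<Path> findCdxFiles(Path root, String filterRegex, boolean recursive)
            throws IOException {
        final Pattern pattern = Pattern.compile(filterRegex);
        final List<Path> matches = new ArrayList<Path>();
        // Depth 1 visits only the direct children of root.
        int maxDepth = recursive ? Integer.MAX_VALUE : 1;
        Files.walkFileTree(root, EnumSet.noneOf(FileVisitOption.class), maxDepth,
                new SimpleFileVisitor<Path>() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                        // At maxDepth, directories are also passed to visitFile, so check the type.
                        if (attrs.isRegularFile()
                                && pattern.matcher(file.getFileName().toString()).matches()) {
                            matches.add(file);
                        }
                        return FileVisitResult.CONTINUE;
                    }
                });
        Collections.sort(matches);
        return matches;
    }
}
```

The pattern is applied to the file name only, not the full path, so a sub-directory's name does not affect whether its files match. Files.walkFileTree is used rather than Files.walk so the sketch stays within the Java 7 API.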

@PsypherPunk
Contributor

@kris-sigur, can do: however, that might have to take place on a separate pull request as I've no idea what the hell I've done with this one (note the unknown repository at the top of this one).

Any objection to merging this in its current form and I'll pick up the aforementioned as an enhancement?

@kris-sigur kris-sigur closed this Feb 17, 2015
@kris-sigur
Member

@PsypherPunk Apparently, I managed to accidentally click the wrong button, closing the pull request and since the underlying repository is gone I can't reopen it. My apologies. You mind resubmitting this as a fresh PR?

@PsypherPunk
Contributor

@kris-sigur, will do; I'll try not to nuke the repo this time.

kris-sigur pushed a commit that referenced this pull request Feb 17, 2015
WatchedCDXSource; replacement for #184.