Exclude URLs from a specific time range #212

Open
MohammedElsayyed opened this issue Feb 3, 2015 · 4 comments

Comments

@MohammedElsayyed
Member

As stated in wayback.xml, we can use the following configuration to block URLs from the ResourceIndex by creating a plain text file (e.g. "/tmp/exclude.txt") that contains URL prefixes:

<bean id="excluder-factory-static" class="org.archive.wayback.accesscontrol.staticmap.StaticMapExclusionFilterFactory">
  <property name="file" value="/tmp/exclude.txt" />
  <property name="checkInterval" value="600000" />
</bean>

Could we extend the exclusion file format so that each URL can optionally be followed by a start and end date? OpenWayback (ResourceIndex) would check whether a line carries a start and end date. If it does, only snapshots captured within that range are blocked; if it does not, the URL is blocked entirely, as it is today. The proposed 3-column exclusion file format is:

1st column: the URL prefix to block. (required)
2nd and 3rd columns: the start and end date, respectively. (optional)
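
To make the proposal concrete, here is a minimal Python sketch of how such a file could be parsed and evaluated. It is not OpenWayback code: the names `Rule`, `parse_line` and `is_blocked` are hypothetical, columns are assumed to be whitespace-separated, dates are assumed to be (possibly partial) 14-digit Wayback-style timestamps, and plain string prefix matching stands in for whatever URL canonicalization the real filter applies.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    prefix: str
    start: Optional[str]  # 14-digit timestamp, or None when no range is given
    end: Optional[str]


def parse_line(line: str) -> Optional[Rule]:
    """Parse one exclusion line: 'prefix' or 'prefix start end'."""
    fields = line.split()
    if not fields:
        return None  # skip blank lines
    if len(fields) == 1:
        return Rule(fields[0], None, None)
    if len(fields) == 3:
        # Pad partial dates so '2001'..'2002' covers both whole years.
        return Rule(fields[0], fields[1].ljust(14, "0"), fields[2].ljust(14, "9"))
    raise ValueError("expected 1 or 3 columns, got %d: %r" % (len(fields), line))


def is_blocked(rules: List[Rule], url: str, timestamp: str) -> bool:
    """Return True if a capture of url at timestamp matches any rule."""
    for rule in rules:
        if not url.startswith(rule.prefix):
            continue
        # No dates: block every capture. Dates: block only inside the range.
        if rule.start is None or rule.start <= timestamp <= rule.end:
            return True
    return False


lines = ["example.com/private", "example.org/news 2001 2002", ""]
rules = [rule for rule in map(parse_line, lines) if rule is not None]
```

With these rules, every capture under `example.com/private` is blocked, while captures under `example.org/news` are blocked only when their timestamp falls in 2001 or 2002.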

@MohammedElsayyed MohammedElsayyed changed the title Excluding URLs in a specific time range Excluding URLs from a specific time range Feb 3, 2015
@MohammedElsayyed MohammedElsayyed changed the title Excluding URLs from a specific time range Exclude URLs from a specific time range Feb 3, 2015
@ldko
Member

ldko commented Mar 11, 2015

To me it would make more sense for it to block snapshots for a range given in the exclusion file rather than show the snapshots in that range.

@MohammedElsayyed
Member Author

Thanks a lot for your review, Lauren. That is what I meant, and I have edited the description accordingly.

Another question: what if the range is not contiguous? For instance, exclude from 2001 to 2002 and from 2004 to 2006. Should we write two lines in the exclusion file for the same URL, each with its own start and end date?

@ldko
Member

ldko commented Mar 12, 2015

I like one range per line, so in your example, yes, two lines in the file for the different start and end dates for the same URL.
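
As a rough illustration of the one-range-per-line convention, the sketch below groups repeated lines for the same prefix into a list of ranges. The function names, the whitespace-separated layout and the year-only dates padded to 14-digit timestamps are all assumptions made for the example, not an existing OpenWayback format.

```python
from collections import defaultdict

# Two lines for the same prefix express a non-contiguous exclusion.
EXCLUSION_LINES = [
    "example.org/ 2001 2002",
    "example.org/ 2004 2006",
]


def load_ranges(lines):
    """Map each URL prefix to the list of (start, end) ranges given for it."""
    ranges = defaultdict(list)
    for line in lines:
        prefix, start, end = line.split()
        # Pad year-only dates to full 14-digit timestamp bounds.
        ranges[prefix].append((start.ljust(14, "0"), end.ljust(14, "9")))
    return ranges


def blocked(ranges, url, timestamp):
    """True if any range listed for a matching prefix contains timestamp."""
    return any(
        url.startswith(prefix) and any(s <= timestamp <= e for s, e in spans)
        for prefix, spans in ranges.items()
    )


ranges = load_ranges(EXCLUSION_LINES)
```

Here a capture from 2003 stays visible because it falls in the gap between the two lines, while captures from 2001 or 2005 are blocked.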

@westfood

Is there a way to make it work as an inclusion list as well? I.e., the inclusion file would contain the list of URLs that are accessible, and everything else in the index would be blocked. It is easier for us to create a list of allowed sites than a list of blocked ones.

Our use case for public access would be to allow (include) the range of sites we have contracts with and exclude a few specific URLs from that range, e.g. because of copyright violations. For onsite access from the library, we would like to allow everything and exclude only specific URLs, again for copyright violations and the like.

Regarding the optional dates for the inclusion file, it would be useful to be able to set more than one date range, i.e. allow a site for 2002-2008 and 2010-2015, with all other dates excluded.
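
One way to picture the combined behavior is an include list evaluated first, with an exclude list taking precedence over it. The Python sketch below is only an illustration under assumptions: the rule tuples, the function names and the use of an empty prefix as "allow everything" are hypothetical and do not correspond to any existing OpenWayback class.

```python
def matches(rules, url, timestamp):
    """rules: (prefix, start, end) tuples; start/end are 14-digit timestamps or None."""
    for prefix, start, end in rules:
        if url.startswith(prefix) and (start is None or start <= timestamp <= end):
            return True
    return False


def is_accessible(include, exclude, url, timestamp):
    """A capture is shown only if it is included and not excluded."""
    return matches(include, url, timestamp) and not matches(exclude, url, timestamp)


# Public access: allow a contracted site for 2002-2008 and 2010-2015 only.
PUBLIC_INCLUDE = [
    ("example.org/", "20020000000000", "20089999999999"),
    ("example.org/", "20100000000000", "20159999999999"),
]
# Specific URLs blocked regardless of date (e.g. copyright complaints).
EXCLUDE = [("example.org/infringing/", None, None)]
# Onsite access: the empty prefix matches every URL, so everything is allowed.
ONSITE_INCLUDE = [("", None, None)]
```

With these lists, a 2009 capture of `example.org/` is hidden from public access but visible onsite, and anything under `example.org/infringing/` is hidden in both settings.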

Maybe SURTs would be useful too.

I tried to use oracle access, but ended up with an error log I was unable to deal with. Our colleagues wrote a custom class for inclusion behavior years ago for Wayback 1.14/15. I am trying to build this old class to work with OpenWayback right now, but I am not sure I will succeed. In any case, it does not deal with dates or SURTs.

But a simple URL exclusion/inclusion class with at least one date range would be great for us.

@hhockx hhockx added this to the 3.0.0 Major Release milestone Jun 18, 2015