NUTCH-2522 Bidirectional URL exemption filter #290
base: master
Clean PR with correct code format and documentation! Thanks, @okedoki!
Afaics, the implementation does the following:
- take the lowercased host part of both the from and the to URL
- match all regex rules defined in the rules file against each host and remove the matched part
- finally, if the from and to hosts are equal, return true => the URL is accepted ("exempted" from ignore external host exclusion)
Is this correct?
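For reference, the three steps above can be sketched as follows. This is only a minimal illustration of the behaviour as described in this comment, not the code from the PR: the class name `HostExemptionSketch`, the method names, and the sample rule `^www\.` are all hypothetical.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of the described behaviour; names are not taken from the PR.
class HostExemptionSketch {

  // Step 1 and 2: lowercase the host part, then strip every rule match from it.
  static String reduceHost(String url, List<Pattern> rules) {
    String host = URI.create(url).getHost();
    if (host == null) {
      return ""; // no host component, nothing to compare
    }
    host = host.toLowerCase(Locale.ROOT);
    for (Pattern rule : rules) {
      host = rule.matcher(host).replaceAll("");
    }
    return host;
  }

  // Step 3: the link is exempted when both reduced hosts are equal.
  static boolean isExempted(String fromUrl, String toUrl, List<Pattern> rules) {
    return reduceHost(fromUrl, rules).equals(reduceHost(toUrl, rules));
  }

  public static void main(String[] args) {
    // Assumed example rule: strip a leading "www." from the host.
    List<Pattern> rules = List.of(Pattern.compile("^www\\."));
    System.out.println(isExempted("http://www.example.org/a", "http://EXAMPLE.org/b", rules));
    System.out.println(isExempted("http://www.example.org/a", "http://other.org/b", rules));
  }
}
```

With the assumed rule, `www.example.org` and `EXAMPLE.org` both reduce to `example.org`, so the first link would be exempted and the second would not.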
Wouldn't a different rule file format be more suitable?
- the leading +/- is not used
- I don't know whether this makes sense, but the rule could also define the replacement string, possibly including references to captured groups, cf. the file format used by urlnormalizer-regex
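To illustrate the second point: a rule that pairs a pattern with a replacement string, in the spirit of the pattern/substitution pairs used by urlnormalizer-regex, might behave like the sketch below. The class `HostRewriteRuleSketch` and the sample rule are hypothetical and only show how a captured-group reference such as `$1` would work with `java.util.regex`.

```java
import java.util.regex.Pattern;

// Hypothetical rule type: a pattern plus a replacement that may reference groups.
class HostRewriteRuleSketch {
  final Pattern pattern;
  final String substitution;

  HostRewriteRuleSketch(String regex, String substitution) {
    this.pattern = Pattern.compile(regex);
    this.substitution = substitution;
  }

  // Replace every match in the host; "$1" refers to the first captured group.
  String apply(String host) {
    return pattern.matcher(host).replaceAll(substitution);
  }

  public static void main(String[] args) {
    // Assumed example: drop a leading "www." or "m." and keep the rest of the host.
    HostRewriteRuleSketch rule = new HostRewriteRuleSketch("^(?:www|m)\\.(.+)$", "$1");
    System.out.println(rule.apply("www.example.org"));
    System.out.println(rule.apply("m.example.org"));
    System.out.println(rule.apply("example.org"));
  }
}
```

Under this assumed rule, all three hosts would reduce to `example.org`, so links between them could be compared after rewriting rather than after plain deletion of the match.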
@@ -0,0 +1,33 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
Configuration files should be added as *.template; they are "instantiated" (copied) during the first compilation. Users can then modify the content without conflicts or undesired overwrites.
# Example 1:
#----------
# To exempt urls ending with image extensions, uncomment the below line
Description does not fit the following line/rule.
# Format :
#--------
# The format is same same as `regex-urlfilter.txt`.
# Each non-comment, non-blank line contains a regular expression
The description does not match the implementation.
# Example 1:
#----------
# To exempt urls ending with image extensions, uncomment the below line
-(www.)
Why does the rule start with `+-`? The dot is not escaped, so the rule would also apply to `wwwfinder`, catching `wwwf`.
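The effect of the unescaped dot can be checked with plain `java.util.regex`. The sketch below is only a demonstration: the class `UnescapedDotSketch`, the helper `firstMatch`, and the sample hosts are hypothetical, and the anchored pattern `^www\.` is one possible fix, not necessarily the one the PR should adopt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of what an unescaped dot matches in a host name.
class UnescapedDotSketch {

  // Return the first substring matched by the regex, or null if there is none.
  static String firstMatch(String regex, String input) {
    Matcher m = Pattern.compile(regex).matcher(input);
    return m.find() ? m.group() : null;
  }

  public static void main(String[] args) {
    // Unescaped dot: "." matches any character, so "wwwf" is caught.
    System.out.println(firstMatch("(www.)", "wwwfinder.example.org"));
    // Escaped dot: only a literal "www." matches, so nothing is caught here.
    System.out.println(firstMatch("(www\\.)", "wwwfinder.example.org"));
    // Escaped and anchored: matches the intended prefix only.
    System.out.println(firstMatch("^www\\.", "www.example.org"));
  }
}
```

Escaping the dot and anchoring the pattern at the start of the host keeps the rule from stripping characters out of unrelated host names.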
@sebastian-nagel The usage is correct: at the moment we apply the same regex to both the input and the output URL and check whether the results match each other. In the future this could be improved with two separate regexes for input and output.