Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publicsuffixes2 #9

Merged
merged 3 commits into from
Apr 20, 2012
Merged

Publicsuffixes2 #9

merged 3 commits into from
Apr 20, 2012

Conversation

kngenie
Copy link
Member

@kngenie kngenie commented Apr 18, 2012

reimplementation of PublicSuffixes with radix tree.

@gojomo
Copy link
Member

gojomo commented Apr 18, 2012

A bit confused: in a quick look at PublicSuffixes2, it seems it's still building a big regex string and then Pattern in order to do the key operations. Is that the case? I would have thought those the main memory-consumers.

Where does this get its memory savings, and what's the magnitude of the savings?

Separate comments:

  • there's a similar function in the Google Guava libraries we're now including; we could move to that, though one downside is we'd be subject to their (sometimes slow) schedule of updating from the public suffixes list
  • should add own handle as '@author' or '@contributor' in Javadoc

@kngenie
Copy link
Member Author

kngenie commented Apr 19, 2012

sorry, probably pull request description is misleading. javadoc comment in PublicSuffixes may be too short.

Yes, it still uses the regular expression as old PublicSuffixes did. It was the fastest path to address the problem I found (described in https://webarchive.jira.com/browse/HER-1965).

I added a comment to HER-1965 comparing regular expressions generated by old and new PublicSuffixes. In short, old regular expression has 14,197 (?: )'s, and new regex has 1,386. This results in ~90% smaller Matcher object, and apparently faster matching operation (not a rigorous benchmark, but I saw ~4x improvement). Also pattern generation must be taking less time and memory, but such one-time saving is not a big deal.

It may be possible to implement even more efficient PublicSuffixes leveraging this radix tree approach, but I'm wondering how much effort would be necessary to beat the Java's (supposedly) well-optimized regular expression implementation.

For use of Google Guava library, we've just found a case against it recently: https://webarchive.jira.com/browse/HER-2004

new PublicSuffixes has my name at the bottom of class-level javadoc comment. should it be in different format ("handle"?)

nlevitt added a commit that referenced this pull request Apr 20, 2012
@nlevitt nlevitt merged commit 3fed078 into internetarchive:master Apr 20, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants