Add maximum feature width threshold #76

Open
hamishmorgan opened this issue Jul 3, 2012 · 0 comments

The width of a feature, as opposed to its frequency, is the number of unique entries it occurs with. For example, if the feature mod:red occurs with the entry bus 3 times and with apple 7 times, its frequency is 10 but its width is 2.
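To make the distinction concrete, here is a minimal sketch that computes both counts from a list of events. The `(entry, feature, count)` triple layout and the function name `feature_frequency_and_width` are assumptions made for illustration; they are not taken from the project's code.

```python
from collections import defaultdict


def feature_frequency_and_width(events):
    """Return (frequency, width) dicts keyed by feature.

    `events` is assumed to be an iterable of (entry, feature, count) triples.
    """
    frequency = defaultdict(int)
    entries_seen = defaultdict(set)
    for entry, feature, count in events:
        frequency[feature] += count        # frequency: total occurrences
        entries_seen[feature].add(entry)   # width: distinct entries only
    width = {feature: len(entries) for feature, entries in entries_seen.items()}
    return dict(frequency), width


events = [("bus", "mod:red", 3), ("apple", "mod:red", 7)]
frequency, width = feature_frequency_and_width(events)
print(frequency["mod:red"], width["mod:red"])  # 10 2
```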

Because the all-pairs algorithms are implemented over an inverted index, discarding features whose width exceeds some threshold bounds the quadratic element of the algorithm by that threshold: each feature generates candidate entry pairs in proportion to the square of its width. This should have a dramatic impact on running time. It may also work as a statistical stop-word filter.
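The effect on the pair enumeration can be sketched as follows. This is an illustrative toy, not the project's all-pairs implementation: the `candidate_pairs` function, the `max_feature_width` argument, and the event layout are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(events, max_feature_width=float("inf")):
    """Enumerate entry pairs that share at least one retained feature."""
    # Inverted index: feature -> set of entries it occurs with.
    index = defaultdict(set)
    for entry, feature, _count in events:
        index[feature].add(entry)

    pairs = set()
    for feature, entries in index.items():
        if len(entries) > max_feature_width:
            continue  # too wide: skip its quadratic number of pairs
        pairs.update(combinations(sorted(entries), 2))
    return pairs


events = [
    ("bus", "mod:red", 3), ("apple", "mod:red", 7),
    ("bus", "det:the", 5), ("apple", "det:the", 2),
    ("car", "det:the", 9), ("dog", "det:the", 4),
]
print(len(candidate_pairs(events)))                   # 6
print(candidate_pairs(events, max_feature_width=2))   # {('apple', 'bus')}
```

A feature of width w contributes w(w-1)/2 pairs, so dropping the single width-4 feature here removes most of the work while the narrower, more informative feature is kept.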

This technique is discussed in:

Rychly and Kilgarriff (2007) An efficient algorithm for building a distributional thesaurus. Proc ACL. Prague, Czech Republic. http://kilgarriff.co.uk/Publications/2007-RychlyKilg-ACL-thesauruses.pdf

Implementation will require a number of substantial changes:

For consistency, width filtering should also be applicable to entries. In addition, minimum-width and maximum-frequency filters should be added. This will result in the following additional parameters:

  • filter.entry.width.min
  • filter.entry.width.max
  • filter.feature.width.min
  • filter.feature.width.max
  • filter.entry.freq.max
  • filter.feature.freq.max

This gives the following complete set of count-based filters:

  • filter.entry.width.min (Default: 2)
  • filter.entry.width.max (Default: +Infinity)
  • filter.feature.width.min (Default: 2)
  • filter.feature.width.max (Default: +Infinity)
  • filter.entry.freq.min (Default: 2)
  • filter.entry.freq.max (Default: +Infinity)
  • filter.feature.freq.min (Default: 2)
  • filter.feature.freq.max (Default: +Infinity)
  • filter.event.freq.min (Default: 2)
  • filter.event.freq.max (Default: +Infinity)
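One way the parameters above could be applied is sketched below. Only the parameter names and defaults come from this issue; the `DEFAULT_FILTERS` table, the `within_bounds` helper, and the choice of inclusive bounds are assumptions for illustration.

```python
import math

# Parameter names and defaults as listed in this issue.
DEFAULT_FILTERS = {
    "filter.entry.width.min": 2,   "filter.entry.width.max": math.inf,
    "filter.feature.width.min": 2, "filter.feature.width.max": math.inf,
    "filter.entry.freq.min": 2,    "filter.entry.freq.max": math.inf,
    "filter.feature.freq.min": 2,  "filter.feature.freq.max": math.inf,
    "filter.event.freq.min": 2,    "filter.event.freq.max": math.inf,
}


def within_bounds(value, kind, measure, overrides=None):
    """True if `value` passes the min/max filter for e.g. ("feature", "width").

    Bounds are treated as inclusive, which is an assumption.
    """
    params = {**DEFAULT_FILTERS, **(overrides or {})}
    low = params[f"filter.{kind}.{measure}.min"]
    high = params[f"filter.{kind}.{measure}.max"]
    return low <= value <= high


print(within_bounds(10, "feature", "freq"))   # True
print(within_bounds(1, "feature", "width"))   # False
print(within_bounds(5000, "feature", "width",
                    {"filter.feature.width.max": 1000}))  # False
```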

The width counts must be re-introduced during the entry and feature counting stages. This is a non-trivial task, since width counts cannot simply be summed the way frequencies can.
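The difficulty can be seen in a small sketch. It assumes, purely for illustration, that counts are accumulated from partial results (for example, separate chunks of the input); the `merge_partial_counts` helper and the `(frequency, entries)` layout are hypothetical.

```python
def merge_partial_counts(a, b):
    """Merge two partial count tables.

    Each table maps feature -> (frequency, set of entries seen with it).
    """
    merged = {}
    for feature in a.keys() | b.keys():
        freq_a, entries_a = a.get(feature, (0, set()))
        freq_b, entries_b = b.get(feature, (0, set()))
        # Frequencies add; widths need the union of the entry sets.
        merged[feature] = (freq_a + freq_b, entries_a | entries_b)
    return merged


chunk_1 = {"mod:red": (5, {"bus", "apple"})}
chunk_2 = {"mod:red": (5, {"apple"})}

merged_frequency, merged_entries = merge_partial_counts(chunk_1, chunk_2)["mod:red"]
naive_width = len(chunk_1["mod:red"][1]) + len(chunk_2["mod:red"][1])
true_width = len(merged_entries)
print(merged_frequency, naive_width, true_width)  # 10 3 2
```

Summing the per-chunk widths counts apple twice, so some record of which entries were seen has to survive until the partial results are combined.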
