You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The width of a feature --- as opposed to frequency --- is the number of unique entries it occurs with. E.g the feature mod:red occurs with the entry bus 3 times and apple 7 times, it's frequency is 10 but it's width is 2.
Due to the inverted-index implementation of the all-pairs algorithms, filtering features with a width greater than some threshold, bounds the quadratic element of the algorithm to the threshold. This should have a dramatic impact on time complexity. It may also work as a statistical stop-word filter.
Implementation will require a number of substantial changes:
For consistency the width filtering should also be applicable to entries. In addition min width, and max frequency filters should be added. This will result in the following addition parameters:
filter.entry.width.min
filter.entry.width.max
filter.feature.width.min
filter.feature.width.max
filter.entry.freq.max
filter.feature.freq.max
Resulting in a total (count based) filtering set:
filter.entry.width.min (Default: 2)
filter.entry.width.max (Default: +Infinity)
filter.feature.width.min (Default: 2)
filter.feature.width.max (Default: +Infinity)
filter.entry.freq.min (Default: 2)
filter.entry.freq.max (Default: +Infinity)
filter.feature.freq.min (Default: 2)
filter.feature.freq.max (Default: +Infinity)
filter.event.freq.min (Default: 2)
filter.event.freq.max (Default: +Infinity)
The width counts must be re-introduced during entries and features counting stage. A non-trivial task since the counts can not simply be summed like frequencies can.
The text was updated successfully, but these errors were encountered:
The width of a feature --- as opposed to frequency --- is the number of unique entries it occurs with. E.g the feature
mod:red
occurs with the entrybus
3 times andapple
7 times, it's frequency is 10 but it's width is 2.Due to the inverted-index implementation of the all-pairs algorithms, filtering features with a width greater than some threshold, bounds the quadratic element of the algorithm to the threshold. This should have a dramatic impact on time complexity. It may also work as a statistical stop-word filter.
This technique is discussed in:
Rychly and Kilgarriff (2007) An efficient algorithm for building a distributional thesaurus. Proc ACL. Prague, Czech Republic. http://kilgarriff.co.uk/Publications/2007-RychlyKilg-ACL-thesauruses.pdf
Implementation will require a number of substantial changes:
For consistency the width filtering should also be applicable to entries. In addition min width, and max frequency filters should be added. This will result in the following addition parameters:
filter.entry.width.min
filter.entry.width.max
filter.feature.width.min
filter.feature.width.max
filter.entry.freq.max
filter.feature.freq.max
Resulting in a total (count based) filtering set:
filter.entry.width.min
(Default: 2)filter.entry.width.max
(Default: +Infinity)filter.feature.width.min
(Default: 2)filter.feature.width.max
(Default: +Infinity)filter.entry.freq.min
(Default: 2)filter.entry.freq.max
(Default: +Infinity)filter.feature.freq.min
(Default: 2)filter.feature.freq.max
(Default: +Infinity)filter.event.freq.min
(Default: 2)filter.event.freq.max
(Default: +Infinity)The width counts must be re-introduced during entries and features counting stage. A non-trivial task since the counts can not simply be summed like frequencies can.
The text was updated successfully, but these errors were encountered: