Feature/kskip-ngram #82
base: main
Conversation
Some more information to provide some background... Features:
An overview of how
Let me know if you require further clarification or information. Happy to help and take feedback 😃
Thanks for working on this! A very superficial review below.
I'm unsure if this is desirable or the function should join the items into a string to output `Vec<String>`
I'm also unsure about it. On one side, with joined tokens it would be easier to use them later. E.g. when counting tokens, whether it's unigrams or skip-grams the code would work identically. On the other, it would probably have some performance cost: with separate tokens one is able to use `&str`, while joining would need to create new strings for the aggregated tokens. It would also make it difficult to recover tokens from the ngram (scikit-learn/scikit-learn#13733). I think the current approach is fine.
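To make the trade-off concrete, here is a minimal sketch of the two candidate output shapes; the function names are illustrative, not the vtext API:

```rust
// Sketch of the two output shapes discussed above (names are
// illustrative, not the actual vtext API).
fn bigrams<'a>(tokens: &[&'a str]) -> Vec<Vec<&'a str>> {
    // Separate tokens: the &str items are reused without new allocations,
    // and the original tokens stay recoverable from each gram.
    tokens.windows(2).map(|w| w.to_vec()).collect()
}

fn bigrams_joined(tokens: &[&str]) -> Vec<String> {
    // Joined tokens: uniform to count alongside unigrams, but each gram
    // allocates a new String and the token boundaries are lost.
    tokens.windows(2).map(|w| w.join(" ")).collect()
}

fn main() {
    let toks = ["One", "Two", "Three"];
    assert_eq!(bigrams(&toks), vec![vec!["One", "Two"], vec!["Two", "Three"]]);
    assert_eq!(bigrams_joined(&toks), vec!["One Two", "Two Three"]);
    println!("ok");
}
```

The joined variant is what a downstream counter would find easiest to consume, at the cost of the extra allocations and the lost token boundaries noted above.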
I wanted the struct to consume the input iterator only once and provide an output. This however considerably complicated the code.
Do we know the performance impact of making multiple passes over the iterator, e.g. with everygrams? I understand the issue with multiple passes over the data, as then one would have disk I/O cost, but for a single document the cost wouldn't likely be too much? Also, realistically, when using this estimator in a pipeline with a tokenizer, if the run time of this step is much smaller than the tokenizer's, optimizing it too much might not be very noticeable.
It's very nice work, I'm just a bit concerned about the complexity of the implementation for future maintenance.
src/ngram_utils/mod.rs
```rust
/// assert_eq!(grams, vec![vec!["One", "Two", "Three"], vec!["Two", "Three", "Four"]]);
/// ```
pub fn new_trigram() -> KSkipNGrams {
    KSkipNGramsParams::new(3, 3, 0).build()
```
Do you think we need these methods? For instance, scikit-learn has only an `ngram_range` parameter that corresponds to the first two numbers, and users seem to find it fine. The additional `max_k` parameter would be only a small complexity on top of it, in terms of API.
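As a sketch of this point, a single (min_n, max_n, max_k) triple subsumes all the convenience constructors; the names below mirror the PR but are hypothetical, not the real API:

```rust
// Hypothetical parameter struct showing how one (min_n, max_n, max_k)
// triple covers every convenience constructor; a sketch, not the vtext API.
#[derive(Debug, PartialEq)]
struct KSkipNGramsParams {
    min_n: usize, // smallest gram size (ngram_range.0 in scikit-learn terms)
    max_n: usize, // largest gram size (ngram_range.1)
    max_k: usize, // maximum skip distance; 0 means plain n-grams
}

impl KSkipNGramsParams {
    fn new(min_n: usize, max_n: usize, max_k: usize) -> Self {
        Self { min_n, max_n, max_k }
    }
    // Each convenience constructor is just a fixed parameter choice:
    fn bigram() -> Self { Self::new(2, 2, 0) }
    fn trigram() -> Self { Self::new(3, 3, 0) }
    fn everygram(max_n: usize) -> Self { Self::new(1, max_n, 0) }
    fn skipgram(n: usize, k: usize) -> Self { Self::new(n, n, k) }
}

fn main() {
    assert_eq!(KSkipNGramsParams::bigram(), KSkipNGramsParams::new(2, 2, 0));
    assert_eq!(KSkipNGramsParams::trigram(), KSkipNGramsParams::new(3, 3, 0));
    assert_eq!(KSkipNGramsParams::everygram(4), KSkipNGramsParams::new(1, 4, 0));
    assert_eq!(KSkipNGramsParams::skipgram(2, 2), KSkipNGramsParams::new(2, 2, 2));
    println!("ok");
}
```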
I added them as NLTK had them as convenience methods, but happy to remove them. They are useful for demonstrating the different examples.
Fair point. I will implement a simpler version and then we can compare performance. You would also lose the ordering of tokens, but this isn't such a big problem.
I have changed the approach and it now chains ngram or skipgram iterators together. Overall it's much simpler to understand. The only issue is that you lose the ordering of the tokens, as said before, but that's not such a big issue. I did some simple benchmarking (not included in this pull request) and the performance is actually 2x faster, which was surprising as the input iterator is split using
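The chained-iterator scheme described above can be sketched roughly as follows (assumed structure, not the actual vtext code): build one n-gram iterator per gram size and chain them, so everygrams come out grouped by size rather than in document order.

```rust
// Rough sketch of the chained approach; names are illustrative.
fn ngrams<'a>(tokens: &'a [&'a str], n: usize) -> impl Iterator<Item = Vec<&'a str>> + 'a {
    tokens.windows(n).map(|w| w.to_vec())
}

fn everygrams<'a>(
    tokens: &'a [&'a str],
    min_n: usize,
    max_n: usize,
) -> impl Iterator<Item = Vec<&'a str>> + 'a {
    // flat_map over the gram sizes is equivalent to chaining the
    // per-size iterators one after another.
    (min_n..=max_n).flat_map(move |n| ngrams(tokens, n))
}

fn main() {
    let toks = ["a", "b", "c"];
    let grams: Vec<_> = everygrams(&toks, 1, 2).collect();
    // 3 unigrams first, then 2 bigrams: grouped by size, not document order.
    assert_eq!(grams.len(), 5);
    assert_eq!(grams[0], vec!["a"]);
    assert_eq!(grams[3], vec!["a", "b"]);
    println!("ok");
}
```

This makes the ordering trade-off visible: all unigrams are emitted before any bigram, which is the loss of token ordering mentioned in the comment.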
I have added the Python API. Currently it takes a list and outputs a list. Do we want the Python interface to take an iterator as input and produce an iterator?
Thanks a lot @joshlk ! I'll review in more detail tomorrow, but after a superficial glance this version looks significantly less complex.
I think a list is fine. That's what we are doing for tokenizers. If we change that later we should do it everywhere, but I suspect that lists won't have much performance impact and are easier to manipulate. There is just a linting issue in the CI.
Thank you for the thorough work @joshlk ! A very minor comment, otherwise LGTM. Would you mind also running `cargo clippy` and checking that there are no new warnings in the code? We should probably add `clippy` to CI.
I merged one minor suggestion in the test and then had to fix issues it introduced, sorry :) There are indeed a few clippy errors that it would be good to fix.
Otherwise we can merge, unless you wanted to add other things?
Though in terms of API it's maybe not ideal that if there are not enough tokens, e.g. for

Also I'm actually somewhat confused as to why
```
# Testing 19924 documents
vtext: everygram: 133.95s [0.7 MB/s, 54 kWPS]
nltk: everygram: 5.06s [18.0 MB/s, 1419 kWPS]
vtext: skipgram: 498.58s [0.2 MB/s, 14 kWPS]
nltk: skipgram: 18.01s [5.1 MB/s, 399 kWPS]
```
Agreed. I have changed it so that empty input or input that is smaller than
Because
I went through the
Do we want to change

Also, do we want to include the convenience methods that are included in NLTK in Python and Rust? (e.g. bigram, everygram ...)
I have created some benchmarks in Python and the vtext functions are currently considerably slower than the nltk equivalents. Not sure why this is the case at this point. Maybe to do with the conversion of strings from rust to python. This was the output:
Are you sure you ran
Good point. Let's keep it for now and consider that in a follow up PR as this one is already getting quite large.
Well, NLTK doesn't really have a homogeneous and consistent API across modules, as far as I can tell, so I would be careful not to adapt it too closely. But we certainly need to think about how to make all these parts work smoothly.
Hmm, yes I can confirm with

Another unrelated thing I was wondering about: wouldn't it be faster and easier to map tokens to an index at the beginning, pass those
Implemented k-skip-n-grams. Has convenience functions for bigram, trigram, ngrams, everygrams and skipgrams.
Provides the same output as the equivalent nltk functions, although nltk does sometimes generate duplicates, which are omitted here. The iterator consumes the input iterator only once and holds a window of items to generate the grams. The window is stepped forward as it consumes the input. It also correctly generates left or right padding if specified.
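The single-pass, window-stepping scheme described above can be sketched like this (a simplified illustration under my own naming, not the PR's implementation; padding is omitted):

```rust
use std::collections::VecDeque;

// Illustrative single-pass n-gram iterator: holds a window of the last
// `n` items and steps it forward as the inner iterator is consumed,
// emitting one gram per step.
struct NGramIter<I: Iterator> {
    inner: I,
    n: usize,
    window: VecDeque<I::Item>,
}

impl<I> Iterator for NGramIter<I>
where
    I: Iterator,
    I::Item: Clone,
{
    type Item = Vec<I::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // Fill the window up to n items; stop when the input runs out.
        while self.window.len() < self.n {
            self.window.push_back(self.inner.next()?);
        }
        let gram: Vec<_> = self.window.iter().cloned().collect();
        // Step the window forward by one token.
        self.window.pop_front();
        Some(gram)
    }
}

fn main() {
    let it = NGramIter {
        inner: ["One", "Two", "Three", "Four"].into_iter(),
        n: 3,
        window: VecDeque::new(),
    };
    let grams: Vec<_> = it.collect();
    assert_eq!(
        grams,
        vec![vec!["One", "Two", "Three"], vec!["Two", "Three", "Four"]]
    );
    println!("ok");
}
```

Because the window buffers only `n` items, the input is traversed exactly once, which is the property the description above is after; supporting multiple gram sizes and skips in the same single pass is what complicates the real implementation.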
Currently the iterator outputs `Vec<Vec<&str>>`. I'm unsure if this is desirable or the function should join the items into a string to output `Vec<String>`. E.g. currently it does `vec![vec!["One", "Two"], vec!["Two", "Three"] ...]` but it might be desirable to have `vec!["One Two", "Two Three"]`.

The nltk implementations tend to consume the input sequence multiple times and concatenate the output (for example everygram). I wanted the struct to consume the input iterator only once and provide an output. This however considerably complicated the code. I have tried to refactor it so it is as readable as possible, but it would be good to get a second eye on it.
There is also still the question of how to chain the different components together (which links to #21). Currently the `transform` method takes an input iterator and provides an iterator as output which can be consumed by the user.

Todo:
- Add function for character ngrams