[FEA] Move text processing APIs and implementation out of cudf into a separate library or package #9555
Comments
There was discussion early in developing nvtext that it should probably live in cuML. Perhaps an NLP or text processing repository would be more useful.
I agree with the principle that sticking everything in […] I think a middle path might be to not put functions that we don't think belong to […]
Thanks, @VibhuJawa - that's a good point. I'd say that's a reasonable approach also.
Morpheus depends on the nvtext subword tokenizer, so it would be nice to have this in its own library (or just installed in cudf in a proper namespace) so we can use it in a supported fashion. In the meantime, we are probably going to have to copy-paste the relevant files into the Morpheus repo.
The long-term home for this functionality will be pylibcudf. While some bits of cuDF functionality may remain a superset of pandas, most functions that are "extra" (in the sense of not being part of the pandas API) will likely be removed from cuDF and be primarily accessible from pylibcudf. At that point, if we deem it necessary, we could create additional packages wrapping specific parts of pylibcudf functionality that don't fit in cuDF.
The StringMethods accessor in cuDF contains a few complex text processing APIs, including, e.g., ngram generation and subword tokenization. I think those APIs should live in a separate repo, for two reasons: […] StringMethods supports (capitalize is a much simpler operation than subword_tokenize).

There used to be a separate nvtext repository that got merged into cuDF. Perhaps we should consider splitting these functions out again into their own library, especially given that gpuCI is much easier to integrate into a new project these days than it historically was.
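
To make the capitalize vs. text-processing contrast a bit more concrete, here is a rough plain-Python sketch. It is not cuDF or nvtext code, and `capitalize_all` and `word_ngrams` are hypothetical helper names made up for illustration. It only shows that a method like capitalize maps each string to exactly one string, while something like ngram generation produces a data-dependent number of outputs and carries extra parameters.

```python
# Plain-Python illustration only; this is NOT how cuDF or nvtext implement these.
# Both function names below are hypothetical and exist only for this sketch.

def capitalize_all(strings):
    """Element-wise operation: exactly one output string per input string."""
    return [s.capitalize() for s in strings]


def word_ngrams(strings, n=2, delimiter=" ", separator="_"):
    """Split each string into tokens and join every run of n consecutive tokens.

    The number of results depends on the token count of each input,
    so the output length generally differs from the input length.
    """
    out = []
    for s in strings:
        tokens = s.split(delimiter)
        for i in range(len(tokens) - n + 1):
            out.append(separator.join(tokens[i:i + n]))
    return out


docs = ["this is the", "best book"]
caps = capitalize_all(docs)      # same length as docs
grams = word_ngrams(docs, n=2)   # length depends on the tokens in docs
```

Subword tokenization goes further still (vocabulary tables, truncation and padding settings), which is arguably the kind of surface that fits a dedicated text package better than a string accessor.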