Migrate NVText Tokenizing APIs to pylibcudf (#17100)
Part of #15162.

Authors:
- Matthew Murray (https://github.com/Matt711)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #17100
Showing 11 changed files with 476 additions and 122 deletions.
@@ -12,3 +12,4 @@ nvtext
     normalize
     replace
     stemmer
+    tokenize
docs/cudf/source/user_guide/api_docs/pylibcudf/nvtext/tokenize.rst: 6 changes (6 additions, 0 deletions)
@@ -0,0 +1,6 @@
+========
+tokenize
+========
+
+.. automodule:: pylibcudf.nvtext.tokenize
+   :members:
@@ -0,0 +1,31 @@
+# Copyright (c) 2024, NVIDIA CORPORATION.
+
+from libcpp.memory cimport unique_ptr
+from pylibcudf.column cimport Column
+from pylibcudf.libcudf.nvtext.tokenize cimport tokenize_vocabulary
+from pylibcudf.libcudf.types cimport size_type
+from pylibcudf.scalar cimport Scalar
+
+
+cdef class TokenizeVocabulary:
+    cdef unique_ptr[tokenize_vocabulary] c_obj
+
+cpdef Column tokenize_scalar(Column input, Scalar delimiter=*)
+
+cpdef Column tokenize_column(Column input, Column delimiters)
+
+cpdef Column count_tokens_scalar(Column input, Scalar delimiter=*)
+
+cpdef Column count_tokens_column(Column input, Column delimiters)
+
+cpdef Column character_tokenize(Column input)
+
+cpdef Column detokenize(Column input, Column row_indices, Scalar separator=*)
+
+cpdef TokenizeVocabulary load_vocabulary(Column input)
+
+cpdef Column tokenize_with_vocabulary(
+    Column input,
+    TokenizeVocabulary vocabulary,
+    Scalar delimiter,
+    size_type default_id=*
+)