-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate CSV reader to pylibcudf #16011
Migrate CSV reader to pylibcudf #16011
Conversation
8093233
to
989a21e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some small suggestions for improvement, but I don't need to review this again so feel free to go ahead with this PR once you feel you've addressed my comments sufficiently.
elif ( | ||
cudf.api.types.is_scalar(dtype) or | ||
isinstance(dtype, ( | ||
np.dtype, pd.api.extensions.ExtensionDtype, type | ||
)) | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could handle the scalar case up front by wrapping it in a list to keep things simpler. Then you have new_dtypes
as a list in the list branch and it's a dict in the mapping branch of this conditional.
c_na_values.reserve(len(na_values)) | ||
for nv in na_values: | ||
if not isinstance(nv, str): | ||
raise TypeError("na_values must be a list of str!") | ||
c_na_values.push_back(nv.encode()) | ||
options.set_na_values(c_na_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type of code is repeated a lot in this function's parsing of inputs. It feels like a helper function along the lines of
cdef vector[T] vec_from_iterable(vec, test, error_msg)
could help. OTOH maybe that's overengineering since you'll still have to write predicate functions and error messages each time. Maybe give it a shot once and see what it looks like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kinda did something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small suggestions
for dtype in dtypes: | ||
if not isinstance(dtype, DataType): | ||
raise TypeError("If passing list to read_csv, " | ||
"all elements must be of type `DataType`!") | ||
c_dtypes_list.push_back((<DataType>dtype).c_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be happy with a simpler structure of:
options.set_dtypes([(<DataType?>dtype).c_obj for dtype in dtypes])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work since you can't put C objects in a list comprehension.
from cudf._lib.pylibcudf.types cimport DataType | ||
|
||
|
||
cpdef TableWithMetadata read_csv( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: (best as an issue) in a followup, expose the csv_reader_options_builder
object. So that we don't have to use this nightmare signature :)
|
||
cpdef TableWithMetadata read_csv( | ||
SourceInfo source_info, | ||
compression_type compression = compression_type.AUTO, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compression_type compression = compression_type.AUTO, | |
*, | |
compression_type compression = compression_type.AUTO, |
People should not be allowed to call this function with positional arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work with cpdef functions.
Lets punt on this for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be cpdef
? I am willing to accept a slight calling cost overhead to avoid inevitable argument order issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, put it as def for now.
We should try to make this consistent in the future, though.
(Either put everything as def, or deprecate and remove a bunch of the parameters in read_csv and turn this back into cpdef)
Co-authored-by: Lawrence Mitchell <lmitchell@nvidia.com> Co-authored-by: Vyas Ramasubramani <vyasr@nvidia.com>
Self-merging to keep progress moving forward. As usual, happy to address further comments in a followup. |
/merge |
Description
xref #15162
Checklist