ARROW-3738: [C++] Parse ISO8601-like timestamps in CSV columns #2952
@@ -285,9 +287,9 @@ class day {
explicit CONSTCD11 day(unsigned d) NOEXCEPT;

CONSTCD14 day& operator++() NOEXCEPT;
CONSTCD14 day operator++(int) NOEXCEPT;
CONSTCD14 day operator++(int) NOEXCEPT;
Looks like clang-format ran over this file...
7c0ef4f to 84071f7
Codecov Report
@@ Coverage Diff @@
## master #2952 +/- ##
==========================================
+ Coverage 86.65% 87.46% +0.81%
==========================================
Files 493 422 -71
Lines 69675 63953 -5722
==========================================
- Hits 60375 55939 -4436
+ Misses 9204 8014 -1190
+ Partials 96 0 -96
84071f7 to 904476c
@@ -220,6 +221,18 @@ def test_simple_nulls(self):
'e': [b"3", b"nan", b"\xff"],
}

def test_simple_timestamps(self):
# Infer a timestamp column
rows = b"a,b\n1970,1970-01-01\n1989,1989-07-14\n"
Shouldn't this be a date column and only a datetime column when it includes hours/minutes?
Perhaps. The original issue was about inferring timestamp columns, though, and date-only timestamps are a valid kind of timestamp ;-)
What about adding a time to these so that we don't have a test that would break when we add date support?
Well, we can fix the test by then. Right now those are inferred as timestamps, and that's what the test checks for.
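To make the trade-off in this thread concrete, here is a small standalone sketch. It is not Arrow's inference code: `classify`, `column_b_kinds`, and the `rows_with_time` data are hypothetical names and values introduced only for illustration, and the "date versus timestamp" rule is the reviewer's proposed distinction, not current behavior. It shows that date-only cells would change category if date support were added, while cells carrying a time would not.

```python
from datetime import datetime


def classify(value):
    """Classify one CSV cell under a hypothetical date/timestamp split."""
    candidates = (
        ("%Y-%m-%d", "date"),                 # date only
        ("%Y-%m-%d %H:%M:%S", "timestamp"),   # date plus time of day
    )
    for fmt, kind in candidates:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return kind
    return None


def column_b_kinds(rows):
    """Return the set of kinds found in the second column, skipping the header."""
    lines = rows.decode("ascii").splitlines()[1:]
    return {classify(line.split(",")[1]) for line in lines}


# The data from the test under review (date-only values).
rows_date_only = b"a,b\n1970,1970-01-01\n1989,1989-07-14\n"
# Hypothetical variant with a time component, as suggested in the thread.
rows_with_time = b"a,b\n1970,1970-01-01 00:00:00\n1989,1989-07-14 01:00:00\n"

print(column_b_kinds(rows_date_only))
print(column_b_kinds(rows_with_time))
```

Under this assumed rule, only the second data set stays in the timestamp category regardless of whether a separate date type exists, which is the reviewer's argument for adding times to the test rows.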
4e64568 to 63b3c36
Will merge soon if CI doesn't fail.
I'd still like to have a look. Let me review it this morning, and I will merge if there are no issues.
Sorry to be a bit delayed; the week of Thanksgiving in the US is always a bit challenging. I will try to rebase this and give it a quick review before merging.
Second granularity is allowed (we might want to add support for fractions of seconds, e.g. in the "YYYY-MM-DD[T ]hh:mm:ss.ssssss" format). Timestamp conversion also participates in CSV type inference, since it's unlikely to produce false positives (e.g. a semantically "string" column that would be entirely made of valid timestamp strings).
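The description above can be illustrated with a minimal Python sketch. It is not the C++ `StringConverter<TimestampType>` from this PR: the function names `parse_timestamp` and `infer_column_type`, the exact set of accepted strings (date only, or date followed by `T` or a space and `hh:mm:ss`), and the order of candidate types are all assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta

_EPOCH = datetime(1970, 1, 1)

# Assumed grammar: YYYY-MM-DD, optionally followed by 'T' or ' ' and hh:mm:ss.
_TIMESTAMP_RE = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"
    r"(?:[T ]([0-9]{2}):([0-9]{2}):([0-9]{2}))?"
)
_INT_RE = re.compile(r"-?[0-9]+")


def parse_timestamp(text):
    """Return whole seconds since the Unix epoch, or None if text is not valid."""
    match = _TIMESTAMP_RE.fullmatch(text)
    if match is None:
        return None
    # Missing time fields default to zero (date-only input means midnight).
    fields = [int(g) if g is not None else 0 for g in match.groups()]
    try:
        value = datetime(*fields)  # rejects e.g. month 13 or February 30
    except ValueError:
        return None
    return (value - _EPOCH) // timedelta(seconds=1)


def infer_column_type(values):
    """Pick the first candidate type that every value in the column satisfies."""
    if all(_INT_RE.fullmatch(v) for v in values):
        return "int64"
    if all(parse_timestamp(v) is not None for v in values):
        return "timestamp[s]"
    return "string"


print(parse_timestamp("1989-07-14"))           # 616377600
print(infer_column_type(["1970", "1989"]))     # int64
print(infer_column_type(["1970-01-01", "x"]))  # string
```

Because every value in a column must match the strict pattern before the column is treated as timestamps, a single non-matching string makes the column fall back to string, which is why false positives are expected to be rare.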
63b3c36 to 005a6e3
+1. I rebased, so merge this once the build is passing again
@@ -351,6 +358,121 @@ class StringConverter<Int32Type> : public StringToSignedIntConverterMixin<Int32T
template <>
class StringConverter<Int64Type> : public StringToSignedIntConverterMixin<Int64Type> {};

template <>
class StringConverter<TimestampType> {
Do you have a sense of the performance of this?
No idea. Hopefully it shouldn't be too slow...
I opened https://issues.apache.org/jira/browse/ARROW-3853 about adding a cast implementation that uses this. We should also add benchmarks once we do that.