Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3738: [C++] Parse ISO8601-like timestamps in CSV columns #2952

Closed
wants to merge 1 commit into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Nov 13, 2018

Second granularity is allowed (we might want to add support for fractions of seconds, e.g. in the "YYYY-MM-DD[T ]hh:mm:ss.ssssss" format).

Timestamp conversion also participates in CSV type inference, since it's unlikely to produce false positives (e.g. a semantically "string" column that would be entirely made of valid timestamp strings).

@@ -285,9 +287,9 @@ class day {
explicit CONSTCD11 day(unsigned d) NOEXCEPT;

CONSTCD14 day& operator++() NOEXCEPT;
CONSTCD14 day operator++(int) NOEXCEPT;
CONSTCD14 day operator++(int) NOEXCEPT;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like clang-format ran over this file...

@pitrou pitrou force-pushed the ARROW-3738-csv-timestamps branch 2 times, most recently from 7c0ef4f to 84071f7 Compare November 13, 2018 19:48
@codecov-io
Copy link

codecov-io commented Nov 13, 2018

Codecov Report

Merging #2952 into master will increase coverage by 0.81%.
The diff coverage is 96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2952      +/-   ##
==========================================
+ Coverage   86.65%   87.46%   +0.81%     
==========================================
  Files         493      422      -71     
  Lines       69675    63953    -5722     
==========================================
- Hits        60375    55939    -4436     
+ Misses       9204     8014    -1190     
+ Partials       96        0      -96
Impacted Files Coverage Δ
cpp/src/arrow/util/date.h 90.32% <ø> (ø)
cpp/src/gandiva/to_date_holder.cc 2.08% <ø> (ø) ⬆️
cpp/src/arrow/csv/csv-converter-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/csv/column-builder.cc 97.4% <100%> (+0.08%) ⬆️
cpp/src/arrow/csv/csv-column-builder-test.cc 100% <100%> (ø) ⬆️
cpp/src/arrow/util/parsing-util-test.cc 99.53% <100%> (+0.19%) ⬆️
cpp/src/arrow/csv/converter.cc 95.33% <100%> (+0.5%) ⬆️
cpp/src/arrow/test-util.h 74.45% <100%> (-23.55%) ⬇️
python/pyarrow/tests/test_csv.py 97.94% <100%> (+0.06%) ⬆️
cpp/src/arrow/util/parsing.h 95.6% <87.3%> (-4.4%) ⬇️
... and 117 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7cdab9b...005a6e3. Read the comment docs.

@pitrou pitrou force-pushed the ARROW-3738-csv-timestamps branch from 84071f7 to 904476c Compare November 14, 2018 20:09
@@ -220,6 +221,18 @@ def test_simple_nulls(self):
'e': [b"3", b"nan", b"\xff"],
}

def test_simple_timestamps(self):
# Infer a timestamp column
rows = b"a,b\n1970,1970-01-01\n1989,1989-07-14\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a date column and only a datetime column when it includes hours/minutes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. The original issue was about inferring timpestamp columns, though, and date-only timestamps are a valid kind of timestamps ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding a time to these so that we don't have a test that would break when we add date support?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we can fix the test by then. Right now those are inferred as timestamps, and that's what the test checks for.

@pitrou pitrou force-pushed the ARROW-3738-csv-timestamps branch from 4e64568 to 63b3c36 Compare November 21, 2018 13:19
@pitrou
Copy link
Member Author

pitrou commented Nov 21, 2018

Will merge soon if no CI fail.

@wesm
Copy link
Member

wesm commented Nov 21, 2018

I'd still like to have a look -- let me have a look this morning and will merge if no issues

@wesm
Copy link
Member

wesm commented Nov 22, 2018

Sorry to be a bit delayed -- week of Thanksgiving in the US in always a bit challenging. I will try to rebase this and give it a quick review before merging

Second granularity is allowed (we might want to add support for fractions
of seconds, e.g. in the "YYYY-MM-DD[T ]hh:mm:ss.ssssss" format).

Timestamp conversion also participates in CSV type inference, since it's
unlikely to produce false positives (e.g. a semantically "string" column
that would be entirely made of valid timestamp strings).
@wesm wesm force-pushed the ARROW-3738-csv-timestamps branch from 63b3c36 to 005a6e3 Compare November 22, 2018 05:14
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I rebased, so merge this once the build is passing again

@@ -351,6 +358,121 @@ class StringConverter<Int32Type> : public StringToSignedIntConverterMixin<Int32T
template <>
class StringConverter<Int64Type> : public StringToSignedIntConverterMixin<Int64Type> {};

template <>
class StringConverter<TimestampType> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a sense of performance of this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea. Hopefully it shouldn't be too slow...

@wesm
Copy link
Member

wesm commented Nov 22, 2018

I opened https://issues.apache.org/jira/browse/ARROW-3853 about adding a cast implementation that uses this. We should also add benchmarks once we do that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants