-
Notifications
You must be signed in to change notification settings - Fork 3.7k
ARROW-835: [Format][C++][Java] Create a new Duration type #3644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3644 +/- ##
==========================================
- Coverage 88.07% 88.06% -0.01%
==========================================
Files 773 774 +1
Lines 96887 97162 +275
Branches 1251 1251
==========================================
+ Hits 85329 85566 +237
- Misses 11322 11360 +38
Partials 236 236
Continue to review full report at Codecov.
|
java/vector/src/main/java/org/apache/arrow/vector/IntervalEpochMilliVector.java
Outdated
Show resolved
Hide resolved
@wesm could you look at the format changes to see if you are ok with them? |
Failure was NodeJs |
since this is a format change (backward compatible though), I'll wait for @wesm to approve too before merging. |
@pravindra thanks for the review |
format/Schema.fbs
Outdated
/// millisecond, but can be any of the other supported TimeUnit values as | ||
/// with Timestamp and Time types. This type is always represented as | ||
/// an 8-byte integer. | ||
table DurationInterval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking over the old e-mail threads it might be that TimeDelta or TimestampDelta is a better name for this type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you happy with "DurationInterval"? It could also be called simply "Duration" or "TimeDuration"
@wesm based on comments on the other JIRAs CLs it seems like it would be desirable to add an implementation for C++, I can add that to this CL but would you mind taking a look to see if you are ok with naming/modeling (I would prefer not to have to refactor just due to naming issues if possible). |
I believe the Python on OS/X error is spurious (it was green on previous build). |
Will work on reviewing the C++ side of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C++ side of this, and the changes to the Flatbuffers files, all looks good to me. Thanks @emkornfield for slogging through this and implementing C++, Java, and integration tests in one go!
One question I have is whether we want to commit to the name "DurationInterval" rather than alternatives such as "Duration" or "TimeDelta" or "TimeDuration". I don't think the names are sacred
Can someone (@siddharthteotia ?) carve out a little time to review the Java so we can try to get this merged soon?
{ class: "DurationIntervalMicro", javaType: "long", friendlyType: "Duration", | ||
arrowType: "org.apache.arrow.vector.types.pojo.ArrowType.DurationInterval"} | ||
{ class: "DurationIntervalNano", javaType: "long", friendlyType: "Duration", | ||
arrowType: "org.apache.arrow.vector.types.pojo.ArrowType.Duration"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's possible to collapse this into a single class? There was an effort to do this for Timestamp (with a TimeUnit
class) but it failed only because Dremio had taken on dependency with the TimestampUNIT classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially, I followed the pattern already established in the codebase, but I'm open to trying if it is desirable. @pravindra @siddharthteotia do you think this is desirable/possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds okay to me. Let's see what @siddharthteotia and @pravindra say.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, having a single class is definitely desirable. Single this one is starting on a fresh slate, it's worth a try.
@@ -650,6 +650,36 @@ class ARROW_EXPORT FixedSizeBinaryArray : public PrimitiveArray { | |||
int32_t byte_width_; | |||
}; | |||
|
|||
/// DayTimeArray | |||
/// --------------------- | |||
/// \brief Array of Day and Millisecond values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I'm a bit disappointed that we aren't deprecating this type =/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to get dragged into what it means deprecate this, should be easy enough to do if we want it. As noted below I tried to say it was optional which I seem to recall seeing this was ok on a previous ML thread.
cpp/src/arrow/array/builder_time.h
Outdated
using Time32Builder = NumericBuilder<Time32Type>; | ||
using Time64Builder = NumericBuilder<Time64Type>; | ||
using Date32Builder = NumericBuilder<Date32Type>; | ||
using Date64Builder = NumericBuilder<Date64Type>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this redundant with type_fwd.h
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it appears so, I'll push a commit removing them. It looks like there might be more redundancy in builder_primitive.h
cpp/src/arrow/builder.cc
Outdated
@@ -111,6 +112,18 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr<DataType>& type, | |||
DictionaryBuilderCase visitor = {pool, dict_type, out}; | |||
return VisitTypeInline(*dict_type.dictionary()->type(), &visitor); | |||
} | |||
case Type::INTERVAL: { | |||
const auto& interval_type = dynamic_cast<const IntervalType&>(*type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think static_cast
(or checked_cast
) is sufficient here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
std::shared_ptr<DataType> ARROW_EXPORT day_time_interval(); | ||
|
||
/// \brief Return an MonthIntervalType instance | ||
std::shared_ptr<DataType> ARROW_EXPORT month_interval(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: as far as naming I could see an argument for using
interval_duration
interval_day_time
interval_month
but I'm OK with the way things are here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see this to, but I'm happier without the reverse notation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note "duration" conflicts with some of our time classes so I renamed to "duration_type", hopefully that is ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"some of our time classes"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to look back I think it was in the vendored time library we were using. I can look deeper if you don't think there should be a conflict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking again they shouldn't conflict, I will try to fix this today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed. had to touch some of the vendored code.
// 4-byte integers. | ||
// DAY_TIME - Indicates the number of elapsed days and milliseconds, | ||
// stored as 2 contiguous 32-bit integers (8-bytes in total). Support | ||
// of this IntervalUnit is not required for full arrow compatibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. If it isn't being deprecated then it's reasonable to implement it in C++ and have integration tests
format/Schema.fbs
Outdated
/// millisecond, but can be any of the other supported TimeUnit values as | ||
/// with Timestamp and Time types. This type is always represented as | ||
/// an 8-byte integer. | ||
table DurationInterval { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you happy with "DurationInterval"? It could also be called simply "Duration" or "TimeDuration"
@wesm Thanks for the review I'll try to address comments in the next few days. In terms of naming, I don't know the etiquette for changing the name once it gets approved on the ML. If was going to change it, I think Duration would be my choice. @pravindra already reviewed most of the java code, the main changes that happened where plumbing through support for the integration test. |
cpp/src/arrow/array/builder_dict.cc
Outdated
@@ -349,6 +355,9 @@ template class DictionaryBuilder<Date64Type>; | |||
template class DictionaryBuilder<Time32Type>; | |||
template class DictionaryBuilder<Time64Type>; | |||
template class DictionaryBuilder<TimestampType>; | |||
template class DictionaryBuilder<DurationIntervalType>; | |||
template class DictionaryBuilder<DayTimeIntervalType>; | |||
template class DictionaryBuilder<MonthIntervalType>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, they were removed.
@wesm I think I addressed all your feedback. The OS/X build seems to in an infinite CMake loop with something like
I'm not sure why my change would have caused this. Any thoughts? @pravindra I was able to consolidate all the java vectors into one, could you take a look? |
2913b09
to
773a2b0
Compare
Baffling |
It must have something to do with this being an older PR. Is Appveyor enabled on your fork? If not I can push this to my fork to get an Appveyor build going |
No appveyor isn't enabled on my fork. let me try squashing everything, see if that helps. |
and mark DayTimeInterval as deprecated. Provides implementation for Duration, DayTimeInterval and YearMonth interval in C++ and Java (for duration). Adds integration test.
Ahh, squash seems to have worked, lets hope things stay green. |
@emkornfield it looks you made some changes to vendored code which I think we should try to avoid doing in general, can we open a JIRA to restore to the vendored version and address the name conflict another way? I don't want this issue to block the patch since I need this to go in and then rebase #4316 on top |
Appveyor appears to be about ~6 hours backed up so I'm going to push a branch on my fork to try to get a quicker thumbs up to merge this |
Here's an Appveyor build starting now on my fork which will take a while to run https://ci.appveyor.com/project/wesm/arrow/builds/24577591 |
Thanks. https://issues.apache.org/jira/browse/ARROW-5346 opened to revert the changes. Right now the only solution seems like reordering includes in some cases or fixing upstream. I will have to think about this. |
Appveyor build looks OK, I'm going to wait a little while yet to make sure the MinGW build works before merging |
@wesm appveyor on your branch looks all green. |
Awesome. Thanks @emkornfield!! |
- Create a new DurationInterval type in format. - Implement all interval types in C++ and Java including integration test. Once this is checked in, I think https://issues.apache.org/jira/browse/ARROW-352 can be resolved as a won't fix? Author: Micah Kornfield <emkornfield@gmail.com> Closes apache#3644 from emkornfield/new_type and squashes the following commits: cc6d39c <Micah Kornfield> remove system clock daf8b4f <Micah Kornfield> duration_type->duration b064816 <Micah Kornfield> Introduce a new Duration type can that represent time deltas, and mark DayTimeInterval as deprecated.
- Create a new DurationInterval type in format. - Implement all interval types in C++ and Java including integration test. Once this is checked in, I think https://issues.apache.org/jira/browse/ARROW-352 can be resolved as a won't fix? Author: Micah Kornfield <emkornfield@gmail.com> Closes apache#3644 from emkornfield/new_type and squashes the following commits: cc6d39c <Micah Kornfield> remove system clock daf8b4f <Micah Kornfield> duration_type->duration b064816 <Micah Kornfield> Introduce a new Duration type can that represent time deltas, and mark DayTimeInterval as deprecated.
Once this is checked in, I think https://issues.apache.org/jira/browse/ARROW-352 can be resolved as a won't fix?