Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for Intervals #258

Merged
merged 12 commits into from
Jul 20, 2023
Merged

Add Support for Intervals #258

merged 12 commits into from
Jul 20, 2023

Conversation

WillAyd
Copy link
Contributor

@WillAyd WillAyd commented Jul 17, 2023

No description provided.

enum ArrowIntervalUnit unit;

/// \brief Pointer containing the underlying data for the interval
void* data;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly a union is preferable here to avoid an extra pointer indirection?

@codecov-commenter
Copy link

codecov-commenter commented Jul 17, 2023

Codecov Report

Merging #258 (9e37cc3) into main (a209bab) will decrease coverage by 0.03%.
The diff coverage is 84.84%.

@@            Coverage Diff             @@
##             main     #258      +/-   ##
==========================================
- Coverage   87.17%   87.15%   -0.03%     
==========================================
  Files          66       63       -3     
  Lines       10061     9753     -308     
==========================================
- Hits         8771     8500     -271     
+ Misses       1290     1253      -37     
Impacted Files Coverage Δ
src/nanoarrow/nanoarrow.h 100.00% <ø> (ø)
src/nanoarrow/array_inline.h 89.31% <80.76%> (-0.41%) ⬇️
src/nanoarrow/nanoarrow_types.h 93.10% <100.00%> (+0.34%) ⬆️

... and 4 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! I have one suggestion about simplifying the ArrowInterval but am excited to have support for this!

src/nanoarrow/array_test.cc Outdated Show resolved Hide resolved
src/nanoarrow/nanoarrow.h Outdated Show resolved Hide resolved
Comment on lines 675 to 686
/// \brief A representation of an interval.
struct ArrowInterval {
/// \brief The type of interval being used
enum ArrowIntervalUnit unit;

/// \brief Pointer containing the underlying data for the interval
union {
int32_t year_month;
struct ArrowIntervalDayTime day_time;
struct ArrowIntervalMonthDayNano month_day_nano;
} data;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ArrowDecimal we do something more along the lines of "declare more fields than needed since the caller probably stack-allocated one of these for whatever loop they're in". The definition here is OK too, but for the sake of argument:

struct ArrowInterval {
  enum ArrowType type;
  int32_t month;
  int32_t day;
  int64_t time;
  int64_t nano;
};

...might be simpler?

Copy link
Contributor Author

@WillAyd WillAyd Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time is supposed to represent milliseconds in that struct right? Not a strong preference but I think using the unions is a little easier as an end user. With the current design I was thinking an anonymous union would be the best, though that would require C11 instead of just C99

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the time being can we parallel the ArrowDecimal pattern (i.e., one flat struct + an ArrowIntervalInit() helper) for consistency? There are almost certainly better names than month/day/type/nano that perhaps give some hint as to how the spec defines them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea makes sense. Actually looking in CI I don't think Cython supports using a union within a struct, so this is probably the only way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with flattening to one struct I think we will still need to keep structs like ArrowIntervalDayTime and ArrowIntervalMonthDayNano to unpack the data buffer - is that in line with your expectation or am I misinterpreting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now I see reinterpret_cast<const struct ArrowIntervalMonthDayNano*> and friends. In #214 we opted to memcpy() instead of reinterpret_cast<> ( #214 (comment) ), although I'm aware we reinterpret_cast<> and make it easy for others to do so with the int/float/double primitives. I think the idea was to not continue that for anything more exotic than an int64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we still need something to memcpy() into though...I'm much more interested in having interval support than having the struct be exactly like ArrowDecimal (although ArrowIntervalUnit can probably just be ArrowType unless there's another distinction I'm missing?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch - simplified that

@WillAyd
Copy link
Contributor Author

WillAyd commented Jul 18, 2023

Any idea what I might be overlooking with the tests and how they allocate memory? Locally I get this valgrind traceback:

==710574==    at 0x4843828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==710574==    by 0x1C6BFD: ArrowMalloc (utils.c:194)
==710574==    by 0x1BF55A: ArrowArrayInitFromType (array.c:139)
==710574==    by 0x143D36: ArrayTest_ArrayTestAppendToIntervalArrayMonthDayNano_Test::TestBody() (array_test.cc:934)
==710574==    by 0x2048CA: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2607)
==710574==    by 0x1FC940: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2643)
==710574==    by 0x1D002F: testing::Test::Run() (gtest.cc:2682)
==710574==    by 0x1D0AFD: testing::TestInfo::Run() (gtest.cc:2861)
==710574==    by 0x1D1426: testing::TestSuite::Run() (gtest.cc:3015)
==710574==    by 0x1E13F4: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5855)
==710574==    by 0x2058C9: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2607)
==710574==    by 0x1FDAD2: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (gtest.cc:2643)

Not immediately clear what is different from the other tests in terms of building the array

@WillAyd
Copy link
Contributor Author

WillAyd commented Jul 18, 2023

Ignore comment above - I see that ImportArray takes ownership of the memory so probably have to implement that or explicitly call array.release

@paleolimbot
Copy link
Member

Yes, you'll need to call array.release(&array) at the end of all three tests you added. A better design would have been to use the C++ helpers (i.e., nanoarrow::UniqueArray and friends); however, I didn't have those when I started writing the tests.

@paleolimbot
Copy link
Member

The Python issue is because we auto-generate the .pxd file from the nanoarrow.h header and in Cython you have to declare the union separately (see https://github.com/apache/arrow-nanoarrow/blob/main/src/nanoarrow/nanoarrow_types.h#L483-L496 ).

@WillAyd
Copy link
Contributor Author

WillAyd commented Jul 19, 2023

I don't think the Python error is related?

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the Python failure is happening on main too (maybe Cython 3 releated).

Just a few nits!

Comment on lines 671 to 679
static inline void ArrowIntervalInit(struct ArrowInterval* interval, enum ArrowType unit,
int32_t months, int32_t days, int32_t ms,
int64_t ns) {
interval->unit = unit;
interval->months = months;
interval->days = days;
interval->ms = ms;
interval->ns = ns;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can just zero-initialize and set unit and encourage direct field access like ArrowDecimal? Most users won't use all the fields at once.

int32_t days;
int32_t ms;
int64_t ns;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

I think this will need the fields documented for them to show up in the generated documentation (and \ingroup nanoarrow-utils for it to show up in the right place).

We have a lot of ArrowType variables that are named type...would that be appropriate here?

dist/nanoarrow.h Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert the change to this file? I should add the hook to modify it to pre-commit but right now it's updated via a nightly job (for better or worse).

paleolimbot added a commit that referenced this pull request Jul 19, 2023
As noted in #258, the Python tests are failing! I'm not sure the details
of why this started now, but `StopIteration` was being raised in a
generator which should just return instead.
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

(I should have merged this before the pre-commit one...I'll take care of that 😬 )

@paleolimbot paleolimbot merged commit 9e5a991 into apache:main Jul 20, 2023
26 checks passed
paleolimbot pushed a commit that referenced this pull request Aug 1, 2023
@paleolimbot paleolimbot modified the milestone: nanoarrow 0.4.0 Sep 21, 2023
@paleolimbot paleolimbot added this to the nanoarrow 0.3.0 milestone Sep 21, 2023
@WillAyd WillAyd deleted the add-interval branch September 5, 2024 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants