Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++] Integrate SOMAColumn: Arrow adapter methods, part 1 #3405

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

XanthosXanthopoulos
Copy link
Collaborator

@XanthosXanthopoulos XanthosXanthopoulos commented Dec 6, 2024

This PR replaces the Arrow schema to TileDB schema transformation to use the SOMAColumn create methods.
Also there are a set of new data converters from arrow arrays to std::array for simplification.

This migration also enforces a current domain restriction for string dimensions to libtiledbsoma in addition to the restriction being present only on the R and Python APIs.

@XanthosXanthopoulos XanthosXanthopoulos changed the title Integrate SOMAColumn in Arrow adapter methods [WIP] Integrate SOMAColumn in Arrow adapter methods Part 2 Dec 8, 2024
@XanthosXanthopoulos XanthosXanthopoulos changed the title Integrate SOMAColumn in Arrow adapter methods Part 2 [c++] Integrate SOMAColumn in Arrow adapter methods Part 2 Dec 8, 2024
@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as ready for review December 8, 2024 16:25
@johnkerl johnkerl changed the title [c++] Integrate SOMAColumn in Arrow adapter methods Part 2 [c++] Integrate SOMAColumn in Arrow adapter methods, part 2 Dec 9, 2024
Copy link
Member

@nguyenv nguyenv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what the Skip and Take are for and document it? It looks like Take is the index of the column to retrieve and Skip is relevant only for geometry columns (where it's always 2)?

Also is there a way to use std::variant or a templated type instead of std::any or would that make things too complicated?

Comment on lines 743 to 867
/**
* Return a copy of the data in a specified column of an arrow table.
* Complex column types are supported. The for each sub column are an
* std::array<T, 2> casted as an std::any object.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/**
* Return a copy of the data in a specified column of an arrow table.
* Complex column types are supported. The for each sub column are an
* std::array<T, 2> casted as an std::any object.
*/
/**
* Return a copy of the data in a specified column of an arrow table.
* Complex column types are supported. The type for each sub column is
* an std::array<T, 2> casted as an std::any object.
*/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip and Take are used in 2 places with 2 specific sets on values (either Skip=3 and Take=2 or Skip=0 and Take=2) and are independent of the geometry column. Their usage is to extract specific subranges of ArrowArray data and they come in handy during ArrowSchema -> TileDBSchema where the arrow array provided has 5 values per dimension and we only need the last 2 to set the current domain.

As to using std::variant, adding more SOMAColumn types would require changing multiple variants. The use of std::any here is to enable runtime polymorphism and indirectly introduces a runtime type check (via any_cast, make_any) between the templated function and the actual dimension type. std::variant can provide all the above it is just a different style I am open to discuss further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@XanthosXanthopoulos to be clear, I'm not asking you to use a different solution. I'm asking to be clear to the reader the (very good!) reasons you chose the solution you're using.

Please take both paragraphs you just wrote, and put them in the code as comments as well. They are highly informative. :)

Copy link
Member

@nguyenv nguyenv Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sorry if I was unclear too. This explaination is good, and I was just looking for it to also be documented in the code.

@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch 2 times, most recently from 5485141 to 0e69ed7 Compare December 13, 2024 16:07
@XanthosXanthopoulos XanthosXanthopoulos changed the base branch from xan/sc-59427/soma-column to xan/sc-59427/soma-geometry-column December 13, 2024 16:08
@XanthosXanthopoulos XanthosXanthopoulos changed the title [c++] Integrate SOMAColumn in Arrow adapter methods, part 2 [c++] Integrate SOMAColumn: Arrow adapter methods, part 1 Dec 13, 2024
@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch from 0e69ed7 to d6d6187 Compare December 13, 2024 16:10
@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-geometry-column branch 4 times, most recently from 8daf17e to a426c7a Compare January 8, 2025 19:04
Base automatically changed from xan/sc-59427/soma-geometry-column to main January 8, 2025 19:38
@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as draft January 14, 2025 14:04
@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch from d6d6187 to af1f010 Compare January 14, 2025 14:05
Copy link

codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.32%. Comparing base (7616147) to head (01c3608).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3405      +/-   ##
==========================================
+ Coverage   86.22%   86.32%   +0.09%     
==========================================
  Files          55       55              
  Lines        6410     6375      -35     
==========================================
- Hits         5527     5503      -24     
+ Misses        883      872      -11     
Flag Coverage Δ
python 86.32% <ø> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 86.32% <ø> (+0.09%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch from af1f010 to 2401416 Compare January 14, 2025 18:38
@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as ready for review January 14, 2025 18:39
@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch from 2401416 to 77a01e1 Compare January 15, 2025 12:02
@jp-dark jp-dark requested a review from nguyenv January 15, 2025 14:51
Copy link
Collaborator

@jp-dark jp-dark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the get_table_column_by_name function, but everything else looks good to me. However, someone with more familiarity with the arrow adapter code should look over this as well.

Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @XanthosXanthopoulos !

libtiledbsoma/src/utils/arrow_adapter.cc Show resolved Hide resolved
libtiledbsoma/src/utils/arrow_adapter.cc Outdated Show resolved Hide resolved
libtiledbsoma/src/utils/arrow_adapter.cc Outdated Show resolved Hide resolved
index_column_schema->children[i]->name));
}

if ((*column)->tiledb_dimensions().has_value()) {
Copy link
Member

@johnkerl johnkerl Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit awkward.

  • column needs to be column_it or some such -- it is not a column, it is an iterator
  • Then, const auto column = *column_it after you check that column _it != columns.end()
  • Then the rest of these (*column)->foo become column->foo as they should be

libtiledbsoma/src/utils/arrow_adapter.cc Outdated Show resolved Hide resolved
libtiledbsoma/src/utils/arrow_adapter.h Outdated Show resolved Hide resolved
std::string name);
/**
* Return a copy of the data in a specified column of an arrow table.
* Complex column types are supported. The type for each sub column is an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Complex column types are supported. The type for each sub column is an
* Complex column types are supported. The type for each subcolumn is an

}

/**
* Read a part of am Arrow array to an std::array and cast in to and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Read a part of am Arrow array to an std::array and cast in to and
* Read a part of an Arrow array to an std::array and cast in to an

*
* @example get_table_any_column<3, 2>(array, schema) will return an
* std::array<T, 3> where T is the appropriate type based on the Arrow array
* format casted as an std::any object. The std::array will skip 2 element
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* format casted as an std::any object. The std::array will skip 2 element
* format casted as an std::any object. The std::array will skip 2 elements

REQUIRE(str_range[0] < " ");
REQUIRE(str_external.first <= " ");
REQUIRE(str_range[1] > "~");
// REQUIRE(str_external.second >= "~");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove

@@ -98,4 +98,22 @@ std::vector<uint8_t> cast_bit_to_uint8(ArrowSchema* schema, ArrowArray* array) {
return casted;
}

std::shared_ptr<SOMAColumn> find_column_by_name(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 😎

Comment on lines +912 to +913
* @tparam Take The number of elements to read.
* @tparam Skip The number of elements to skip.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why Take and Skip are template parameters and not simply function arguments. (The latter seems simpler.)

After having read through your code here, the answer seems to be: because you are constructing std::array objects (which are fixed-size) in various places, rather than std::vector (which are not fixed-size).

It's a good reason to have Take and Skip be implemented as template parameters, but I found it non-obvious.

Please come out and state this reasoning, right here, as a comment.

Copy link
Collaborator

@jp-dark jp-dark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the remaining change requests are addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[c++] Add an abstraction layer between SOMA columns and TileDB dimensions and attributes
4 participants