Implement alt var length view for external sort. #4925

Merged
merged 1 commit into from
May 1, 2024

Conversation

lums658 (Contributor) commented May 1, 2024

This PR implements an alternate view for var length data. Instead of maintaining an offsets range, as in the first var_length_view, this class materializes all of the individual subranges over the var length data. The associated unit tests for this PR include tests for sorting.

Advantages:

  • The alt_var_length_view does not return a proxy when accessed and, as a result, can be sorted directly. Avoiding proxy construction on each access can also be more efficient.
  • The alt_var_length_view class can be constructed with either the "Arrow" format for offsets or the "TileDB" format. In the latter case, the constructor takes the offsets range as well as the final offset value.

Disadvantages:

  • The alt_var_length_view requires more storage. An individual std::ranges::subrange consumes 16 bytes, whereas a size_t consumes 8 bytes. On the other hand, if the original var_length_view needs to be sorted, the offsets range plus a permutation range also costs 16 bytes of storage per var length element.

Notes:

  • The alt_var_length_view class ends up being a fairly thin wrapper around a std::vector of std::ranges::subrange. However, inheriting from std::vector is not considered good practice, so the vector is kept as member data.
  • Because this view maintains internal data whose size is not O(1), the copy constructor and copy assignment operator are disabled.

[sc-43636]


TYPE: IMPROVEMENT
DESC: Implement alt var length view for external sort.
KiterLuc force-pushed the al/alt-var-length-view/ch43636 branch from 1950a26 to b3d6bb1 May 1, 2024 06:47
KiterLuc merged commit 7f511fd into dev May 1, 2024
58 checks passed
KiterLuc deleted the al/alt-var-length-view/ch43636 branch May 1, 2024 07:41
KiterLuc pushed a commit that referenced this pull request May 1, 2024
This PR is stacked on #4925.  It adds additional constructors for the two view classes for var length data.

The previous implementations had constructors that took only input ranges and constructed a var length view over the data specified by those ranges.  However, in general, we may have a range that is not completely filled with data, and so may need to specify the data to be viewed in some other way.  The complete set of constructors takes
* ranges for data and offsets (arrow format) - the size of the resulting view is one less than the size of the offset range
* ranges plus sizes for data and offsets (arrow format) - the size of the resulting view is one less than the size given for the offset range
* iterator pairs for data and offsets (arrow format) - the size of the resulting view is one less than the difference between end and begin of the offsets pair
* iterator pairs for data and offsets, with sizes (arrow format) - the size of the resulting view is one less than the size given for the offset range
* ranges for data and offsets (tiledb format) - the size of the resulting view is equal to the size of the offset range (takes an extra argument for last offset value)
* ranges plus sizes for data and offsets (tiledb format) - the size of the resulting view is equal to the size given for the offset range (takes an extra argument for last offset value)
* iterator pairs for data and offsets (tiledb format) - the size of the resulting view is equal to the difference between end and begin of the offsets pair (takes an extra argument for last offset value)
* iterator pairs for data and offsets, with sizes (tiledb format) - the size of the resulting view is equal to the size given for the offset range (takes an extra argument for last offset value)

---
TYPE: NO_HISTORY
DESC: Additional constructors for external sort classes.
KiterLuc pushed a commit that referenced this pull request May 1, 2024
This PR is stacked on #4925. It adds additional constructors for the two
view classes for var length data.

The previous implementations had constructors that just took input
ranges, and constructed a var length view over the data specified by
those ranges. However, in general, we may have a range that is not
completely filled with data, and so may need to specify the data to be
viewed in some other way. The complete set of constructors takes
* ranges for data and offsets (arrow format) - the size of the resulting
view is one less than the size of the offset range
* ranges plus sizes for data and offsets (arrow format) - the size of
the resulting view is one less than the size given for the offset range
* iterator pairs for data and offsets (arrow format) - the size of the
resulting view is one less than the difference between end and begin of
the offsets pair
* iterator pairs for data and offsets, with sizes (arrow format) - the
size of the resulting view is one less than the size given for the
offset range
* ranges for data and offsets (tiledb format) - the size of the
resulting view is equal to the size of the offset range (takes an extra
argument for last offset value)
* ranges plus sizes for data and offsets (tiledb format) - the size of
the resulting view is equal to the size given for the offset range
(takes an extra argument for last offset value)
* iterator pairs for data and offsets (tiledb format) - the size of the
resulting view is equal to the difference between end and begin of the
offsets pair (takes an extra argument for last offset value)
* iterator pairs for data and offsets, with sizes (tiledb format) - the
size of the resulting view is equal to the size given for the offset
range (takes an extra argument for last offset value)

[sc-44576]

---
TYPE: NO_HISTORY
DESC: Additional constructors for external sort classes.