[C++][Python] Basic conversion of RecordBatch to Arrow Tensor - add support for row-major #40866
Closed
Tracked by #40058
Although this is a new feature and not a bug fix, it changes the behaviour of a feature newly introduced for 16.0. I would therefore propose including it in 16.0 as well, so that we avoid making a breaking change to the new feature in the very next release.
jorisvandenbossche added a commit that referenced this issue on Apr 10, 2024
…or - add support for row-major (#40867)

### Rationale for this change

The conversion from `RecordBatch` to `Tensor` class now exists but it doesn't support row-major `Tensor` as an output. This PR adds support for an option to construct row-major `Tensor`.

### What changes are included in this PR?

This PR adds a `row_major` option in `RecordBatch::ToTensor` so that row-major `Tensor` can be constructed. The default conversion will be row-major.

This for example works:

```python
>>> import pyarrow as pa
>>> import numpy as np

>>> arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> arr2 = [10, 20, 30, 40, 50, 60, 70, 80, 90]
>>> batch = pa.RecordBatch.from_arrays(
...     [
...         pa.array(arr1, type=pa.uint16()),
...         pa.array(arr2, type=pa.int16()),
...     ], ["a", "b"]
... )

# Row-major
>>> batch.to_tensor()
<pyarrow.Tensor>
type: int32
shape: (9, 2)
strides: (8, 4)

>>> batch.to_tensor().to_numpy().flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

# Column-major
>>> batch.to_tensor(row_major=False)
<pyarrow.Tensor>
type: int32
shape: (9, 2)
strides: (4, 36)

>>> batch.to_tensor(row_major=False).to_numpy().flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
```

### Are these changes tested?

Yes, in C++ and Python.

### Are there any user-facing changes?

No.

* GitHub Issue: #40866

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Issue resolved by pull request #40867
raulcd pushed a commit that referenced this issue on Apr 11, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue on Apr 15, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue on May 2, 2024
tolleybot pushed a commit to tmct/arrow that referenced this issue on May 4, 2024
rok pushed a commit to tmct/arrow that referenced this issue on May 8, 2024
rok pushed a commit to tmct/arrow that referenced this issue on May 8, 2024
vibhatha pushed a commit to vibhatha/arrow that referenced this issue on May 25, 2024
### Describe the enhancement requested

This issue is part of #40058 and adds an option to construct a row-major `Tensor` from a `RecordBatch`. Row-major is the layout used most often when working with tensors.

### Component(s)

C++, Python
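To illustrate what the row-major option changes, here is a minimal sketch in plain Python of how byte strides follow from the shape and element size of a contiguous tensor. The helper `contiguous_strides` is a hypothetical name used only for this illustration; it is not part of the Arrow API and is not how Arrow computes strides internally.

```python
def contiguous_strides(shape, itemsize, row_major=True):
    """Return byte strides for a contiguous array of the given shape.

    Hypothetical helper for illustration only, not an Arrow function.
    """
    strides = []
    step = itemsize
    # Row-major: the last dimension varies fastest, so walk dims in reverse.
    # Column-major: the first dimension varies fastest, so walk dims in order.
    dims = reversed(shape) if row_major else shape
    for dim in dims:
        strides.append(step)
        step *= dim
    if row_major:
        strides.reverse()
    return tuple(strides)


# A tensor with 9 rows and 2 columns of int32 (4 bytes per element).
print(contiguous_strides((9, 2), 4, row_major=True))   # (8, 4)
print(contiguous_strides((9, 2), 4, row_major=False))  # (4, 36)
```

In the row-major case the two values of one row sit next to each other in memory (4 bytes apart) and rows are 8 bytes apart. In the column-major case each column of 9 values is stored contiguously, so moving to the next column skips 36 bytes.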