-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-33749: [Ruby] Add Arrow::RecordBatch#each_raw_record #37137
Conversation
Add Arrow::Table#each_raw_record to make Arrow::Table#raw_records be iterable.
|
n_columns_(n_columns) { | ||
record_(Qnil), | ||
n_columns_(n_columns), | ||
is_produce_mode_(false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about creating RawRecordProducer
instead of adding a new "produce" mode to RawRecordsBuilder
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix: 3318abc
Sounds pretty nice to me!
I tried to split RawRecordProducer
into its own source file because some methods had to know the row-wise processing situation.
What do you think of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I should have written RawRecordsProducer
not RawRecordProducer
because it produces multiple raw records.
So I think that we can still use raw-records.cpp
instead of raw-record.cpp
because it also handles multiple raw records not just one raw record.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix: d3e5f3f
To enhance clarity and maintainability: - Implemented the new `RawRecordProducer` class, dedicated to processing records on a row-by-row basis. - This specialization eliminates conditional branching within the `convert` method previously present in the RawRecordsBuilder.
Could you try writing |
Because it produces multiple raw records.
ec6f5a7
to
93adaac
Compare
I got it. So as the next step, I will add the tests about ArrowTable#each_raw_record and Arrow::RecordBatch#each_raw_record as the following. |
const auto& chunked_array = table.column(i).get(); | ||
column_index_ = i; | ||
|
||
for (const auto array : chunked_array->chunks()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we miss &
here to avoid needless reference count increment:
for (const auto array : chunked_array->chunks()) { | |
for (const auto& array : chunked_array->chunks()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix: 34b53e0
I changed the logic of void produce(const arrow::Table& table)
. So could you recheck this point again?🙏
|
||
class RawRecordsProducer : private Converter, public arrow::ArrayVisitor { | ||
public: | ||
explicit RawRecordsProducer(int n_columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove n_columns
here and define n_columns
in each produce()
?
void produce(const arrow::RecordBatch& record_batch) {
auto n_columns = record_batch->num_columns();
// ...
}
void produce(const arrow::Table& table) {
auto n_columns = table->num_columns();
// ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix: b0731a7
include EachRawRecordBasicArraysTests | ||
|
||
def build(schema, records) | ||
Arrow::Table.new(schema, records) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you create multiple chunked arrays?
Arrow::Table.new(schema, records) | |
Arrow::Table.new(schema, | |
[ | |
Arrow::RecordBatch.new(schema, records[0, 2]), | |
Arrow::RecordBatch.new(schema, records[2..-1]), | |
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix: 34b53e0
Upcoming Tasks
|
Also considered the multiple chunked layout
@kou |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
I'll merge this as-is. We can improve this by follow-up tasks. Could you open new issues for the followings?
- Use an empty chunked array for
Arrow::Table
tests - Add support for
table.each_raw_record.to_a
- Unify tests with
raw_records
andeach_raw_record
(If you have more improvement ideas, please open new issues for them. We can discuss further on them.)
include EachRawRecordDenseUnionArrayTests | ||
|
||
def build(type, records) | ||
build_record_batch(type, records).to_table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you open a new issue for including an empty chunked array like https://github.com/apache/arrow/pull/37137/files#diff-dadd173e21fd2d82b185504806bba81780b35e1b795ad12794ef06b0eab47ec7R521-R530 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I've just created this issue here: #37561
@kou
|
Thanks! |
If you have some issues that you want to work on, please add |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7dd8624. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change This change aligns the behavior of `each_raw_record` with standard Ruby practices by returning an enumerator when no block is provided ### What changes are included in this PR? - Made `Arrow::Table#each_raw_record` and `Arrow::RecordBatch#each_raw_record` return Enumerator when it was called without block. - Added related tests - Resolved warnings related to duplicate test classes which were caused by #37137 ### Are these changes tested? Yes. ### Are there any user-facing changes? No * Closes: #37562 Authored-by: otegami <a.s.takuya1026@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…#37137) ### Rationale for this change This change allows for efficient iteration over large datasets, particularly those utilizing the Apache Parquet format. ### What changes are included in this PR? - Add the following methods to make the raw_records method iterable. - Arrow::RecordBatch#each_raw_record - Arrow::Table#each_raw_record - Add related test ### Are these changes tested? Yes. ### Are there any user-facing changes? No. This PR is related to apache#33749 * Closes: apache#33749 Lead-authored-by: otegami <a.s.takuya1026@gmail.com> Co-authored-by: takuya kodama <a.s.takuya1026@gmail.com> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…ache#37600) ### Rationale for this change This change aligns the behavior of `each_raw_record` with standard Ruby practices by returning an enumerator when no block is provided ### What changes are included in this PR? - Made `Arrow::Table#each_raw_record` and `Arrow::RecordBatch#each_raw_record` return Enumerator when it was called without block. - Added related tests - Resolved warnings related to duplicate test classes which were caused by apache#37137 ### Are these changes tested? Yes. ### Are there any user-facing changes? No * Closes: apache#37562 Authored-by: otegami <a.s.takuya1026@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…#37137) ### Rationale for this change This change allows for efficient iteration over large datasets, particularly those utilizing the Apache Parquet format. ### What changes are included in this PR? - Add the following methods to make the raw_records method iterable. - Arrow::RecordBatch#each_raw_record - Arrow::Table#each_raw_record - Add related test ### Are these changes tested? Yes. ### Are there any user-facing changes? No. This PR is related to apache#33749 * Closes: apache#33749 Lead-authored-by: otegami <a.s.takuya1026@gmail.com> Co-authored-by: takuya kodama <a.s.takuya1026@gmail.com> Co-authored-by: Sutou Kouhei <kou@cozmixng.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…ache#37600) ### Rationale for this change This change aligns the behavior of `each_raw_record` with standard Ruby practices by returning an enumerator when no block is provided ### What changes are included in this PR? - Made `Arrow::Table#each_raw_record` and `Arrow::RecordBatch#each_raw_record` return Enumerator when it was called without block. - Added related tests - Resolved warnings related to duplicate test classes which were caused by apache#37137 ### Are these changes tested? Yes. ### Are there any user-facing changes? No * Closes: apache#37562 Authored-by: otegami <a.s.takuya1026@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
This change allows for efficient iteration over large datasets, particularly those utilizing the Apache Parquet format.
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
No.
This PR is related to #33749