GH-33749: [Ruby] Add Arrow::RecordBatch#each_raw_record #37137

otegami · 2023-08-12T08:41:14Z

Rationale for this change

This change allows for efficient iteration over large datasets, particularly those utilizing the Apache Parquet format.

What changes are included in this PR?

- Add the following methods to make the raw_records method iterable.
  - Arrow::RecordBatch#each_raw_record
  - Arrow::Table#each_raw_record
- Add related test
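To illustrate the difference between materializing all rows and iterating over them, here is a minimal plain-Ruby sketch. `ColumnarBatch` is a hypothetical stand-in invented for this example. It is not the red-arrow implementation, which does this work in the C++ extension.

```ruby
# Hypothetical stand-in for a columnar batch; NOT the red-arrow implementation.
class ColumnarBatch
  def initialize(columns)
    @columns = columns # one Ruby array per column
    @n_rows = columns.empty? ? 0 : columns.first.size
  end

  # Builds every row up front, so memory use grows with the number of rows.
  def raw_records
    Array.new(@n_rows) { |i| @columns.map { |column| column[i] } }
  end

  # Builds and yields one row at a time instead of building the whole array.
  def each_raw_record
    @n_rows.times do |i|
      yield @columns.map { |column| column[i] }
    end
  end
end

batch = ColumnarBatch.new([[1, 2, 3], ["a", "b", "c"]])
rows = []
batch.each_raw_record { |record| rows << record }
```

In this sketch both methods produce the same rows; only the point at which the rows are built differs.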

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

This PR is related to #33749

Closes: [Ruby] Add Arrow::RecordBatch#each_raw_record #33749

Add Arrow::Table#each_raw_record to make Arrow::Table#raw_records be iterable.

github-actions · 2023-08-12T08:41:40Z

⚠️ GitHub issue #33749 has been automatically assigned in GitHub to PR creator.

kou · 2023-08-12T21:03:46Z

ruby/red-arrow/ext/arrow/raw-records.cpp

-          n_columns_(n_columns) {
+          record_(Qnil),
+          n_columns_(n_columns),
+          is_produce_mode_(false) {


How about creating RawRecordProducer instead of adding a new "produce" mode to RawRecordsBuilder?

fix: 3318abc
Sounds pretty nice to me!
I tried to split RawRecordProducer into its own source file because some methods had to know the row-wise processing situation.
What do you think of it?

Ah, I should have written RawRecordsProducer not RawRecordProducer because it produces multiple raw records.

So I think that we can still use raw-records.cpp instead of raw-record.cpp because it also handles multiple raw records not just one raw record.

fix: d3e5f3f

To enhance clarity and maintainability:
- Implemented the new `RawRecordProducer` class, dedicated to processing records on a row-by-row basis.
- This specialization eliminates conditional branching within the `convert` method previously present in the RawRecordsBuilder.

kou · 2023-08-15T02:54:21Z

Could you try writing EachRawRecordBasicArraysTest#test_boolean into ruby/red-arrow/test/each-raw-record/test-basic-arrays.rb?
It will show what we should do as the next step.

otegami · 2023-08-17T12:33:01Z

> Could you try writing EachRawRecordBasicArraysTest#test_boolean into ruby/red-arrow/test/each-raw-record/test-basic-arrays.rb?
> It will show what we should do as the next step.

I got it. As the next step, I will add tests for Arrow::Table#each_raw_record and Arrow::RecordBatch#each_raw_record as follows.
fix: 93adaac

kou · 2023-08-18T01:57:37Z

ruby/red-arrow/ext/arrow/raw-records.cpp

+              const auto& chunked_array = table.column(i).get();
+              column_index_ = i;
+
+              for (const auto array : chunked_array->chunks()) {


Ah, we're missing & here to avoid a needless reference count increment:

Suggested change

for (const auto array : chunked_array->chunks()) {

for (const auto& array : chunked_array->chunks()) {

fix: 34b53e0
I changed the logic of void produce(const arrow::Table& table), so could you check this point again? 🙏

kou · 2023-08-18T02:03:40Z

ruby/red-arrow/ext/arrow/raw-records.cpp

+
+    class RawRecordsProducer : private Converter, public arrow::ArrayVisitor {
+    public:
+      explicit RawRecordsProducer(int n_columns)


Can we remove n_columns here and define n_columns in each produce()?

void produce(const arrow::RecordBatch& record_batch) {
  // record_batch is a reference, so use "." rather than "->"
  auto n_columns = record_batch.num_columns();
  // ...
}

void produce(const arrow::Table& table) {
  auto n_columns = table.num_columns();
  // ...
}

fix: b0731a7

kou · 2023-08-18T02:47:08Z

ruby/red-arrow/test/each-raw-record/test-basic-arrays.rb

+  include EachRawRecordBasicArraysTests
+
+  def build(schema, records)
+    Arrow::Table.new(schema, records)


Could you create multiple chunked arrays?

Suggested change

Arrow::Table.new(schema, records)

Arrow::Table.new(schema,
                 [
                   Arrow::RecordBatch.new(schema, records[0, 2]),
                   Arrow::RecordBatch.new(schema, records[2..-1]),
                 ])

fix: 34b53e0
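The point of the suggestion above is that row-wise iteration has to cross chunk boundaries correctly. A minimal plain-Ruby sketch of that idea follows. `ChunkedRows` is a hypothetical class made up for this illustration, and each chunk is simply an array of rows. It does not use red-arrow's Arrow::Table or Arrow::RecordBatch.

```ruby
# Hypothetical table assembled from several row batches; NOT red-arrow's Arrow::Table.
class ChunkedRows
  def initialize(batches)
    @batches = batches
  end

  # Walks the batches in order and yields each row, so callers see one
  # continuous stream of records regardless of how the data is chunked.
  def each_raw_record
    @batches.each do |batch|
      batch.each { |record| yield record }
    end
  end
end

records = [[true], [nil], [false]]
# Same split as in the suggested change: the first two rows, then the rest.
table = ChunkedRows.new([records[0, 2], records[2..-1]])
collected = []
table.each_raw_record { |record| collected << record }
```

Splitting the input this way means a test fails if the iteration stops at, or mishandles, the boundary between the first and second chunk.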

otegami · 2023-08-26T08:43:49Z

Also considered the multiple chunked layout

otegami · 2023-09-04T11:52:56Z

@kou
I added test cases in the same way as we did for raw_records. (That is, I didn't change the test cases that are used for Arrow::Table#raw_records and Arrow::RecordBatch#raw_records.)
On the other hand, I'm concerned that the diff is too large.
If it would be better to split the PR by test case, I'm happy to make that adjustment.

kou

+1

I'll merge this as-is. We can improve this in follow-up tasks. Could you open new issues for the following?

- Use an empty chunked array for Arrow::Table tests
- Add support for table.each_raw_record.to_a
- Unify tests with raw_records and each_raw_record

(If you have more improvement ideas, please open new issues for them. We can discuss them further there.)

kou · 2023-09-05T01:31:53Z

ruby/red-arrow/test/each-raw-record/test-dense-union-array.rb

+  include EachRawRecordDenseUnionArrayTests
+
+  def build(type, records)
+    build_record_batch(type, records).to_table


Could you open a new issue for including an empty chunked array like https://github.com/apache/arrow/pull/37137/files#diff-dadd173e21fd2d82b185504806bba81780b35e1b795ad12794ef06b0eab47ec7R521-R530 ?

Sure, I've just created this issue here: #37561

otegami · 2023-09-05T12:38:54Z

@kou
Thank you for reviewing. I've opened the following issues. Please let me know if anything is missing.
Also, if I come up with another improvement idea, I will open an issue for it.

- Use an empty chunked array for Arrow::Table tests: #37561
- Add support for table.each_raw_record.to_a: #37562
- Unify tests with raw_records and each_raw_record: #37563

kou · 2023-09-06T01:38:24Z

Thanks!

kou · 2023-09-06T01:39:17Z

If there are issues that you want to work on, please comment "take" on them to assign them to yourself.

conbench-apache-arrow · 2023-09-06T12:26:26Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7dd8624.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

### Rationale for this change

This change aligns the behavior of `each_raw_record` with standard Ruby practices by returning an enumerator when no block is provided.

### What changes are included in this PR?

- Made `Arrow::Table#each_raw_record` and `Arrow::RecordBatch#each_raw_record` return Enumerator when it was called without block.
- Added related tests
- Resolved warnings related to duplicate test classes which were caused by #37137

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No

* Closes: #37562

Authored-by: otegami <a.s.takuya1026@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
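The "return an Enumerator when no block is given" convention mentioned in this commit message can be sketched in plain Ruby as follows. `RecordSource` is a hypothetical class used only for illustration. The actual red-arrow change may be implemented differently.

```ruby
# Hypothetical class showing the common Ruby convention; NOT the red-arrow code.
class RecordSource
  def initialize(records)
    @records = records
  end

  def each_raw_record
    # Without a block, return an Enumerator for this method. The block passed
    # to to_enum lets Enumerator#size answer without iterating.
    return to_enum(__method__) { @records.size } unless block_given?

    @records.each { |record| yield record }
  end
end

source = RecordSource.new([[1, "a"], [2, "b"]])
enumerator = source.each_raw_record     # no block given
all_records = enumerator.to_a           # the to_a use case from the follow-up issue
first_record = source.each_raw_record.first
```

With this pattern, callers can chain any Enumerable method (`to_a`, `first`, `map`, and so on) without the method needing a separate array-returning variant.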

…#37137)

### Rationale for this change

This change allows for efficient iteration over large datasets, particularly those utilizing the Apache Parquet format.

### What changes are included in this PR?

- Add the following methods to make the raw_records method iterable.
  - Arrow::RecordBatch#each_raw_record
  - Arrow::Table#each_raw_record
- Add related test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

This PR is related to apache#33749

* Closes: apache#33749

Lead-authored-by: otegami <a.s.takuya1026@gmail.com>
Co-authored-by: takuya kodama <a.s.takuya1026@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>

[Ruby] Add Arrow::Table#each_raw_record for iterable raw record access

8dd6ff6

Add Arrow::Table#each_raw_record to make Arrow::Table#raw_records be iterable.

otegami requested a review from kou as a code owner August 12, 2023 08:41

github-actions bot added the Component: Ruby and awaiting review labels Aug 12, 2023

kou reviewed Aug 12, 2023

github-actions bot added the awaiting changes label and removed the awaiting review label Aug 12, 2023

github-actions bot added the awaiting change review label and removed the awaiting changes label Aug 14, 2023

otegami requested a review from kou August 14, 2023 12:25

github-actions bot added the awaiting changes label and removed the awaiting change review label Aug 15, 2023

Refactor RawRecordsProducer for Row-Wise Processing in raw-records.cpp

d3e5f3f

Because it produces multiple raw records.

github-actions bot added the awaiting change review label and removed the awaiting changes label Aug 16, 2023

Add EachRawRecordBasicArraysTest#test_boolean

93adaac

otegami force-pushed the ruby-arrow-table-each-raw-record branch from ec6f5a7 to 93adaac August 17, 2023 12:25

otegami added 3 commits August 18, 2023 07:41

Add the additonal test cases about the other data types to EachRawRec…

a16c4c4

…ordBasicArraysTests

Implemented RecordBatch#each_raw_record

b354011

Fix typo test_tring -> test_string

92b4b76

kou reviewed Aug 18, 2023

github-actions bot added the awaiting changes label and removed the awaiting change review label Aug 18, 2023

otegami added 2 commits August 18, 2023 20:33

Avoid needless reference count increment

66ea580

Refactor n_columns definition in produce methods

b0731a7

github-actions bot removed the awaiting changes label Aug 19, 2023

otegami added 2 commits August 28, 2023 07:32

Add test cases about dense union arrays

98a3599

Add test cases about dictionary arrays

516ef62

github-actions bot added the awaiting change review label and removed the awaiting changes label Aug 28, 2023

otegami added 7 commits August 31, 2023 07:43

Add test cases about list arrays

7050dcc

Add test cases about map arrays

72cc0b4

Add test cases about multiple columns

4456002

Add test cases about sparse union arrays

e1e1cd2

Add test cases about struct arrays

c7f4135

Add test cases about table

595f0c9

Improved the test case for multiple columns

3f173ff

Also considered the multiple chunked layout

otegami requested a review from kou September 4, 2023 11:53

kou approved these changes Sep 5, 2023

kou merged commit 7dd8624 into apache:main Sep 5, 2023
10 checks passed

kou removed the awaiting change review label Sep 5, 2023

github-actions bot added the awaiting merge label Sep 5, 2023

otegami deleted the ruby-arrow-table-each-raw-record branch September 5, 2023 01:59

This was referenced Sep 6, 2023

GH-37562: [Ruby] Add support for table.each_raw_record.to_a otegami/arrow#3

Closed

GH-37562: [Ruby] Add support for table.each_raw_record.to_a #37600

Merged
