-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41749: [GLib] Allow getting a RecordBatchReader from a Dataset or Scanner #41750
Conversation
|
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
I'm not sure how I've broken the tests on MinGW, I'll need to figure out what's going wrong there.
I guess something is keeping a file open and preventing the directory contents from being deleted. |
Could you try this? diff --git a/c_glib/test/dataset/test-file-system-dataset.rb b/c_glib/test/dataset/test-file-system-dataset.rb
index a64022b618..d4bf567397 100644
--- a/c_glib/test/dataset/test-file-system-dataset.rb
+++ b/c_glib/test/dataset/test-file-system-dataset.rb
@@ -63,7 +63,11 @@ class TestDatasetFileSystemDataset < Test::Unit::TestCase
def test_record_batch_reader
dataset, expected_table = create_dataset
reader = dataset.to_record_batch_reader
- assert_equal(expected_table, reader.read_all)
+ begin
+ assert_equal(expected_table, reader.read_all)
+ ensure
+ reader.close
+ end
end
def create_dataset
diff --git a/c_glib/test/dataset/test-scanner.rb b/c_glib/test/dataset/test-scanner.rb
index 714760d6a3..1a50b7fe68 100644
--- a/c_glib/test/dataset/test-scanner.rb
+++ b/c_glib/test/dataset/test-scanner.rb
@@ -47,6 +47,11 @@ class TestDatasetScanner < Test::Unit::TestCase
end
def test_record_batch_reader
- assert_equal(@table, @scanner.to_record_batch_reader.read_all)
+ reader = @scanner.to_record_batch_reader
+ begin
+ assert_equal(@table, reader.read_all)
+ ensure
+ reader.close
+ end
end
end |
That doesn't work because there aren't bindings for the I guess for now I could manually cleanup the temporary directory and use the Adding a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work?
diff --git a/c_glib/test/dataset/test-file-system-dataset.rb b/c_glib/test/dataset/test-file-system-dataset.rb
index ac67c6d61d..972c2d2ef9 100644
--- a/c_glib/test/dataset/test-file-system-dataset.rb
+++ b/c_glib/test/dataset/test-file-system-dataset.rb
@@ -21,8 +21,7 @@ class TestDatasetFileSystemDataset < Test::Unit::TestCase
def setup
omit("Arrow Dataset is required") unless defined?(ArrowDataset)
- tmpdir = Dir.mktmpdir
- begin
+ Dir.mktmpdir do |tmpdir|
@dir = tmpdir
@format = ArrowDataset::IPCFileFormat.new
@factory = ArrowDataset::FileSystemDatasetFactory.new(@format)
@@ -33,14 +32,6 @@ class TestDatasetFileSystemDataset < Test::Unit::TestCase
ArrowDataset::DirectoryPartitioning.new(partitioning_schema)
@factory.partitioning = @partitioning
yield
- ensure
- # We have to ignore errors trying to remove the directory due to
- # the RecordBatchReader not closing files
- # (https://github.com/apache/arrow/issues/41771).
- # Also request GC first which should free any remaining RecordBatchReader
- # and close open files.
- GC.start
- FileUtils.remove_entry(tmpdir, force: true)
end
end
@@ -72,7 +63,11 @@ class TestDatasetFileSystemDataset < Test::Unit::TestCase
def test_record_batch_reader
dataset, expected_table = create_dataset
reader = dataset.to_record_batch_reader
- assert_equal(expected_table, reader.read_all)
+ begin
+ assert_equal(expected_table, reader.read_all)
+ ensure
+ reader.unref
+ end
end
def create_dataset
diff --git a/c_glib/test/dataset/test-scanner.rb b/c_glib/test/dataset/test-scanner.rb
index ef834674cb..fe3478a866 100644
--- a/c_glib/test/dataset/test-scanner.rb
+++ b/c_glib/test/dataset/test-scanner.rb
@@ -21,8 +21,7 @@ class TestDatasetScanner < Test::Unit::TestCase
def setup
omit("Arrow Dataset is required") unless defined?(ArrowDataset)
- tmpdir = Dir.mktmpdir
- begin
+ Dir.mktmpdir do |tmpdir|
path = File.join(tmpdir, "table.arrow")
@table = build_table(visible: [
build_boolean_array([true, false, true]),
@@ -40,14 +39,6 @@ class TestDatasetScanner < Test::Unit::TestCase
builder = @dataset.begin_scan
@scanner = builder.finish
yield
- ensure
- # We have to ignore errors trying to remove the directory due to
- # the RecordBatchReader not closing files
- # (https://github.com/apache/arrow/issues/41771).
- # Also request GC first which should free any remaining RecordBatchReader
- # and close open files.
- GC.start
- FileUtils.remove_entry(tmpdir, force: true)
end
end
@@ -56,6 +47,11 @@ class TestDatasetScanner < Test::Unit::TestCase
end
def test_record_batch_reader
- assert_equal(@table, @scanner.to_record_batch_reader.read_all)
+ reader = @scanner.to_record_batch_reader
+ begin
+ assert_equal(@table, reader.read_all)
+ ensure
+ reader.unref
+ end
end
end
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Yes that works thanks, that's much better. Although it looks like the unref call may be unnecessary after #41824, so should I wait for that to be merged then remove the unref call? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
We can use unref
for now because #41824 may be rejected.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1c9e393. There were 4 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 148 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
See #41749
What changes are included in this PR?
Adds
to_reader
methods toGADatasetDataset
andGADatasetScanner
.Are these changes tested?
Yes I've added new unit tests.
Are there any user-facing changes?
Yes this is a new feature.