Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37487: [C++][Parquet] Dataset: Implement sync ParquetFileFormat::GetReader #37514

Merged
merged 8 commits into from
Sep 19, 2023

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Sep 1, 2023

Rationale for this change

As #37487 says. When thread cnt == 1, the thread might blocking in ParquetFileFormat::GetReaderAsync, that's because:

  1. ParquetFileFormat::CountRows would call EnsureCompleteMetadata in io_executor
  2. EnsureCompleteMetadata call ParquetFileFormat::GetReader, which dispatch real request to async mode
  3. async is executed in io_executor.

1/3 in same fix-sized executor, causing deadlock.

What changes are included in this PR?

Implement sync ParquetFileFormat::GetReader.

Are these changes tested?

Currently not

Are there any user-facing changes?

Bugfix

@mapleFU mapleFU requested a review from westonpace as a code owner September 1, 2023 11:44
@github-actions
Copy link

github-actions bot commented Sep 1, 2023

⚠️ GitHub issue #37487 has been automatically assigned in GitHub to PR creator.

@mapleFU mapleFU force-pushed the dataset/implement-sync-getreader branch 2 times, most recently from 79bb6d0 to 555eb62 Compare September 1, 2023 16:27
@mapleFU mapleFU force-pushed the dataset/implement-sync-getreader branch from 555eb62 to a23d4db Compare September 1, 2023 16:47
@mapleFU mapleFU requested a review from pitrou September 1, 2023 18:17
@mapleFU
Copy link
Member Author

mapleFU commented Sep 5, 2023

I think we already have the test for GetReader

@mapleFU
Copy link
Member Author

mapleFU commented Sep 5, 2023

@pitrou @westonpace would you mind take a look?

@mapleFU
Copy link
Member Author

mapleFU commented Sep 7, 2023

@westonpace By the way, call Blocking Get in same thread pool might cause the deadlock, should we try to find out which would causing this and trying to fix them?

@westonpace
Copy link
Member

I'll try and look in more detail later today but here is the general advice.

Async methods which return a future.
Sync methods do not

User calls always start out as synchronous methods (e.g. we don't have adapters for python's asyncio)

We can convert from synchronous to asynchronous using something like...

... Foo(...) {
    Future<...> fut = FooAsync(...);
    return fut.result();
}

This is ok. One conversion from synchronous to asynchronous is ok and inevitable (since user calls start as sync).

However, it is not ok to convert from asynchronous to synchronous. For example, if Foo is defined as above then this would not be ok:

... BarAsync(...) {
   Future<...> fut = SomeAsyncMethod();
   return fut.Then([&] (...) {
       return Foo(...);
   });
}

This is also not ok:

... BarAsync(...) {
    auto x = Foo(...);
    return SomeAsyncMethod(x, ...);
}

Normally the fix is for BarAsync to call FooAsync.

Creating a Bar that is a synchronous method that mirrors BarAsync is ok, but leads to more code.

At a glance, it looks like that is what is happening here, which is maybe ok. However, I have to look in more detail.

@mapleFU
Copy link
Member Author

mapleFU commented Sep 7, 2023

Thanks for your nice advice!

I think this patch would be OK, I can check more Async usage around dataset.

@mapleFU
Copy link
Member Author

mapleFU commented Sep 11, 2023

@westonpace can we try to review this patch first? Since I believe at least this patch might solve the problem :-)

@mapleFU
Copy link
Member Author

mapleFU commented Sep 14, 2023

@pitrou Would you mind help take a look? Since the logic is easy and simple here...

@mapleFU
Copy link
Member Author

mapleFU commented Sep 18, 2023

ping @pitrou @westonpace @bkietz for help...

@bkietz
Copy link
Member

bkietz commented Sep 18, 2023

This isn't tested and we don't have follow up from the original issue's poster that this patch fixes the deadlock.

@mapleFU could you add (something like) the original repro as a test? It'd be best if we can confirm that main hangs with a single IO thread and use_threads=False, then confirm that we no longer hang with this fix.

# test.py
import pyarrow.dataset as pads

ds = pads.dataset("/tmp/test.parq", format="parquet") # can be any parquet
ds.count_rows(use_threads=False)

ARROW_IO_THREADS=1 python test.py

@yiteng-guo could you try out this patch and confirm it fixes the deadlock for you?

@mapleFU
Copy link
Member Author

mapleFU commented Sep 19, 2023

Hi @bkietz , I've write a test using a extended file-reader, it can reproduce the case.

class AsyncBufferReader : public ::arrow::io::BufferReader {
 public:
  explicit AsyncBufferReader(std::shared_ptr<Buffer> buffer, const ::arrow::io::IOContext& ctx): ::arrow::io::BufferReader(std::move(buffer)), ctx_(ctx) {}
  /// EXPERIMENTAL: Read data asynchronously.
  Future<std::shared_ptr<Buffer>> ReadAsync(const ::arrow::io::IOContext& ctx, int64_t position,
                                            int64_t nbytes) override {
    auto self = checked_pointer_cast<AsyncBufferReader>(shared_from_this());
    return DeferNotOk(ctx.executor()->Submit([self, position, nbytes]() {
      return self->ReadAt(position, nbytes);
    }));
  }

  const ::arrow::io::IOContext& io_context() const override {
    return ctx_;
  }

  private:
   ::arrow::io::IOContext ctx_;
};

However, default io-pool is widely used in scanner, it's a bit hard to hook all io_context( Might need to change EnsureCompleteMetadata to below) :

Status EnsureCompleteMetadata(parquet::arrow::FileReader* reader = NULLPTR, const std::shared_ptr<ScanOptions>& scan_options=NULLPTR);

So I change the global thread pool during testing, would you mind take a look?

@mapleFU mapleFU force-pushed the dataset/implement-sync-getreader branch from 6f07b53 to 6fe5f7e Compare September 19, 2023 05:47
cpp/src/arrow/dataset/file_parquet.cc Outdated Show resolved Hide resolved
Comment on lines 392 to 415
::arrow::io::IOContext default_io_context;
auto pool_executor =
dynamic_cast<::arrow::internal::ThreadPool*>(default_io_context.executor());
if (pool_executor == nullptr) {
GTEST_SKIP();
}
auto origin_capacity = pool_executor->GetCapacity();
ASSERT_OK(pool_executor->SetCapacity(/*threads=*/1));

// Reset capacity for pool_executor
struct PoolResetGuard {
PoolResetGuard(::arrow::internal::ThreadPool* pool, int origin_capacity)
: pool(pool), origin_capacity(origin_capacity) {}
~PoolResetGuard() {
Status s = pool->SetCapacity(origin_capacity);
if (!s.ok()) {
std::cerr << "Failed to reset pool capacity: " << s.ToString() << std::endl;
}
}

::arrow::internal::ThreadPool* pool;
int origin_capacity;
};
PoolResetGuard guard(pool_executor, origin_capacity);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use SetIOThreadPoolCapacity and GetIOThreadPoolCapacity here:

Suggested change
::arrow::io::IOContext default_io_context;
auto pool_executor =
dynamic_cast<::arrow::internal::ThreadPool*>(default_io_context.executor());
if (pool_executor == nullptr) {
GTEST_SKIP();
}
auto origin_capacity = pool_executor->GetCapacity();
ASSERT_OK(pool_executor->SetCapacity(/*threads=*/1));
// Reset capacity for pool_executor
struct PoolResetGuard {
PoolResetGuard(::arrow::internal::ThreadPool* pool, int origin_capacity)
: pool(pool), origin_capacity(origin_capacity) {}
~PoolResetGuard() {
Status s = pool->SetCapacity(origin_capacity);
if (!s.ok()) {
std::cerr << "Failed to reset pool capacity: " << s.ToString() << std::endl;
}
}
::arrow::internal::ThreadPool* pool;
int origin_capacity;
};
PoolResetGuard guard(pool_executor, origin_capacity);
// Reset capacity for pool_executor
struct PoolResetGuard {
int original_capacity = io::GetIOThreadPoolCapacity();
~PoolResetGuard() {
DCHECK_OK(io::SetIOThreadPoolCapacity(original_capacity));
}
} guard;

Since this directly mutates the IO thread pool (pointed to by the default IO context), you shouldn't need AsyncBufferReader::ReadAsync

I agree that it's odd BufferReader::ReadAsync and MemoryMappedFile::ReadAsync (the only overrides I see) ignore the io context argument. @lidavidm do you think they should transfer to the other executor if it's different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, firstly I'm trying to change ParquetFileFragment uses same io-executor, but I guess it should be put in another patch.

Copy link
Member Author

@mapleFU mapleFU Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's odd BufferReader::ReadAsync and MemoryMappedFile::ReadAsync (the only overrides I see) ignore the io context argument

I think BufferReader::ReadAsync is so lightweight that we can ignore ioContext?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given both access memory/"memory" I think that's quite intentional

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Sep 19, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 19, 2023
@mapleFU
Copy link
Member Author

mapleFU commented Sep 19, 2023

@bkietz I've tried to resolve the comments, would you mind take a look again?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Sep 19, 2023
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Sep 19, 2023
@bkietz bkietz merged commit 76c4a6e into apache:main Sep 19, 2023
@bkietz bkietz removed the awaiting merge Awaiting merge label Sep 19, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 76c4a6e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…rmat::GetReader` (apache#37514)

### Rationale for this change

As apache#37487 says. When thread cnt == 1, the thread might blocking in `ParquetFileFormat::GetReaderAsync`, that's because:

1. `ParquetFileFormat::CountRows` would call `EnsureCompleteMetadata` in `io_executor`
2. `EnsureCompleteMetadata` call `ParquetFileFormat::GetReader`, which dispatch real request to async mode
3. `async` is executed in `io_executor`.

1/3 in same fix-sized executor, causing deadlock.

### What changes are included in this PR?

Implement sync `ParquetFileFormat::GetReader`.

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Bugfix

* Closes: apache#37487

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…rmat::GetReader` (apache#37514)

### Rationale for this change

As apache#37487 says. When thread cnt == 1, the thread might blocking in `ParquetFileFormat::GetReaderAsync`, that's because:

1. `ParquetFileFormat::CountRows` would call `EnsureCompleteMetadata` in `io_executor`
2. `EnsureCompleteMetadata` call `ParquetFileFormat::GetReader`, which dispatch real request to async mode
3. `async` is executed in `io_executor`.

1/3 in same fix-sized executor, causing deadlock.

### What changes are included in this PR?

Implement sync `ParquetFileFormat::GetReader`.

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Bugfix

* Closes: apache#37487

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] FileFormat::GetReaderAsync hangs when ARROW_IO_THREADS is set to 1
4 participants