[C++][Parquet] Iterating over Parquet RecordBatchReader uses memory equivalent to whole file size #46935

Description

@adamreeve

When reading a Parquet file with FileReader::GetRecordBatchReader and default options, memory usage increases while iterating over batches until it reaches roughly the size of all the data in the file.

This may be expected behaviour, but it was quite surprising to me, so I want to open this issue to discuss whether it can be improved.

I have my test code in a branch on my fork: main...adamreeve:arrow:mem_use_test

Writing a test file with 100 row groups of 40 MB each (10 float columns × 1,000,000 rows × 4 bytes per value per row group; 4 GB total):

// Approximate includes and using-directives; the full test is in the linked branch.
#include <gtest/gtest.h>

#include "arrow/io/file.h"
#include "arrow/testing/util.h"  // ::arrow::random_real
#include "parquet/api/writer.h"
#include "parquet/exception.h"

using namespace parquet;  // NOLINT
using parquet::schema::NodePtr;
using parquet::schema::PrimitiveNode;

TEST(TestStreamFile, WriteFile) {
  const std::string file_path = "/tmp/stream_test.parquet";
  constexpr int64_t num_row_groups = 100;
  constexpr int64_t rows_per_row_group = 1000000;
  constexpr int64_t num_columns = 10;

  PARQUET_ASSIGN_OR_THROW(
      const std::shared_ptr<::arrow::io::FileOutputStream> file,
      ::arrow::io::FileOutputStream::Open(file_path));

  WriterProperties::Builder writer_properties_builder;
  auto writer_properties = writer_properties_builder.build();

  // Schema: num_columns required float columns named x0, x1, ...
  std::vector<NodePtr> fields;
  for (auto col_idx = 0; col_idx < num_columns; ++col_idx) {
    fields.push_back(PrimitiveNode::Make(
        "x" + std::to_string(col_idx), Repetition::REQUIRED, Type::FLOAT));
  }
  auto schema = std::dynamic_pointer_cast<schema::GroupNode>(
      schema::GroupNode::Make("root", Repetition::REQUIRED, fields));
  std::unique_ptr<ParquetFileWriter> writer =
      ParquetFileWriter::Open(file, schema, writer_properties, nullptr);

  // Write each row group column by column, refilling the buffer with
  // deterministic random data (seeded per column) before each write.
  std::vector<float> buffer(rows_per_row_group);
  for (auto row_group_idx = 0; row_group_idx < num_row_groups; ++row_group_idx) {
    auto row_group = writer->AppendRowGroup();
    for (auto col_idx = 0; col_idx < num_columns; ++col_idx) {
      ::arrow::random_real(rows_per_row_group, row_group_idx * num_columns + col_idx,
                           -1.0, 1.0, &buffer);
      auto column_writer = row_group->NextColumn();
      auto& float_column_writer = dynamic_cast<FloatWriter&>(*column_writer);
      float_column_writer.WriteBatch(rows_per_row_group, nullptr, nullptr,
                                     buffer.data());
    }
    row_group->Close();
  }

  writer->Close();
}

Reading the file:

// Approximate includes and using-directives; the full test is in the linked branch.
#include <algorithm>
#include <iostream>

#include <gtest/gtest.h>

#include "arrow/io/file.h"
#include "arrow/record_batch.h"
#include "arrow/util/io_util.h"  // ::arrow::internal::GetCurrentRSS
#include "parquet/arrow/reader.h"
#include "parquet/exception.h"
#include "parquet/properties.h"

using namespace parquet;         // NOLINT
using namespace parquet::arrow;  // FileReader, FileReaderBuilder

TEST(TestStreamFile, ReadFile) {
  const std::string file_path = "/tmp/stream_test.parquet";
  PARQUET_ASSIGN_OR_THROW(
      std::shared_ptr<::arrow::io::ReadableFile> input_file,
      ::arrow::io::ReadableFile::Open(file_path, ::arrow::default_memory_pool()));

  ReaderProperties reader_properties;
  ArrowReaderProperties arrow_reader_properties;
  // arrow_reader_properties.set_pre_buffer(false);

  FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.Open(input_file, reader_properties));
  builder.properties(arrow_reader_properties);

  int batchesRead = 0;
  int64_t maxRss = 0;
  {
    std::unique_ptr<FileReader> reader;
    PARQUET_THROW_NOT_OK(builder.Build(&reader));

    PARQUET_ASSIGN_OR_THROW(
        std::shared_ptr<::arrow::RecordBatchReader> batch_reader,
        reader->GetRecordBatchReader());

    // Iterate over all batches, recording the process RSS after each one.
    while (true) {
      std::shared_ptr<::arrow::RecordBatch> batch;
      PARQUET_THROW_NOT_OK(batch_reader->ReadNext(&batch));
      if (batch == nullptr) {
        break;
      }
      int64_t rss = ::arrow::internal::GetCurrentRSS();
      std::cout << "Batch " << batchesRead << ", RSS = "
                << (rss / (double)(1024 * 1024)) << " MB" << std::endl;
      maxRss = std::max(maxRss, rss);
      batchesRead++;
    }
  }

  std::cout << "Read " << batchesRead << " batches" << std::endl;
  std::cout << "Max RSS = " << (maxRss / (double)(1024 * 1024)) << " MB" << std::endl;
}

When running the read test, the output looks like:

Batch 0, RSS = 1151.82 MB
Batch 1, RSS = 1151.82 MB
...
Batch 318, RSS = 1151.82 MB
Batch 319, RSS = 1151.82 MB
Batch 320, RSS = 2175.82 MB
Batch 321, RSS = 2175.82 MB
...
Batch 669, RSS = 2175.82 MB
Batch 670, RSS = 2175.82 MB
Batch 671, RSS = 3199.82 MB
Batch 672, RSS = 3199.82 MB
...
Batch 1005, RSS = 3199.82 MB
Batch 1006, RSS = 3199.82 MB
Batch 1007, RSS = 4223.82 MB
Batch 1008, RSS = 4223.82 MB
...
Batch 1340, RSS = 4223.82 MB
Batch 1341, RSS = 4223.82 MB
Batch 1342, RSS = 5247.82 MB
Batch 1343, RSS = 5247.82 MB
...
Batch 1524, RSS = 5247.82 MB
Batch 1525, RSS = 5247.82 MB
Read 1526 batches
Max RSS = 5247.82 MB

From some experimenting, I found that disabling pre-buffering (uncommenting the set_pre_buffer(false) line in the test above) greatly reduces memory use:

Batch 0, RSS = 1079.81 MB
Batch 1, RSS = 1079.8 MB
...
Batch 1524, RSS = 1079.8 MB
Batch 1525, RSS = 1079.8 MB
Read 1526 batches
Max RSS = 1079.81 MB
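
For reference, pre-buffering is controlled through ArrowReaderProperties. A short sketch of the two configurations compared above, plus a lazy-cache variant from arrow/io/caching.h that I have not tested:

// With default properties, pre-buffering is enabled and coalesced ranges
// covering whole row groups are cached up front.
ArrowReaderProperties arrow_reader_properties;

// The configuration that reduced memory use: read column chunks on demand.
arrow_reader_properties.set_pre_buffer(false);

// Untested alternative: keep pre-buffering but coalesce ranges lazily, so
// reads are only issued when a range is first requested (arrow/io/caching.h).
arrow_reader_properties.set_pre_buffer(true);
arrow_reader_properties.set_cache_options(::arrow::io::CacheOptions::LazyDefaults());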

This memory use still looked a bit high to me, but the max RSS reported by /usr/bin/time -v was a lot lower: about 94 MB without pre-buffering, compared to about 4.7 GB with pre-buffering.
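
One way to separate allocator retention from live Arrow allocations would be to also log the default memory pool's statistics next to the RSS inside the read loop; a minimal sketch, assuming all reads go through the default pool:

int64_t rss = ::arrow::internal::GetCurrentRSS();
auto* pool = ::arrow::default_memory_pool();
std::cout << "RSS = " << (rss / (double)(1024 * 1024)) << " MB"
          << ", pool bytes_allocated = " << pool->bytes_allocated()
          << ", pool max_memory = " << pool->max_memory()
          << " (" << pool->backend_name() << ")" << std::endl;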

From looking at the code, I can see there is a cache of futures of buffers in ReadRangeCache::Impl.

Unless I'm missing something, once a buffer is stored in this cache it is never removed, which would explain the memory usage behaviour. Should buffers be evicted from this cache once they've been read, to reduce memory usage?
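
To illustrate the kind of eviction I have in mind, here is a hypothetical standalone sketch; this is not Arrow's actual ReadRangeCache API, and EvictingRangeCache, Insert, and Take are all made up for illustration:

#include <cstdint>
#include <map>
#include <memory>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical evict-on-read cache keyed by (offset, length) ranges.
struct ReadRange {
  int64_t offset;
  int64_t length;
  bool operator<(const ReadRange& other) const {
    return std::tie(offset, length) < std::tie(other.offset, other.length);
  }
};

class EvictingRangeCache {
 public:
  void Insert(ReadRange range, std::shared_ptr<std::vector<uint8_t>> buffer) {
    entries_[range] = std::move(buffer);
  }

  // Hand the buffer to the caller and drop the cache's own reference, so the
  // memory can be reclaimed once the caller releases it.
  std::shared_ptr<std::vector<uint8_t>> Take(const ReadRange& range) {
    auto it = entries_.find(range);
    if (it == entries_.end()) {
      return nullptr;  // not cached; a real reader would fall back to an IO call
    }
    auto buffer = std::move(it->second);
    entries_.erase(it);  // the eviction step the current cache doesn't perform
    return buffer;
  }

 private:
  std::map<ReadRange, std::shared_ptr<std::vector<uint8_t>>> entries_;
};

One complication is that a single cached entry can be the coalesced result of several smaller requested ranges, so a real implementation would probably need to keep an entry alive until all of its sub-ranges have been consumed.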

Component(s)

C++, Parquet
