[Parquet] Too many open files (os error 24) #47

Closed
alamb opened this issue Apr 26, 2021 · 11 comments
Labels
parquet Changes to the parquet crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6154

I used the Rust parquet-read binary to read a deeply nested Parquet file and got the stack trace below. Unfortunately, I won't be able to upload the file.
{code:java}
stack backtrace:
   0: std::panicking::default_hook::{{closure}}
   1: std::panicking::default_hook
   2: std::panicking::rust_panic_with_hook
   3: std::panicking::continue_panic_fmt
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: core::result::unwrap_failed
   7: parquet::util::io::FileSource::new
   8: <parquet::file::reader::SerializedRowGroupReader as parquet::file::reader::RowGroupReader>::get_column_page_reader
   9: <parquet::file::reader::SerializedRowGroupReader as parquet::file::reader::RowGroupReader>::get_column_reader
  10: parquet::record::reader::TreeBuilder::reader_tree
  11: parquet::record::reader::TreeBuilder::reader_tree
  12: parquet::record::reader::TreeBuilder::reader_tree
  13: parquet::record::reader::TreeBuilder::reader_tree
  14: parquet::record::reader::TreeBuilder::reader_tree
  15: parquet::record::reader::TreeBuilder::build
  16: <parquet::record::reader::RowIter as core::iter::traits::iterator::Iterator>::next
  17: parquet_read::main
  18: std::rt::lang_start::{{closure}}
  19: std::panicking::try::do_call
  20: __rust_maybe_catch_panic
  21: std::rt::lang_start_internal
  22: main
{code}

@alamb added the arrow label (Changes to the arrow crate) on Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Chao Sun(csun) @ 2019-08-07T06:02:08.709+0000:

Thanks for reporting. Do you have a rough idea of how deeply nested the data type is? Is there any error message? It would be great if we could reproduce this.

Comment from Yesh(madras) @ 2019-08-07T11:35:10.840+0000:

Thanks for the ack. Below is the error message. An additional data point is that parquet-schema is able to dump the schema of the same file.
{code:java}
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("underlying IO error: Too many open files (os error 24)")', src/libcore/result.rs:1084:5
{code}

Comment from Ahmed Riza(dr.riza@gmail.com) @ 2021-02-12T22:52:01.045+0000:

I've come across the same error. In my case it appears to be caused by the `try_clone` calls in https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82. I have a Parquet file with 3000 columns (see attached example), and the `try_clone` calls eventually fail because they end up creating too many open file descriptors.

Here's a stack trace from `gdb` which leads to the call in `io.rs`.   This can be reproduced by using the attached Parquet file.

One could raise `ulimit -n` on Linux to get around this, but that is not really a solution, since the code path can still create a very large number of open file descriptors (one for each column in the Parquet file).

This is the initial stack trace, taken when the footer is first read. `FileSource::new` (in io.rs) is subsequently also called for every column when the columns are read (see `fn reader_tree` in `parquet/record/reader.rs`).

 
{code:java}
#0  parquet::util::io::FileSource::new (fd=0x7ffff7c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82
#1  0x00005555558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x7ffff7c3fafc, start=807191, length=65536)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59
#2  0x000055555590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x7ffff7c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57
#3  0x0000555555845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...)
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134
#4  0x0000555555845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81
#5  0x0000555555845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7ffff0000d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90
#6  0x0000555555845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-00001-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet")
    at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98
#7  0x000055555577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103
{code}
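
The descriptor growth described in this comment can be illustrated with a small standard-library sketch. It does not use the parquet crate: `make_scratch_file` and `clone_per_column` are hypothetical helpers, and the "columns" are just a loop count standing in for one `FileSource::new` call per column. It assumes a Unix-like system, where each `File::try_clone` duplicates the descriptor.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::os::unix::io::{AsRawFd, RawFd};
use std::path::{Path, PathBuf};

/// Hypothetical helper: writes a small scratch file into the system temp directory.
fn make_scratch_file(name: &str) -> io::Result<PathBuf> {
    let path = std::env::temp_dir().join(name);
    let mut f = File::create(&path)?;
    f.write_all(b"stand-in for a parquet file")?;
    Ok(path)
}

/// Hypothetical helper: opens `path` once, then calls `try_clone` once per
/// "column" and keeps every clone alive, the way one reader per column would.
/// Returns the descriptor numbers that were open simultaneously.
fn clone_per_column(path: &Path, columns: usize) -> io::Result<Vec<RawFd>> {
    let file = File::open(path)?;
    let mut handles = Vec::with_capacity(columns);
    for _ in 0..columns {
        // Each clone is a new OS-level descriptor, not a shared one.
        // Once the process limit is reached this returns an error instead.
        handles.push(file.try_clone()?);
    }
    Ok(handles.iter().map(|h| h.as_raw_fd()).collect())
}
```

Calling `clone_per_column` with a column count above the process limit reported by `ulimit -n` should return an `Err` rather than panic, because the sketch propagates the error with `?` where the reported code path called `unwrap()`.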

@alamb added the parquet label (Changes to the parquet crate) and removed the arrow label (Changes to the arrow crate) on Apr 26, 2021
@capkurmagati

I also encountered this issue when using InfluxDB IOx with MinIO.

Aug 18 23:40:50.688 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aug 18 23:40:50.690 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.690 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.690 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Parquet reader thread terminated due to error: IoError(Os { code: 24, kind: Other, message: "Too many open files" })
Aug 18 23:40:50.691 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.691 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.694 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.694 ERROR panic_logging: thread 'IOx Query Executor Thread' panicked at 'UNKNOWN', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
thread 'IOx Query Executor Thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Other, message: "Too many open files" }', /Users/jason/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-5.2.0/src/util/io.rs:82:50
Aug 18 23:40:50.695  INFO influxdb_iox::influxdb_ioxd::rpc::flight: err=Query { database_name: "local_database", source: DataFusionExecution { source: ArrowError(ExternalError(CreatingParquetReader { source: IoError(Os { code: 24, kind: Other, message: "Too many open files" }) })) } } msg="Error handling Flight gRPC request"

Additional info:
The directory structure follows. Each directory has 1-3 parquet files.

s3cmd ls s3://sensors/1/local_database/data/cpu/
                          DIR  s3://sensors/1/local_database/data/cpu/2021-08-17 12:00:00/
                          DIR  s3://sensors/1/local_database/data/cpu/2021-08-17 13:00:00/
                          DIR  s3://sensors/1/local_database/data/cpu/2021-08-17 14:00:00/
                          DIR  s3://sensors/1/local_database/data/cpu/2021-08-17 15:00:00/
20+ more dirs...

@alamb
Contributor Author

alamb commented Aug 18, 2021

@capkurmagati I wonder if you might be able to work around the issue by raising the maximum number of open files. Something like

ulimit -n 20000

@capkurmagati

@alamb Yes, it works. Actually, I had already tweaked the value a bit before posting here, while trying to reproduce the error. However, I observed that the engine does not always return an error for a given table scan (I used a limit clause to try to control which files are read), so I thought it might be a bug. On second thought, I guess the engine may cache the data, so that it doesn't scan the same number of files each time. So my problem is unrelated to this issue. Thanks.

@alamb
Contributor Author

alamb commented Aug 19, 2021

I think something else that might be related is the fact that (currently) DataFusion execution tries to start all partitions concurrently. This means that depending on how fast IO comes in and the details of the Tokio scheduler, sometimes it will have far too many open files at once (it might end up opening 100 input parquet files, for example, even if there are only 8 cores available for processing) -- @andygrove has mentioned the Ballista scheduler is more sophisticated in this area and hopefully we can move some of those improvements down into the core DataFusion engine

@andygrove
Member

That's right. Ballista avoids this issue by limiting the number of concurrent tasks. However, Ballista has its own related issues where it will generate an excessive number of shuffle files and potentially run into inode limits, so neither solution is as scalable as we would like yet.
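
The idea of capping the number of concurrent tasks can be sketched with the standard library alone. This is not DataFusion or Ballista code: `run_bounded` is a hypothetical function, and the "work" is a placeholder where a real engine would open and scan one input file per partition. A fixed number of worker threads pull partitions from a shared queue, so the number of partitions in flight (and therefore of files open at once) never exceeds the worker count.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical sketch: processes `partitions` units of work with at most
/// `max_concurrent` in flight. Returns (completed count, peak concurrency seen).
fn run_bounded(partitions: usize, max_concurrent: usize) -> (usize, usize) {
    let queue = Arc::new(Mutex::new((0..partitions).collect::<Vec<usize>>()));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let done = Arc::new(AtomicUsize::new(0));

    let workers: Vec<_> = (0..max_concurrent)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let in_flight = Arc::clone(&in_flight);
            let peak = Arc::clone(&peak);
            let done = Arc::clone(&done);
            thread::spawn(move || loop {
                // Take the next partition; the lock is released before work starts.
                let next = queue.lock().unwrap().pop();
                let _partition = match next {
                    Some(p) => p,
                    None => break, // queue drained, worker exits
                };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                // Placeholder for "open the partition's file and scan it".
                thread::yield_now();
                in_flight.fetch_sub(1, Ordering::SeqCst);
                done.fetch_add(1, Ordering::SeqCst);
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
    (done.load(Ordering::SeqCst), peak.load(Ordering::SeqCst))
}
```

With 100 partitions and `max_concurrent` set to 8, all 100 partitions complete while the observed peak stays at or below 8, regardless of how the threads are scheduled.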

@jorgecarleitao
Member

I have not read the code base thoroughly, but I remember something like this from when I skimmed through it:

This may be related to the fact that, AFAIK, we currently clone the file reader for every new seek. Thus, even for a single file, we usually open that file multiple times, once per seek (which is roughly once per row group and per column chunk, plus one to read the metadata).

@Dandandan
Contributor

For this as well as #924: a good start might be to limit the maximum number of threads used for spawn_blocking tasks. By default, up to 512 such threads can run concurrently.

See:

https://docs.rs/tokio/1.10.0/tokio/index.html#cpu-bound-tasks-and-blocking-code

@XeCycle

XeCycle commented Sep 6, 2021

I'm getting what I believe is the same error on files with many columns (>2k). FWIW, it can be worked around with a struct ArcFile(Arc<File>): just impl parquet::file::reader::ChunkReader for ArcFile, using FileExt::read_at on cfg(unix) and FileExt::seek_read on cfg(windows).

I guess we can modify parquet::util::io::FileSource to use Arc<File>, because we take ownership of provided file objects anyway.
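
A minimal sketch of the shared-handle part of this workaround, using only the standard library. It deliberately leaves out the `ChunkReader` impl, since the trait's exact methods depend on the parquet version in use; `read_range` is a hypothetical method name. The point it shows is that cloning an `ArcFile` bumps a reference count instead of creating a new descriptor, and that positional reads need no shared cursor.

```rust
use std::fs::File;
use std::io;
use std::sync::Arc;

/// Cheaply cloneable file handle: clones share one underlying descriptor.
#[derive(Clone)]
struct ArcFile(Arc<File>);

impl ArcFile {
    /// Hypothetical helper: reads exactly `length` bytes starting at byte
    /// offset `start`, using a positional read that takes `&File`.
    #[cfg(unix)]
    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        use std::os::unix::fs::FileExt;
        let mut buf = vec![0u8; length];
        self.0.read_exact_at(&mut buf, start)?;
        Ok(buf)
    }

    /// Windows variant: `seek_read` may return short reads, so loop until full.
    #[cfg(windows)]
    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        use std::os::windows::fs::FileExt;
        let mut buf = vec![0u8; length];
        let mut filled = 0;
        while filled < length {
            let n = self.0.seek_read(&mut buf[filled..], start + filled as u64)?;
            if n == 0 {
                return Err(io::ErrorKind::UnexpectedEof.into());
            }
            filled += n;
        }
        Ok(buf)
    }
}
```

Because every clone points at the same `File`, handing one `ArcFile` clone to each of several thousand column readers keeps the process at a single open descriptor for that file.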

@jinyius

jinyius commented Sep 8, 2022

Any update here, as it's been a year? I can provide some test Parquet files that trigger this issue if that helps.

@alamb
Contributor Author

alamb commented Sep 10, 2022

Hi @jinyius

There has been some non-trivial work by @tustvold to support reading Parquet files without having to clone file handles -- e.g. https://docs.rs/parquet/22.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html now takes a ChunkReader, which is implemented for Bytes.

https://docs.rs/parquet/22.0.0/parquet/file/reader/trait.ChunkReader.html

Thus, in order to read such a file, you can buffer it into Bytes https://docs.rs/parquet/22.0.0/parquet/file/reader/trait.ChunkReader.html#impl-ChunkReader-for-Bytes

Perhaps with something like this (untested):

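use std::fs::File;
use std::io::Read;
use bytes::Bytes;
use parquet::file::serialized_reader::SerializedFileReader;

let mut v = vec![];
// `mut` is required: `read_to_end` takes `&mut self`
let mut parquet_file: File = open_your_parquet_file();
// read the whole parquet file into memory (TODO error checking)
parquet_file.read_to_end(&mut v).unwrap();

// convert to Bytes, which implements ChunkReader, so we can read the file
let b: Bytes = v.into();
let reader = SerializedFileReader::new(b).unwrap();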

> any update here as it's been a year? i can provide some test parquet files that triggers this issue if that helps.

If you could provide an example file and the code you are using that shows the error, I would be happy to help try and apply the method above. If it works for you, I think we should update the documentation to explain this

@tustvold closed this as not planned on May 8, 2024