[Parquet] Too many open files (os error 24) #47
Comments
Comment from Chao Sun(csun) @ 2019-08-07T06:02:08.709+0000:
Thanks for reporting. Do you have rough idea how deep the nested data type is? is there any error message? would be great if we can reproduce this.

Comment from Yesh(madras) @ 2019-08-07T11:35:10.840+0000:
Thanks for ack. Below is the error message. Additional data point is that it is able to dump schema via parquet-schema.
{code:java}
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: General("underlying IO error: Too many open files (os error 24)")', src/libcore/result.rs:1084:5
{code}

Comment from Ahmed Riza(dr.riza@gmail.com) @ 2021-02-12T22:52:01.045+0000:
I've come across the same error. In my case it appears to be due to the `try_clone` calls in https://github.com/apache/arrow/blob/master/rust/parquet/src/util/io.rs#L82
I have a Parquet file with 3000 columns (see attached example), and the `try_clone` calls here eventually fail as it ends up creating too many open file descriptors.
Here's a stack trace from `gdb` which leads to the call in `io.rs`. This can be reproduced by using the attached Parquet file. One could increase the `ulimit -n` on Linux to get around this, but not really a solution, since the code path ends up just creating potentially a very large number of open file descriptors (one for each column in the Parquet file).
This is the initial stack trace when the footer is first read. `FileSource::new` (in io.rs) gets called for every column subsequently as well when reading the columns (see fn reader_tree in `parquet/record/reader.rs`)
{code:java}
#0 parquet::util::io::FileSource::new (fd=0x7ffff7c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/util/io.rs:82
#1 0x00005555558294ce in parquet::file::serialized_reader::{{impl}}::get_read (self=0x7ffff7c3fafc, start=807191, length=65536) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:59
#2 0x000055555590a3fc in parquet::file::footer::parse_metadata (chunk_reader=0x7ffff7c3fafc) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/footer.rs:57
#3 0x0000555555845db1 in parquet::file::serialized_reader::SerializedFileReader::new (chunk_reader=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:134
#4 0x0000555555845bb6 in parquet::file::serialized_reader::{{impl}}::try_from (file=...) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:81
#5 0x0000555555845c4a in parquet::file::serialized_reader::{{impl}}::try_from (path=0x7ffff0000d20) at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:90
#6 0x0000555555845d34 in parquet::file::serialized_reader::{{impl}}::try_from (path="resources/parquet/part-00001-33e6c49b-d6cb-4175-bc41-7198fd777d3a-c000.snappy.parquet") at /home/a/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-3.0.0/src/file/serialized_reader.rs:98
#7 0x000055555577c7f5 in data_rust::parquet::parquet_demo::test::test_read_multiple_files () at /work/rust/data-rust/src/parquet/parquet_demo.rs:103
{code}
|
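The descriptor growth described in the comment above can be illustrated with the standard library alone. The following is a minimal sketch, assuming a Unix target; `count_descriptors_after_cloning` is a hypothetical helper written for this illustration and is not part of the parquet crate. It opens a file once, calls `File::try_clone` repeatedly while keeping every handle alive, and counts the distinct raw descriptors held.

```rust
use std::collections::HashSet;
use std::fs::File;
use std::io;
use std::os::unix::io::{AsRawFd, RawFd};
use std::path::Path;

/// Hypothetical helper: opens `path` once, clones the handle `clones` times,
/// and returns how many distinct file descriptors are held at the same time.
fn count_descriptors_after_cloning(path: &Path, clones: usize) -> io::Result<usize> {
    let original = File::open(path)?;
    let mut handles = Vec::with_capacity(clones);
    for _ in 0..clones {
        // Each successful try_clone duplicates the descriptor, so it takes
        // another slot in the per-process descriptor table.
        handles.push(original.try_clone()?);
    }
    // All handles are still alive here, so every descriptor number is unique.
    let mut fds: HashSet<RawFd> = handles.iter().map(|f| f.as_raw_fd()).collect();
    fds.insert(original.as_raw_fd());
    Ok(fds.len())
}
```

With one clone per column, a 3000-column file would need roughly 3000 descriptors held at once, which exceeds a typical default soft limit and would surface as the `os error 24` reported here.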
I also encountered the issue when using
Additional info:
|
@capkurmagati I wonder if you might be able to work around the issue by raising the maximum number of open files. Something like
|
@alamb Yes, it works. Actually I tweaked the value a bit before I posted here trying to reproduce the error. However I observed that the engine does not always return an error for a certain table scan. (I used a |
I think something else that might be related is the fact that (currently) DataFusion execution tries to start all partitions concurrently. This means that depending on how fast IO comes in and the details of the Tokio scheduler, sometimes it will have far too many open files at once (it might end up opening 100 input parquet files, for example, even if there are only 8 cores available for processing) -- @andygrove has mentioned the Ballista scheduler is more sophisticated in this area and hopefully we can move some of those improvements down into the core DataFusion engine |
That's right. Ballista avoids this issue by limiting the number of concurrent tasks. However, Ballista has its own related issues where it will generate an excessive number of shuffle files and potentially run into inode limits, so neither solution is as scalable as we would like yet. |
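The idea of capping concurrent tasks, as discussed in the two comments above, can be sketched with a fixed pool of worker threads that pull work from a shared counter. This is only an illustration built on the standard library. It is not how DataFusion or Ballista schedule partitions, and `run_with_limit` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: runs `job(0..job_count)` on at most `limit` threads
/// and returns the highest number of jobs observed running at the same time.
fn run_with_limit<F>(job_count: usize, limit: usize, job: F) -> usize
where
    F: Fn(usize) + Send + Sync + 'static,
{
    let next = Arc::new(AtomicUsize::new(0)); // index of the next job to claim
    let running = Arc::new(AtomicUsize::new(0)); // jobs currently executing
    let peak = Arc::new(AtomicUsize::new(0)); // highest value `running` reached
    let job = Arc::new(job);

    let workers: Vec<_> = (0..limit.max(1))
        .map(|_| {
            let next = Arc::clone(&next);
            let running = Arc::clone(&running);
            let peak = Arc::clone(&peak);
            let job = Arc::clone(&job);
            thread::spawn(move || loop {
                // Claim the next job index; stop when none are left.
                let idx = next.fetch_add(1, Ordering::SeqCst);
                if idx >= job_count {
                    break;
                }
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                (*job)(idx); // e.g. open and scan one parquet file
                running.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();

    for worker in workers {
        worker.join().expect("worker thread panicked");
    }
    peak.load(Ordering::SeqCst)
}
```

If each job opens one input file, the number of files open at once is bounded by `limit` rather than by the number of partitions, at the cost of some idle IO capacity when the limit is set low.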
I have not read the code base thoroughly, but from what I remember when skimming it, this may be related to the fact that, AFAIK, we currently clone the file reader for every new seek. Thus, even for a single file, we usually open that file multiple times, once per seek (which is roughly one per |
For this as well as #924: a good start might be to limit the maximum number of threads that are used for See: https://docs.rs/tokio/1.10.0/tokio/index.html#cpu-bound-tasks-and-blocking-code |
I'm getting what I believe is the same error on files with many columns (>2k) and, FWIW, it can be worked around by a I guess we can modify |
Any update here, as it's been a year? I can provide some test parquet files that trigger this issue if that helps. |
Hi @jinyius, there has been some non-trivial work by @tustvold to support reading parquet files without having to clone filehandles -- e.g. https://docs.rs/parquet/22.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html now takes a https://docs.rs/parquet/22.0.0/parquet/file/reader/trait.ChunkReader.html Thus, in order to read such a file, you can buffer it into
Perhaps with something like this (untested):
use std::fs::File;
use std::io::Read;
use bytes::Bytes;
use parquet::file::serialized_reader::SerializedFileReader;

let mut v = vec![];
// open_your_parquet_file() is a placeholder; the binding is `mut` because read_to_end takes &mut self
let mut parquet_file: File = open_your_parquet_file();
// read parquet into memory (TODO error checking)
parquet_file.read_to_end(&mut v).unwrap();
// convert to Bytes so we can read the file
let b: Bytes = v.into();
let reader = SerializedFileReader::new(b).unwrap();
If you could provide an example file and the code you are using that shows the error, I would be happy to help try and apply the method above. If it works for you, I think we should update the documentation to explain this |
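To show why buffering sidesteps the descriptor limit, here is a standard-library sketch of the same idea without the parquet crate. `InMemoryFile` and `read_range` are hypothetical names for this illustration and only loosely mirror what a chunk reader over in-memory bytes does: the file is opened once, read fully, and every later range read is a borrowed view of the buffer.

```rust
use std::fs::File;
use std::io::{self, Cursor, Read};
use std::path::Path;

/// Hypothetical stand-in for a fully buffered file.
struct InMemoryFile {
    data: Vec<u8>,
}

impl InMemoryFile {
    /// Opens the file once, reads it to the end, and drops the descriptor.
    fn load(path: &Path) -> io::Result<Self> {
        let mut data = Vec::new();
        File::open(path)?.read_to_end(&mut data)?;
        Ok(Self { data })
    }

    /// Returns an independent reader over `start..start + length`.
    /// No descriptor is opened or cloned; the reader borrows the buffer.
    fn read_range(&self, start: usize, length: usize) -> io::Result<Cursor<&[u8]>> {
        let end = start
            .checked_add(length)
            .filter(|&end| end <= self.data.len())
            .ok_or_else(|| {
                io::Error::new(io::ErrorKind::UnexpectedEof, "range past end of buffer")
            })?;
        Ok(Cursor::new(&self.data[start..end]))
    }
}
```

Any number of range readers can coexist this way, one per column chunk if needed, while the process holds no open descriptor for the file. The trade-off is that the whole file must fit in memory.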
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-6154
Used the [rust] parquet-read binary to read a deeply nested parquet file and saw the stack trace below. Unfortunately, I won't be able to upload the file.
{code:java}
stack backtrace:
0: std::panicking::default_hook::{{closure}}
1: std::panicking::default_hook
2: std::panicking::rust_panic_with_hook
3: std::panicking::continue_panic_fmt
4: rust_begin_unwind
5: core::panicking::panic_fmt
6: core::result::unwrap_failed
7: parquet::util::io::FileSource::new
8: <parquet::file::reader::SerializedRowGroupReader as parquet::file::reader::RowGroupReader>::get_column_page_reader
9: <parquet::file::reader::SerializedRowGroupReader as parquet::file::reader::RowGroupReader>::get_column_reader
10: parquet::record::reader::TreeBuilder::reader_tree
11: parquet::record::reader::TreeBuilder::reader_tree
12: parquet::record::reader::TreeBuilder::reader_tree
13: parquet::record::reader::TreeBuilder::reader_tree
14: parquet::record::reader::TreeBuilder::reader_tree
15: parquet::record::reader::TreeBuilder::build
16: <parquet::record::reader::RowIter as core::iter::traits::iterator::Iterator>::next
17: parquet_read::main
18: std::rt::lang_start::{{closure}}
19: std::panicking::try::do_call
20: __rust_maybe_catch_panic
21: std::rt::lang_start_internal
22: main
{code}