Confusing memory usage with CSV reader #623
What the CSV reader does in the CSV parser is reuse some allocations over time in a batch to reduce allocations / time. However, with a very small batch size of 10, this won't cause the high memory usage, but the data and metadata around a single `RecordBatch` still add a fixed per-batch overhead. So generally the total footprint is a trade-off between per-batch overhead and the size of the buffers in each batch.
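Roughly the pattern being described, as a hypothetical sketch (this is not arrow's actual parser code):

```rust
// Hypothetical sketch of the reuse idea above, not arrow's actual parser:
// one scratch buffer is cleared and refilled for every row in a batch,
// so its allocation is paid once per batch instead of once per row.
fn parse_batch(lines: &[&str]) -> Vec<Vec<String>> {
    let mut scratch = String::new(); // reused across all rows in the batch
    let mut rows = Vec::with_capacity(lines.len());
    for line in lines {
        scratch.clear(); // keeps the capacity, drops the old contents
        scratch.push_str(line);
        // real parsing (field splitting, type conversion) would go here
        rows.push(scratch.split(',').map(str::to_owned).collect());
    }
    rows
}
```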
Ahh, I think this was where a majority of my confusion was coming from - I should have had something after the pause that touches the batches again. The only bit that remains a mystery to me is: why does a giant batch size cause the process to use so much RAM? With the tweaked example:

```rust
use arrow::record_batch::RecordBatch;
use arrow::error::ArrowError;

fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();
    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(5_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();
    reader.collect()
}

fn main() {
    let batches = hmm();
    let mut total = 0;
    let mut total_bytes = 0;
    for r in &batches {
        let batch = r.as_ref().unwrap();
        for c in batch.columns() {
            total_bytes += c.get_array_memory_size();
        }
        total += batch.num_rows();
    }
    dbg!(total);
    dbg!(total_bytes);
    // Delay to measure process RAM usage
    let mut input = String::new();
    std::io::stdin().read_line(&mut input).unwrap();
    // Repeat the loop so the batches are still used after the pause
    for r in &batches {
        let batch = r.as_ref().unwrap();
        for c in batch.columns() {
            total_bytes += c.get_array_memory_size();
        }
        total += batch.num_rows();
    }
    dbg!(2, total);
}
```

...I get the following results:
The size reported by `get_array_memory_size` behaves about how I'd expect. However the process RAM seems to do the inverse of what I'd expect - it's like something is leaking from the parser, or an array is being over-allocated, or something like that?
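For what it's worth, plain Rust shows the same kind of gap between live data and what a process holds on to - an illustrative sketch, nothing arrow-specific:

```rust
// Illustrative only, nothing arrow-specific: a growable buffer's capacity
// can exceed its length, and the process RSS reflects the capacity the
// allocator handed out, not the bytes actually in use.
fn main() {
    let mut v: Vec<u8> = Vec::new();
    for i in 0..1_000_001u32 {
        v.push((i % 256) as u8);
    }
    // len is exactly what was pushed; capacity is whatever the doubling
    // growth strategy reserved along the way.
    println!("len = {}, capacity = {}", v.len(), v.capacity());
    v.shrink_to_fit(); // hand the slack back to the allocator
    println!("after shrink_to_fit: capacity = {}", v.capacity());
}
```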
I ran the example code with a different global allocator swapped in. If I understand right, this explains the remaining mystery (why a giant batch size causes the process to use lots of memory). In my "ELI5 level" knowledge of memory allocators: the allocator requests big regions from the OS and hangs on to freed memory for reuse, so what `top` reports is closer to the allocator's high-water mark than to the data that is actually live - and a few giant batch-sized allocations push that mark way up.
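Concretely, swapping the global allocator looks something like this (a sketch - assuming it was the `jemallocator` crate that was tried here):

```rust
// Sketch of swapping the global allocator to compare RSS behaviour.
// Assumes `jemallocator = "0.3"` in Cargo.toml - which allocator was
// actually used in this comment is a guess.
use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // run the same CSV-reading experiment as above and watch `top`
}
```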
I might try and make a PR to add some basic docs around this. Thanks @Dandandan !
Describe the bug
Using the `arrow::csv::ReaderBuilder` with something like the `worldcitiespop_mil.csv` mentioned on this page, I was experimenting with the batch size setting in a standalone script, and it impacted the RAM usage in a surprising way:
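The script was essentially the following (a sketch reconstructed from the refactored version quoted earlier in the thread):

```rust
// Sketch reconstructed from the refactored version shown earlier in the
// thread: the same reader setup, inlined in main.
use arrow::record_batch::RecordBatch;
use arrow::error::ArrowError;

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();
    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(5_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();
    let batches: Vec<Result<RecordBatch, ArrowError>> = reader.collect();
    let total: usize = batches
        .iter()
        .map(|r| r.as_ref().unwrap().num_rows())
        .sum();
    dbg!(total);
    // Pause so RAM usage can be read off `top`
    let mut input = String::new();
    std::io::stdin().read_line(&mut input).unwrap();
}
```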
If I run it like so: `cargo +1.53 run --release -- ./worldcitiespop_mil.csv 10`

..according to `top | grep arrcsv` the RAM usage is something like 5MB. If I increase `10` to `100,000` the RAM usage goes to maybe 30MB. Add another zero and the RAM usage is 255MB.

Not being too familiar with arrow, I would have expected the memory usage to stay roughly the same across batch sizes. However the opposite seems to be true, and the usage seems kind of oddly high and, mainly, unpredictable.
While making this minimal example, I had a thought that maybe the `arrow::csv::Reader` was still being kept around and it was using the memory, not the `Vec<RecordBatch>` - so I refactored it into a method, had it return the `RecordBatch`es, so the reader should have been dropped.

...but even more surprisingly, the memory usage drastically increased with this change.
To Reproduce

- `main.rs` as one of my terrible lumps of code above. Only dependency is `arrow = "5.0.0"`
- `cargo +1.53 run --release -- ./worldcitiespop_mil.csv 1000` etc
- Watch the process RAM (`top | grep ...` - thus the stdin-reading line in the code)

Expected behavior
Mostly covered above - but basically I'd expect the memory usage with all of these combinations to be "quite similar".
Additional context
I've not used arrow much, so it's very much possible I'm doing something strange or incorrect!
Versions of stuff: `arrow = "5.0.0"`, cargo/rustc `1.53`.