Trying to concat rows of ~55000 CSV files with a cumulative size of 1.4gb, xsv killed by oom_reaper #230

Closed
smabie opened this issue Jul 27, 2020 · 6 comments

smabie commented Jul 27, 2020

Hi, so I have 55161 CSV files in a directory (1.csv to 55161.csv). I'm trying to concat them all with:

xsv cat rows $(ls *.csv | sort -n) -o daily.csv

But xsv is being killed by the oom_reaper after exhausting all of my 32GB of RAM. I wouldn't really expect xsv cat to use very much memory at all, much less over 30GB!

Does anyone know what's going on?

BurntSushi (Owner) commented Jul 27, 2020

Please provide a reproduction. If you can't share the data, then please consider obfuscating or censoring it somehow. Indeed, this command should use very little memory. Its code is very simple and it is implemented in a straightforward streaming fashion:

xsv/src/cmd/cat.rs, lines 71 to 84 at commit 3de6c04:

fn cat_rows(&self) -> CliResult<()> {
    let mut row = csv::ByteRecord::new();
    let mut wtr = Config::new(&self.flag_output).writer()?;
    for (i, conf) in self.configs()?.into_iter().enumerate() {
        let mut rdr = conf.reader()?;
        if i == 0 {
            conf.write_headers(&mut rdr, &mut wtr)?;
        }
        while rdr.read_byte_record(&mut row)? {
            wtr.write_byte_record(&row)?;
        }
    }
    wtr.flush().map_err(From::from)
}

The only thing that's required is that each row must fit into memory.

smabie (Author) commented Jul 27, 2020

Okay, here's a link to the tarball: https://drive.google.com/file/d/19UdCh9qFeuZsy1JOYUQvEPl773EVuvVc/view

So, steps to reproduce:

tar xf data.tar.gz
cd data
xsv cat rows *.csv -o out.csv

By looking at top, you'll see that xsv consumes more and more memory until it is killed by the oom_reaper.
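
If a single concrete number is more useful than watching top, GNU time, assuming it is installed at /usr/bin/time, reports peak memory among its statistics:

$ /usr/bin/time -v xsv cat rows *.csv -o out.csv

The "Maximum resident set size" line in its report shows how much memory the process reached before being killed.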

Thanks for the help!

smabie (Author) commented Jul 27, 2020

Oh, and:

$ xsv --version
0.13.0

BurntSushi (Owner) commented

Thank you for the easy reproduction! Unfortunately, this is a problem with the argv parser that xsv uses: docopt/docopt.rs#207

At some point, I'd like to move off that parser and use clap instead. But it's a big refactor.

The only work-around available to you, I think, is to chunk it up into multiple xsv processes. The simplest way to do that is with xargs:

$ find ./ -name '*.csv' -print0 | xargs -0 -n1000 xsv cat rows > ../out.csv
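
If the numeric order of 1.csv through 55161.csv matters for the output, a hypothetical variant of the same chunking idea, assuming GNU sort (its -z option handles NUL-delimited input), is:

$ printf '%s\0' *.csv | sort -z -n | xargs -0 -n1000 xsv cat rows > ../out.csv

Each batch is still a separate xsv process, so no single invocation hands the argv parser more than 1000 file names.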

BurntSushi added the bug label Jul 28, 2020
smabie (Author) commented Jul 29, 2020

Thanks, I ended up just using awk instead:

$ awk '(NR==1)||(FNR>1)' $(ls *.csv | sort -n) > daily.csv

Quite elegant! But I digress. I never thought I would see the day when a command-line parser eats all of my RAM. It's probably trying to do something far too clever!

smabie closed this as completed Jul 29, 2020
BurntSushi (Owner) commented Jul 29, 2020

awk can't parse CSV correctly, so I'd be careful with that. That one-liner assumes the header record of each file only spans a single physical line, which might be true in your case but isn't in general.
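
For example, with a hypothetical pair of files whose header record spans two physical lines because a quoted field contains a newline, the awk one-liner leaks part of the second file's header into the data:

$ printf 'name,"notes\nand more",value\na,b,1\n' > 1.csv
$ printf 'name,"notes\nand more",value\nc,d,2\n' > 2.csv
$ awk '(NR==1)||(FNR>1)' 1.csv 2.csv
name,"notes
and more",value
a,b,1
and more",value
c,d,2

A real CSV parser treats the first two physical lines as one header record, so xsv cat rows should write it once and keep only the data rows from the second file.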

It's probably trying to do something far too clever!

I wrote the parser and abandoned it ages ago, because of this and other problems. The specific problem is that it uses backtracking to implement the "docopt" style. So it goes exponential in the worst case. I'd say it's decidedly not clever.
