Pandas read_csv out of memory even after adding chunksize #16537

Closed
gk13 opened this issue May 30, 2017 · 19 comments
Labels
IO CSV (read_csv, to_csv), Usage Question

Comments

@gk13

gk13 commented May 30, 2017

Code Sample, a copy-pastable example if possible

dataframe = pandas.read_csv(inputFolder + dataFile, chunksize=1000000, na_values='null', usecols=fieldsToKeep, low_memory=False, header=0, sep='\t')
tables = map(lambda table: TimeMe(foo)(table, categoryExceptions), dataframe)

def foo(table, exceptions):
    """
    Modifies the columns of the dataframe in place to be categories, largely to save space.
    :type table: pandas.DataFrame
    :type exceptions: set columns not to modify.
    :rtype: pandas.DataFrame
    """
    for c in table:
        if c in exceptions:
            continue

        x = table[c]
        if str(x.dtype) != 'category':
            x.fillna('null', inplace=True)
            table[c] = x.astype('category', copy=False)
    return table

Problem description

I have a 34 GB TSV file and I've been reading it using the pandas read_csv function with chunksize specified as 1000000. The command above works fine with an 8 GB file, but pandas crashes for my 34 GB file, subsequently crashing my IPython notebook.

@gk13 gk13 changed the title Pandas readcsv out of memory even after adding chunksize Pandas read_csv out of memory even after adding chunksize May 30, 2017
@jreback
Contributor

jreback commented May 30, 2017

Please show pd.show_versions().

If the above is all you are doing, then it should work. Exactly where does it run out of memory?

@gk13
Author

gk13 commented May 30, 2017

I'm running it on a Jupyter notebook, and it crashes (Kernel dies) after processing 124 chunks of this data.

@gk13
Author

gk13 commented May 30, 2017

There is no Error in the output, the notebook crashes before that

@TomAugspurger
Contributor

Not to be pedantic, but are you sure your file is tab-separated? I've had an issue where I passed the wrong separator, and pandas tried to construct a single giant string, which blew up memory.
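A quick way to sanity-check the separator before handing a large file to pandas (a hedged sketch, not from this thread; the path is hypothetical):

# Count candidate delimiters in the first few lines. If tabs don't clearly
# dominate, the parser can end up building one giant string per row, which
# is exactly the kind of thing that blows up memory.
path = 'data.tsv'  # hypothetical path
with open(path, 'r', encoding='utf-8') as f:
    for _ in range(3):
        line = f.readline()
        print('tabs:', line.count('\t'), 'commas:', line.count(','))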

@gk13
Author

gk13 commented May 30, 2017

Yes, I verified that too; it's tab-separated :)

@gk13
Author

gk13 commented May 30, 2017

The same function worked for an 8 GB version of the file.

@jreback
Contributor

jreback commented May 30, 2017

@gk13 you would have to show more code. It is certainly possible that the reading part is fine, but your chunk processing blows up memory.

@gk13
Author

gk13 commented May 30, 2017

I've updated the code above. It blows up after running foo 124 times.

@jreback
Contributor

jreback commented May 30, 2017

Using inplace keeps the reference around; you can gc.collect(), or better yet, don't use inplace at all.
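For reference, a minimal sketch of the non-inplace variant being suggested, reusing the foo from the original post (not the poster's exact fix):

def foo(table, exceptions):
    # fillna/astype without inplace return new objects rather than mutating
    # a view, so no extra reference to the original column data is kept
    # alive between chunks.
    for c in table:
        if c in exceptions:
            continue
        if str(table[c].dtype) != 'category':
            table[c] = table[c].fillna('null').astype('category')
    return table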

@gk13
Author

gk13 commented May 30, 2017

I tried gc.collect() before returning from foo; it didn't help.

@gk13
Author

gk13 commented May 30, 2017

Any other suggestions?

@gfyoung gfyoung added the IO CSV (read_csv, to_csv) and Low-Memory labels Aug 28, 2017
@gfyoung
Member

gfyoung commented Aug 28, 2017

@gk13: I'm in agreement with @TomAugspurger that your file could be malformed, as you have not been able to show that you can read it by any other means (then again, what better way is there to do it than with pandas 😄).

Why don't you do this:

Instead of reading the entire file into memory, pass in iterator=True with a specified chunksize. Using the returned iterator, call .read() multiple times and see what you get with each chunk. Then we can confirm whether your file is in fact formatted correctly.
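A minimal sketch of that debugging approach (the path is hypothetical; iterating the returned TextFileReader yields one DataFrame per chunk, much like calling .get_chunk() repeatedly):

import pandas as pd

reader = pd.read_csv('data.tsv', sep='\t', iterator=True, chunksize=1_000_000)
for i, chunk in enumerate(reader):
    # Print each chunk's shape and memory footprint as it arrives; a malformed
    # region of the file tends to show up as a chunk whose memory use jumps.
    print(i, chunk.shape, chunk.memory_usage(deep=True).sum())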

@silva-luana

I've solved the memory error problem using chunks AND low_memory=False

import pandas as pd

chunksize = 100000
chunks = []
# Read the file chunk by chunk, then stitch the pieces back together.
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize, low_memory=False):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)

@stock-ds

stock-ds commented Mar 5, 2019

I've solved the memory error problem using smaller chunks (size 1). It was about 3x slower, but it didn't error out. low_memory=False didn't work for me.

chunksize = 1
chunks = []
for chunk in pd.read_csv('OFMESSAGEARCHIVE.csv', chunksize=chunksize):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)

@MHUNCHO

MHUNCHO commented Oct 10, 2019

What does axis=0 do?

@stock-ds

axis=0 - append/stack new rows
axis=1 - append new columns side by side
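A tiny illustration of the difference (not from the thread):

import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

rows = pd.concat([a, b], axis=0)  # 4 rows, 1 column: chunks stacked vertically
cols = pd.concat([a, b], axis=1)  # 2 rows, 2 columns: frames placed side by side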

@mroeschke
Member

Seems like the debugging efforts in the original question stalled, while others have had success using concat with chunks and low_memory=False. Will tag as a usage question and close.

@16bc

16bc commented Apr 11, 2022

I have the same problem with a big CSV file (~10 GB).
As the current row number increases, pandas consumes more and more memory.
In the example below, the process is terminated with exit code 137 without ever reaching print().

for chunk in pd.read_csv(filename, sep='\t', quoting=3, chunksize=100,
                         on_bad_lines='warn', engine='python',
                         skiprows=332_200_000,  # for my 32 GB mem & 2 GB swap, this value is my error threshold
                         na_values=r'\N'):
    print("test")

I found that it's not the file size that matters, but the row number at which a given amount of memory overflows: different files produce the error at the same row numbers. This also rules out the incorrect-parsing case.
The chunksize doesn't matter, and changing the engine doesn't solve it.

@micomahesh1982

I have the same scenario and am getting "Python Error: <>, exitCode: <139>". Does anybody have a resolution? Kindly help me.
