.iterrows takes too long and generates a large memory footprint #7683
Comments
What are you doing that requires …?
See here for some tips: #7194
This does return a generator. The problem is that since you have mixed dtypes, it has to create a single-dtyped object, BEFORE IT DOES ANYTHING, which takes a lot of time (the zipping doesn't take much).
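The dtype-upcasting cost described above is easy to observe: on a mixed-dtype frame, each row yielded by `iterrows` is a Series coerced to a single common dtype. A minimal sketch (the frame contents here are illustrative, not from the issue):

```python
import pandas as pd

# A small mixed-dtype frame: an integer column plus a string column.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# iterrows yields (index, Series) pairs; with mixed dtypes the row
# Series is upcast to a single dtype (object here), which is the
# expensive conversion the comment refers to.
_, first_row = next(df.iterrows())
print(first_row.dtype)  # object, even though column "a" is int64

# itertuples avoids building a Series per row and keeps the values.
first_tuple = next(df.itertuples())
print(first_tuple.a, first_tuple.b)
```

`itertuples` is the usual lighter-weight alternative when per-row access is all that is needed.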
Profiling shows it has nothing to do with zipping, though it's not about the mixed-dtype data frame either. It's slow when …
You haven't answered the question: why are you using iterrows?
There's a method I want to apply to each row sequentially. The method itself takes some time, so vectorizing it or not doesn't make much difference to running time. I prefer iteration because it gives more control.
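The pattern described above can be sketched as follows; `process` is a hypothetical stand-in for whatever slow per-row function the commenter applies (the actual function is not shown in the thread):

```python
import pandas as pd

def process(row):
    # Hypothetical slow per-row computation; in the real use case the
    # per-row cost dominates, so vectorization would not help much.
    return row["a"] * 2

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Explicit sequential iteration over rows, as the commenter prefers.
results = [process(row) for _, row in df.iterrows()]
print(results)  # [2, 4, 6]
```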
You might try iterating over …. I suppose this could be updated to iterate over the index, rather than all at once (as it loses its identity as an Index and becomes a list). Would you like to submit a pull request for this?
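Iterating over the index, as suggested above, can look like the following sketch: pull scalars per row with `DataFrame.at`, so no per-row Series is materialized and each column keeps its own dtype (the frame contents are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": range(3), "b": ["x", "y", "z"]})

# Walk the index and fetch scalars directly; .at is fast scalar
# access and avoids the per-row Series that iterrows builds.
rows = []
for idx in df.index:
    a_val = df.at[idx, "a"]
    b_val = df.at[idx, "b"]
    rows.append((a_val, b_val))
print(rows)
```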
Sure, I can look into this.
That would be gr8!
PR submitted: #7702
When using df.iterrows on a large data frame, it takes a long time to run and consumes a huge amount of memory. The name of the function implies that it is an iterator and should not take much to run. However, the method uses the builtin zip, which can generate a huge temporary list of tuples if the iteration is not done lazily. Below is the code which can reproduce the issue on a box with 16GB of memory.
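The original reproduction code was not preserved in this copy of the thread. A hypothetical reconstruction of the shape of the problem (the column names and the small `n` here are placeholders; the report used a frame large enough to exhaust 16GB of RAM):

```python
import numpy as np
import pandas as pd

# Placeholder size; the original report used a much larger frame.
n = 1_000

df = pd.DataFrame({
    "ints": np.arange(n),
    "floats": np.random.rand(n),
    "strs": ["row%d" % i for i in range(n)],
})

# iterrows internally zips the index with the row data. On Python 2,
# zip() returns a fully materialized list of tuples rather than a
# lazy iterator, which is the temporary-list blowup the issue
# describes (addressed in PR #7702).
count = sum(1 for _ in df.iterrows())
print(count)
```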