ENH: Implemented lazy iteration #20796
Conversation
Looks like the 2.7 failures are relevant. This would need a release note. How's the performance when actually iterating? E.g.:

df = pd.DataFrame({"A": np.arange(100000)})
list(iter(df.itertuples()))
Yea, I think I just managed to fix those. I had to import
Where does the release note go? Any instructions anywhere?
Memory-wise: great. It just has to hold one row at a time in memory. CPU-wise: it looks around 2-3x slower than the old version. Oh, and of course, this version immediately starts returning values, while before you had to wait for everything to finish. So latency is much lower, while overall time is around 2-3x.
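To make the latency point concrete, here is a minimal, self-contained sketch (not pandas code; `eager_rows` and `lazy_rows` are hypothetical stand-ins) contrasting an iterator that materializes everything up front with one that yields rows on demand:

```python
import time

def eager_rows(n):
    # Stand-in for the old itertuples: pay the full materialization
    # cost before the first row is available.
    time.sleep(0.1)  # simulate building all n rows up front
    return iter([(i,) for i in range(n)])

def lazy_rows(n):
    # Stand-in for the lazy itertuples: each row is produced on demand,
    # so the first row is available almost immediately.
    for i in range(n):
        yield (i,)

start = time.perf_counter()
next(eager_rows(1000))
eager_first = time.perf_counter() - start

start = time.perf_counter()
next(lazy_rows(1000))
lazy_first = time.perf_counter() - start

# First-row latency is much lower for the lazy version, even if the
# total time over all rows can be somewhat higher.
assert lazy_first < eager_first
```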
Codecov Report
@@            Coverage Diff            @@
##           master    #20796    +/-   ##
==========================================
+ Coverage    92.3%     92.3%    +<.01%
==========================================
  Files         163       163
  Lines       51943     51947        +4
==========================================
+ Hits        47946     47951        +5
+ Misses       3997      3996        -1

Continue to review full report at Codecov.
Correction. I was testing before just iterating directly on
I used this to test how long it takes to create a list of all tuples, how long it takes to get the first tuple, and how long it takes to get only simple tuples:

import numpy as np
import pandas as pd
import time

print(pd.__path__)

def perf(f):
    start = time.perf_counter()
    f()
    end = time.perf_counter()
    return end - start

def a():
    list(iter(df.itertuples()))

def b():
    next(iter(df.itertuples()))

def c():
    list(iter(df.itertuples(index=False, name=None)))

df = pd.DataFrame({"A": np.arange(10000000)})
print(sum(perf(a) for i in range(10)) / 10)
print(sum(perf(b) for i in range(10)) / 10)
print(sum(perf(c) for i in range(10)) / 10)

Old version:
New version:
It is interesting. It seems this even makes the regular (Series) iteration slightly faster, while the simple-tuple version is a bit slower. I could reproduce these results even after trying a few more times.
If you want to measure performance, we also have an asv performance benchmark for this method here: pandas/asv_bench/benchmarks/frame_methods.py, lines 106 to 108 in add3fbf.
Here's a guide on how to run the asv benchmarks to evaluate performance changes.
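For reference, asv benchmarks are plain Python classes whose `time_*` methods get timed; a hypothetical sketch in the style of `frame_methods.py` (the class and method names here are illustrative, not the actual pandas ones) might look like:

```python
import numpy as np
import pandas as pd

class Itertuples:
    # asv calls setup() before timing each time_* method.
    def setup(self):
        self.df = pd.DataFrame(np.random.randn(1000, 10))

    def time_itertuples(self):
        # total time to consume the whole iterator
        for row in self.df.itertuples():
            pass

    def time_itertuples_first(self):
        # latency until the first row is available
        next(self.df.itertuples())
```

asv then reports each `time_*` method separately, which lets the list-everything cost and the first-row latency be tracked independently.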
pandas/core/frame.py (Outdated)

      fields.append("Index")

      # use integer indexing because of possible duplicate column names
-     arrays.extend(self.iloc[:, k] for k in range(len(self.columns)))
+     iterators.extend(self.iloc[:, k] for k in range(len(self.columns)))
I believe iterators can also be evaluated lazily by using itertools.chain. Something along the lines of:

iterators = itertools.chain([self.index], (self.iloc[:, k] for k in range(len(self.columns))))
Yea, but I do not think this is necessary, because those are per-column and have to be evaluated immediately afterwards anyway, even for the first row. So you would just move forcing the execution from one line in the function to another line in the same function. Later on we do zip(*iterators), which forces iterators to be made into a list. If we zipped things manually, then we could avoid that, but passing *iterators forces the list. It is also better that any exception from evaluating columns is thrown here and not later inside that other try/except.
So I do not think this is necessary.
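The point about `*iterators` forcing evaluation can be seen in a small sketch (the `make_column` helper is hypothetical, standing in for building one column's iterator): the chain itself is lazy, but star-unpacking it into `zip` consumes it all before the first row is produced.

```python
import itertools

made = []

def make_column(name):
    # record when each "column" iterable actually gets built
    made.append(name)
    return iter(range(3))

# Lazily chained per-column iterators, as in the suggestion above.
cols = itertools.chain((make_column(n) for n in ("a", "b")))

assert made == []  # the chain has not built anything yet

rows = zip(*cols)  # star-unpacking consumes the chain immediately...
assert made == ["a", "b"]  # ...so every column was built before the first row

assert next(rows) == (0, 0)
```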
Tests are failing for some unrelated reason.
I ran benchmarks, but the changes are all over the place, both positive and negative. For the same kind of methods (like rolling), some go up and some go down. I am not sure how stable those benchmarks are. I had to leave it running for a few hours and could not assure complete idleness of the computer. Also, absolute times are just a few ms for many of them. I think this is hard to measure well.
What you can do, if you suspect a given result is a fluke, is to rerun only that benchmarks file, as in:
This should take just a couple of minutes (assuming you just ran another asv test on the same commits), during which you can probably leave your computer idle.
once perf tests are run (and asv are added if needed), would need a whatsnew note.
Force-pushed from ad99292 to 868f8db (compare).
I think I updated everything requested in the code. I am now running benchmarks once more on an idle machine.
There already seems to be an
I added a few more asv tests.
Updated benchmarks:
Just iteration, ran again:
doc/source/whatsnew/v0.23.1.txt (Outdated)

@@ -16,6 +16,8 @@ New features
~~~~~~~~~~~~

- :meth:`Index.droplevel` is now implemented also for flat indexes, for compatibility with MultiIndex (:issue:`21115`)
- Iterating over a :class:`Series` and using :meth:`DataFrame.itertuples` now create iterators without internally
move to 0.24.0 performance section
Done.
@mitar did you run this backwards? Those times have increased.
I ran
I am not sure how to better determine what is happening.
Fixes GH20783.
Revert "…1d_sparse benchmark". This reverts commit e2f892a.
@TomAugspurger - Thanks! I've rebased to your commit, I suppose we should be good now.
I've rerun the benchmark, I assume a lot of the changes are due to the fix.
can you merge master
can you add a whatsnew note in the perf section
why are these 3 increasing?
The previous implementation would load the entire dataset into Python memory and then iterate over it. Now we iterate over the underlying array, which reduces read speed. I think this is in line with the previous discussion: we are trying to optimize for (initialization) memory use and initialization time rather than total execution time. The to_list benchmarks run
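A rough illustration of that trade-off (generic Python, not pandas internals): materializing all row tuples up front costs memory proportional to the number of rows, while a lazy `zip` holds only one row at a time at the cost of some per-row overhead.

```python
import sys

n = 100_000
cols = [range(n), range(n)]

# Eager: the full list of row tuples exists before iteration starts.
eager = list(zip(*cols))
assert sys.getsizeof(eager) > 100_000  # pointer array alone is ~0.8 MB

# Lazy: only the zip object and the current row are held in memory;
# each step pays a little more overhead, which can raise total runtime.
lazy = zip(*cols)
assert sys.getsizeof(lazy) < 1_000

assert next(lazy) == (0, 0)
assert eager[0] == (0, 0)
```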
Thanks @jreback, @mitar and @TomAugspurger! :)
Fixes GH20783.
git diff upstream/master -u -- "*.py" | flake8 --diff