PERF: speed up concat on Series by skipping unnecessary DataFrame creation #23404
Hello @qwhelan! Thanks for updating the PR.
Comment last updated on November 02, 2018 at 00:32 Hours UTC
Codecov Report
@@           Coverage Diff           @@
##           master   #23404   +/-   ##
=======================================
  Coverage   92.22%   92.22%
=======================================
  Files         161      161
  Lines       51191    51191
=======================================
  Hits        47210    47210
  Misses       3981     3981
so this actually affects perf? pls show a before and after
@jreback I was waiting for
8816e4b to b464b65
this change just obfuscates things for very very little gain here. This is a constant time operation that we use all over the codebase.
You can try to make this a @staticmethod instead if you really want to get this in.
@qwhelan doing this just on Py3 also feels hacky. I think everything in

diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index db10494f0..14edc1d7f 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -358,18 +358,19 @@ class NDFrame(PandasObject, SelectionMixin):
         d.update(kwargs)
         return cls(data, **d)

-    def _get_axis_number(self, axis):
-        axis = self._AXIS_ALIASES.get(axis, axis)
+    @classmethod
+    def _get_axis_number(cls, axis):
+        axis = cls._AXIS_ALIASES.get(axis, axis)
         if is_integer(axis):
-            if axis in self._AXIS_NAMES:
+            if axis in cls._AXIS_NAMES:
                 return axis
         else:
             try:
-                return self._AXIS_NUMBERS[axis]
+                return cls._AXIS_NUMBERS[axis]
             except KeyError:
                 pass
         raise ValueError('No axis named {0} for object type {1}'
-                         .format(axis, type(self)))
+                         .format(axis, cls))

     def _get_axis_name(self, axis):
         axis = self._AXIS_ALIASES.get(axis, axis)

Will you see if the same change can be made to
Then, will you ensure that we have basic tests for using all those methods on classes (as well as instances).
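For readers following along, here is a minimal sketch of what the suggested classmethod conversion and the class-vs-instance check could look like. It is not pandas code: `ToyFrame`, its axis tables, and `check_class_and_instance` are hypothetical stand-ins, and `isinstance(axis, int)` is assumed in place of pandas' `is_integer`.

```python
class ToyFrame:
    """Hypothetical stand-in for NDFrame; the tables below are illustrative only."""
    _AXIS_ALIASES = {"rows": 0}
    _AXIS_NUMBERS = {"index": 0}
    _AXIS_NAMES = {0: "index"}

    @classmethod
    def _get_axis_number(cls, axis):
        # Only class-level tables are consulted, so no instance is required.
        axis = cls._AXIS_ALIASES.get(axis, axis)
        if isinstance(axis, int):  # assumption: stands in for is_integer()
            if axis in cls._AXIS_NAMES:
                return axis
        else:
            try:
                return cls._AXIS_NUMBERS[axis]
            except KeyError:
                pass
        raise ValueError("No axis named {0} for object type {1}"
                         .format(axis, cls))


def check_class_and_instance(klass, valid_axes):
    """Assert that calling on the class and on an instance agree."""
    obj = klass()
    for axis in valid_axes:
        assert klass._get_axis_number(axis) == obj._get_axis_number(axis)


check_class_and_instance(ToyFrame, [0, "index", "rows"])
```

A classmethod (rather than a staticmethod) is used in the sketch because the lookup needs the class-level tables, which a subclass may override.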
@jreback Yeah, I'm aware this change is trivial; I'm just using it as an excuse to get my dev setup fully working for some more substantial perf patches that will be coming soon.
@TomAugspurger Thanks, I'll check into it.
@TomAugspurger I implemented the
There's decent existing test coverage, but I added an explicit test of class vs instance invocations for all valid possible calls for good measure.
Thanks, LGTM, though you have a linting error:
@TomAugspurger Thanks, resolved.
small comment, ping on green.
lgtm. ping on green.
@jreback All green, thanks!
so we're down to 2x. Let me know if you're interested in pushing this further @qwhelan (in a separate PR).
thanks @qwhelan
@TomAugspurger Yeah, I'm looking at some related issues if you'd like to keep that issue open for now.
Removes an unnecessary DataFrame creation when dealing solely with Series objects, which reduces runtime of concat.
git diff upstream/master -u -- "*.py" | flake8 --diff
Performance Comparison
Baseline
After
And the excised code itself:
So roughly 40% of the runtime was being spent mapping axis=0 -> axis=0.
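As a rough illustration of how the per-call cost of such an axis lookup could be measured in isolation, here is a small `timeit` sketch. It does not reproduce the benchmark from this PR: `ToyAxes` and `time_axis_lookup` are hypothetical names, the lookup is a simplified stand-in for `_get_axis_number`, and the timings will vary by machine.

```python
import timeit


class ToyAxes:
    """Hypothetical stand-in; not the pandas implementation."""
    _AXIS_ALIASES = {"rows": 0}
    _AXIS_NAMES = {0: "index"}

    @classmethod
    def _get_axis_number(cls, axis):
        # Simplified lookup: resolve an alias, then validate the integer axis.
        axis = cls._AXIS_ALIASES.get(axis, axis)
        if axis in cls._AXIS_NAMES:
            return axis
        raise ValueError("No axis named {0}".format(axis))


def time_axis_lookup(number=100000):
    """Return the seconds taken by `number` calls mapping axis=0 to axis=0."""
    return timeit.timeit(lambda: ToyAxes._get_axis_number(0), number=number)


if __name__ == "__main__":
    elapsed = time_axis_lookup()
    print("100000 lookups took {0:.4f} s".format(elapsed))
```

Comparing such a number against the total time of the surrounding operation gives a feel for how much of the runtime a trivial mapping can consume when it sits on a hot path.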