Implement BaseOffset in tslibs.offsets #18016
pandas/_libs/tslibs/offsets.pyx
Outdated
    'hours', 'minutes', 'seconds', 'milliseconds', 'microseconds'
])


def _determine_offset(kwds):
At the moment this is a method of `DateOffset` that only gets called in `__init__`.
@@ -206,3 +271,109 @@ class ApplyTypeError(TypeError):
# TODO: unused. remove?
class CacheableOffset(object):
    _cacheable = True


class BeginMixin(object):
`BeginMixin` and `EndMixin` are new; each has only the one method. At the moment these methods are in `DateOffset`, but they are only used by a small handful of `FooBegin` and `BarEnd` subclasses.
def __neg__(self):
    # Note: we are deferring directly to __mul__ instead of __rmul__, as
    # that allows us to use methods that can go in a `cdef class`
    return self * -1
In the status quo, `__neg__` is defined as `return self.__class__(-self.n, normalize=self.normalize, **self.kwds)`. By deferring to `__mul__`, we move away from the `self.kwds` pattern. Ditto for `copy`.
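To make the difference concrete, here is a minimal sketch of `__neg__` deferring to `__mul__`. `ToyOffset` is a hypothetical stand-in, not the pandas class, and it assumes `n` and `normalize` are the only state that needs to survive the operation.

```python
class ToyOffset(object):
    """Hypothetical stand-in for an offset; not the pandas implementation."""

    def __init__(self, n=1, normalize=False):
        self.n = n
        self.normalize = normalize

    def __mul__(self, other):
        # Rebuild from explicit attributes instead of a catch-all kwds dict.
        if not isinstance(other, int):
            return NotImplemented
        return type(self)(n=self.n * other, normalize=self.normalize)

    def __neg__(self):
        # Defer to __mul__ rather than calling the constructor with **kwds.
        return self * -1


off = ToyOffset(n=3, normalize=True)
neg = -off
```

With this shape, only `__mul__` has to know how to reconstruct the object, so `__neg__` no longer touches `self.kwds` at all.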
pandas/tests/tseries/test_offsets.py
Outdated
@@ -41,6 +41,8 @@
from pandas.tseries.holiday import USFederalHolidayCalendar


data_dir = tm.get_data_path()
Moving this call up here ensures that we get the same `data_dir` whether running the tests via pytest or interactively. Under the status quo, copy/pasting the pertinent test below will fail because `get_data_path` will not behave as expected.
huh? we use this pattern everywhere, why are you changing this?
Because when I try to run these tests interactively and copy/paste the contents of a test function, `tm.get_data_path` returns unexpected results depending on `os.getcwd()`. AFAICT when run non-interactively it behaves as if cwd is `pandas/tests/tseries`.
what do you mean 'interactively'? you should simply be running `pytest pandas/tests/...... -k ...` or whatever; that is the idiomatic way to run tests.
When a test fails and I want to figure out why, I run the contents of the test manually in the REPL.
Happy to revert this change; not that big a deal.
yes pls revert.
standard way to run tests is `pytest path/to/test -k optional_regex`
lots of options, including `--pdb` to drop into the debugger
pls revert, this is non-standard
Codecov Report
@@ Coverage Diff @@
## master #18016 +/- ##
==========================================
- Coverage 91.23% 91.21% -0.03%
==========================================
Files 163 163
Lines 50091 50032 -59
==========================================
- Hits 45703 45636 -67
- Misses 4388 4396 +8
Codecov Report
@@ Coverage Diff @@
## master #18016 +/- ##
==========================================
+ Coverage 91.28% 91.41% +0.12%
==========================================
Files 163 163
Lines 50130 50073 -57
==========================================
+ Hits 45761 45772 +11
+ Misses 4369 4301 -68
# ---------------------------------------------------------------------
# Base Classes

class _BaseOffset(object):
why are you creating a base class here? what is the purpose?
IOW why not simply have 1 Base class (and not a _BaseOffset and a BaseOffset)
See comments about remaining cython/pickle issues.
You're absolutely right that in its current form having two separate classes accomplishes nothing. The idea is that `_BaseOffset` should be a `cdef class`, while `BaseOffset` should be a python class. (`__rfoo__` methods do not play nicely with cython classes).
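The two-layer split described above can be sketched in plain Python. The class names below are hypothetical, and the lower layer is an ordinary class standing in for what would become a `cdef class`; the point is only where the reflected operator lives.

```python
class _BaseOffsetSketch(object):
    # Stand-in for the layer intended to become a `cdef class`:
    # holds state and forward operators only.
    def __init__(self, n=1):
        self.n = n

    def __mul__(self, other):
        if not isinstance(other, int):
            return NotImplemented
        return type(self)(n=self.n * other)


class BaseOffsetSketch(_BaseOffsetSketch):
    # Plain Python layer: reflected operators are defined here,
    # and simply forward to the inherited forward operator.
    def __rmul__(self, other):
        return self.__mul__(other)


left = BaseOffsetSketch(3) * 2
right = 2 * BaseOffsetSketch(3)
```

Both `left` and `right` come back as `BaseOffsetSketch` instances because `__mul__` rebuilds via `type(self)`.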
ok that is fine.
I would probably leave this as a class for the moment. I am not convinced this actually needs to be a full c-extension class (e.g. it's not like we are inheriting from a python c-class here). I don't see the benefit and it has added complexity.
The main reason is to achieve immutability. That's the big roadblock between us and making `__eq__`, `__ne__`, `__hash__` performance not-awful. (There's an issue somewhere about "scalar types immutable" or something like that)
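As a rough illustration of why immutability helps here: once an object cannot change, its hash can be computed a single time and reused, and equality can compare a few fixed fields. The sketch below is a pure-Python approximation with a hypothetical `FrozenOffset` class; it is not how pandas or a `cdef class` would enforce immutability.

```python
class FrozenOffset(object):
    """Hypothetical immutable offset with a precomputed hash."""

    __slots__ = ("_n", "_normalize", "_hash")

    def __init__(self, n=1, normalize=False):
        # object.__setattr__ bypasses the guard below during construction.
        object.__setattr__(self, "_n", n)
        object.__setattr__(self, "_normalize", normalize)
        object.__setattr__(
            self, "_hash", hash((type(self).__name__, n, normalize))
        )

    def __setattr__(self, name, value):
        raise AttributeError("FrozenOffset is immutable")

    def __eq__(self, other):
        if type(other) is not type(self):
            return NotImplemented
        return self._n == other._n and self._normalize == other._normalize

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Safe to reuse: the fields the hash depends on can never change.
        return self._hash


a = FrozenOffset(2)
b = FrozenOffset(2)
c = FrozenOffset(3)
```

Here `a` and `b` compare equal and share a hash, so they collapse to one entry in a set or dict, which is the kind of caching the comment has in mind.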
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
pandas/tseries/offsets.py
Outdated
 @classmethod
 def _from_name(cls, suffix=None):
     # default _from_name calls cls with no args
     if suffix:
-        raise ValueError("Bad freq suffix {suffix}".format(suffix=suffix))
+        raise ValueError("Bad freq suffix %s" % suffix)
revert, we are moving towards new style string formatting
Woops, copy/paste from an older version. Will revert.
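For reference, the two formatting styles under discussion produce the same message for a plain string suffix. The helper names below are hypothetical and exist only to put the two spellings side by side.

```python
def bad_suffix_message_new(suffix):
    # New-style formatting, the direction the review asks for.
    return "Bad freq suffix {suffix}".format(suffix=suffix)


def bad_suffix_message_old(suffix):
    # Old-style %-formatting, shown only for comparison.
    return "Bad freq suffix %s" % suffix


new_msg = bad_suffix_message_new("XYZ")
old_msg = bad_suffix_message_old("XYZ")
```

One practical difference: if `suffix` were ever a tuple, the `%` form would try to unpack it as multiple arguments, while `.format` with a named field would not.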
def _should_cache(self):
    return self.isAnchored() and self._cacheable

def __repr__(self):
side note, the repr is currently used for hashing, but instead should simply define `__hash__` I think.
`__hash__` is defined using `_params()` which is the god-awful slow thing we need to get rid of.
small comments, and rebase
@jreback For triaging purposes, this is the only one of my PRs that is blocking non-refactoring work.
from pandas._libs.tslib import pydt_to_i8

from frequencies cimport get_freq_code
update setup.py for this
lgtm ping on green
TestClipboard.test_round_trip_valid_encodings, otherwise green. Will push a dummy commit anyway.
Ping
thanks!
This moves a handful of methods of `DateOffset` up into `tslibs.offsets.BaseOffset`. The focus for now is on arithmetic methods that do not get overridden by subclasses. These use the `self.__class__(..., **self.kwds)` pattern that we eventually need to get rid of. Isolating this pattern before suggesting alternatives.

The `_BaseOffset` class was intended to be a `cdef` class, but that leads to errors in `test_pickle_v0_15_2` that I haven't figured out yet. Once that gets sorted out, we can make `DateOffset` immutable and see some real speedups via caching.

See other comments in-line.