Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement BaseOffset in tslibs.offsets #18016

Merged
merged 15 commits into from
Nov 8, 2017

Conversation

jbrockmendel
Copy link
Member

This moves a handful of methods of DateOffset up into tslibs.offsets.BaseOffset. The focus for now is on arithmetic methods that do not get overridden by subclasses. These use the self.__class__(..., **self.kwds) pattern that we eventually need to get rid of. Isolating this pattern before suggesting alternatives.

The _BaseOffset class was intended to be a cdef class, but that leads to errors in test_pickle_v0_15_2 that I haven't figured out yet. Once that gets sorted out, we can make DateOffset immutable and see some real speedups via caching.

See other comments in-line.

'hours', 'minutes', 'seconds', 'milliseconds', 'microseconds'
])

def _determine_offset(kwds):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment this is a method of DateOffset that only gets called in __init__.

@@ -206,3 +271,109 @@ class ApplyTypeError(TypeError):
# TODO: unused. remove?
class CacheableOffset(object):
_cacheable = True


class BeginMixin(object):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BeginMixin and EndMixin are new, each only have the one method. At the moment these methods are in DateOffset, but they are only used by a small handful of FooBegin and BarEnd subclasses.

def __neg__(self):
# Note: we are defering directly to __mul__ instead of __rmul__, as
# that allows us to use methods that can go in a `cdef class`
return self * -1
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the status quo __neg__ is defined as return self.__class__(-self.n, normalize=self.normalize, **self.kwds). By deferring to __mul__, we move away from the self.kwds pattern. Ditto for copy.

@@ -41,6 +41,8 @@
from pandas.tseries.holiday import USFederalHolidayCalendar


data_dir = tm.get_data_path()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this call to up here ensures that we get the same data_dir whether running the tests via pytest or interactively. Under the status quo, copy/pasting the pertinent test below will fail because get_data_path will not behave as expected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? we use this pattern everywhere, why are you changing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because when I try to run these tests interactively and copy/paste the contents of a test function, tm.get_data_path returns unexpected results depending on os.getcwd(). AFAICT when run non-interactively it behaves as if cwd is pandas/tests/tseries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean 'interactively'? you should simply be running

pytest pandas/tests/...... -k ... or whatever that is the idiomatic way to run tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a test fails and I want to figure out why, I run the contents of the test manually in the REPL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to revert this change; not that big a deal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes pls revert.

standard way to run tests is

pytest path/to/test -k optional_regex

lots of options, including --pdb to drop into the debuger

pls revert this is non-standard

@codecov
Copy link

codecov bot commented Oct 29, 2017

Codecov Report

Merging #18016 into master will decrease coverage by 0.02%.
The diff coverage is 93.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18016      +/-   ##
==========================================
- Coverage   91.23%   91.21%   -0.03%     
==========================================
  Files         163      163              
  Lines       50091    50032      -59     
==========================================
- Hits        45703    45636      -67     
- Misses       4388     4396       +8
Flag Coverage Δ
#multiple 89.02% <93.75%> (-0.02%) ⬇️
#single 40.22% <93.75%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 97.11% <93.75%> (-0.05%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2d0d1b...8eb131e. Read the comment docs.

@codecov
Copy link

codecov bot commented Oct 29, 2017

Codecov Report

Merging #18016 into master will increase coverage by 0.12%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18016      +/-   ##
==========================================
+ Coverage   91.28%   91.41%   +0.12%     
==========================================
  Files         163      163              
  Lines       50130    50073      -57     
==========================================
+ Hits        45761    45772      +11     
+ Misses       4369     4301      -68
Flag Coverage Δ
#multiple 89.21% <100%> (+0.14%) ⬆️
#single 40.32% <100%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/tseries/offsets.py 97.11% <100%> (-0.05%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.05%) ⬇️
pandas/core/internals.py 94.54% <0%> (+0.07%) ⬆️
pandas/io/formats/format.py 96.01% <0%> (+0.07%) ⬆️
pandas/core/panel.py 97.28% <0%> (+0.28%) ⬆️
pandas/core/common.py 93% <0%> (+1.82%) ⬆️
pandas/core/generic.py 95.72% <0%> (+3.3%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a2d0eed...61bf134. Read the comment docs.

@jbrockmendel
Copy link
Member Author

taskset 4 asv continuous -E virtualenv -f 1.1 master HEAD -b timeseries
[...]
        before           after         ratio
     [b2d0d1bf]       [a728d995]
-     56.4±0.07μs       50.7±0.3μs     0.90  timeseries.SemiMonthOffset.time_begin_decr
-       152±0.7ms        135±0.5ms     0.89  timeseries.ToDatetime.time_iso8601_tz_spaceformat
-     1.79±0.01ms      1.54±0.01ms     0.86  frame_methods.frame_assign_timeseries_index.time_frame_assign_timeseries_index

# ---------------------------------------------------------------------
# Base Classes

class _BaseOffset(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you creating a base class here? what is the purpose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW why not simply have 1 Base class (and not a _BaseOffset and a BaseOffset)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments about remaining cython/pickle issues.

You're absolutely right that in its current form having two separate classes accomplishes nothing. The idea is that _BaseOffset should be a cdef class, while BaseOffset should be python class. (__rfoo__ methods do not play nicely with cython classes).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably leave this as a class for the moment. I am not convinced this actually needs to be a full c-extension class (e.g. its not like we are inheriting from a python c-class here). I don't see the benefit and it has added complexity.

Copy link
Member Author

@jbrockmendel jbrockmendel Oct 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason is to achieve immutability. That's the big roadblock between us and making __eq__, __ne__, __hash__ performance not-awful. (There's an issue somewhere about "scalar types immutable" or something like that)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -41,6 +41,8 @@
from pandas.tseries.holiday import USFederalHolidayCalendar


data_dir = tm.get_data_path()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? we use this pattern everywhere, why are you changing this?

@classmethod
def _from_name(cls, suffix=None):
# default _from_name calls cls with no args
if suffix:
raise ValueError("Bad freq suffix {suffix}".format(suffix=suffix))
raise ValueError("Bad freq suffix %s" % suffix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert, we are moving towards new style string formatting

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, copy/paste from an older version. Will revert.

@jreback jreback added Frequency DateOffsets Internals Related to non-user accessible pandas implementation labels Oct 29, 2017
def _should_cache(self):
return self.isAnchored() and self._cacheable

def __repr__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note, the repr is currently used for hashing, but instead should simply define __hash__ I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__hash__ is defined using _params() which is the god-awful slow thing we need to get rid of.

@jreback
Copy link
Contributor

jreback commented Nov 3, 2017

small comments, and rebase

@jbrockmendel
Copy link
Member Author

@jreback For triaging purposes, this is the only one of my PRs that is blocking non-refactoring work.

from pandas._libs.tslib import pydt_to_i8

from frequencies cimport get_freq_code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update setup.py for this

@jreback jreback added this to the 0.22.0 milestone Nov 7, 2017
@jreback
Copy link
Contributor

jreback commented Nov 7, 2017

lgtm ping on green

@jbrockmendel
Copy link
Member Author

TestClipboard.test_round_trip_valid_encodings, otherwise green. Will push a dummy commit anyway.

@jbrockmendel
Copy link
Member Author

Ping

@jreback jreback merged commit d3d60f8 into pandas-dev:master Nov 8, 2017
@jreback
Copy link
Contributor

jreback commented Nov 8, 2017

thanks!

watercrossing pushed a commit to watercrossing/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
@jbrockmendel jbrockmendel deleted the tslibs-offsets3 branch December 8, 2017 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Frequency DateOffsets Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants