Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate out strptime.pyx from tslib #17342

Merged
merged 16 commits into from
Sep 25, 2017
Merged

Conversation

jbrockmendel
Copy link
Member

This is the 2nd of an N part series of PRs to split tslib into independent modules.

At the moment there is a big chunk of code at the bottom of tslib that looks like it was pasted in from somewhere else. The header for that section of the file reads # Don't even ask. So I won't.

The new tslibs.strptime only used in one place: array_strptime is called in tools.datetimes. Other than that, nothing needs to be exposed, and nothing else in tslib relies on it.

The one function from tslib that strptime does need is _check_dts_bounds. This (mostly) moves that up to datetime.pxd, which both tslib and strptime already import anyway.

This is mostly a copy/paste of the existing functions+classes. I cleaned up a couple of places where variables used camelCase.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@chris-b1
Copy link
Contributor

Please don't make changes and move things in the same commit - it completely hides the changes when we're reviewing - thanks!

@jreback
Copy link
Contributor

jreback commented Aug 26, 2017

what is the rationale for moving this in the first place?

@gfyoung gfyoung added Clean Internals Related to non-user accessible pandas implementation labels Aug 26, 2017
@codecov
Copy link

codecov bot commented Aug 26, 2017

Codecov Report

Merging #17342 into master will decrease coverage by 0.04%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17342      +/-   ##
==========================================
- Coverage   91.03%   90.99%   -0.05%     
==========================================
  Files         162      162              
  Lines       49567    49567              
==========================================
- Hits        45125    45103      -22     
- Misses       4442     4464      +22
Flag Coverage Δ
#multiple 88.77% <ø> (-0.03%) ⬇️
#single 40.24% <ø> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 63.23% <0%> (-1.82%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.23% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 473a7f3...262da03. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 26, 2017

Codecov Report

Merging #17342 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17342      +/-   ##
==========================================
- Coverage   91.26%   91.24%   -0.02%     
==========================================
  Files         163      163              
  Lines       49776    49807      +31     
==========================================
+ Hits        45426    45447      +21     
- Misses       4350     4360      +10
Flag Coverage Δ
#multiple 89.04% <100%> (ø) ⬆️
#single 40.3% <50%> (-0.04%) ⬇️
Impacted Files Coverage Δ
pandas/_libs/__init__.py 100% <ø> (ø) ⬆️
pandas/core/tools/datetimes.py 85.24% <100%> (+0.04%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/accessor.py 93.65% <0%> (-6.35%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.15%) ⬇️
pandas/core/series.py 94.93% <0%> (ø) ⬆️
pandas/core/indexes/category.py 97.74% <0%> (ø) ⬆️
pandas/core/categorical.py 95.28% <0%> (ø) ⬆️
pandas/io/parsers.py 95.48% <0%> (+0.02%) ⬆️
pandas/core/generic.py 92.05% <0%> (+0.05%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d43aba8...5478aa2. Read the comment docs.

@jbrockmendel
Copy link
Member Author

what is the rationale for moving this in the first place?

In order of "most specific to this PR" to "most general to this sequence of PRs":

  1. It is not immediately obvious that strptime is independent of the rest of tslib. Separating it out makes both pieces easier to maintain. DAGs FTW.

  2. tslib is huge and gets imported in a whole bunch of places, often times just for NaT. Making that import more light-weight is a win.

  3. Over at dateutil we're talking about the next big push, and I'm advocating for making things more pandas-friendly. e.g. see tslib.dateutil_parse docstring """ lifted from dateutil to get resolution""". Having dateutil return this resolution information would make this workaround unnecessary, improving performance and code complexity.
    To that end, I'm trying to centralize the date parsing code that happens in a handful of places around pandas (core.tools.datetimes, _libs.tslib, _libs/src/inference.pyx) and separate it out into module(s) with minimal intra-pandas dependencies.

  4. A big part of the goal for this series of PRs is to disentangle the dependencies between tslib, period, and tseries, as tseries is a big performance bottleneck (in particular tseries.offsets.DateOffset, see Make some attributes of DateOffset read-only; add tests #17181 ).

  5. See this comment in _libs.__init__:

# TODO
# period is directly dependent on tslib and imports python
# modules, so exposing Period as an alias is currently not possible

One of the big goals of this sequence of PRs is handling this.

@jbrockmendel
Copy link
Member Author

Just pushed a commit to remove cython "non-standard" cython decorators and revert a camelCase fix. i.e. with the exception of a couple of flake8 whitespace changes, this should be a cut/paste of the existing functions.

@jreback
Copy link
Contributor

jreback commented Sep 7, 2017

this will need a rebase after #17422



cdef inline check_dts_bounds(pandas_datetimestruct *dts):
cdef:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to make this a bint, and just return True/False.

Then make another function which calls this one which actually raises the error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The longer-term solution that I've implemented but not PRed is to move a bunch of datetime.pxd into a pyx file where OutOfBoundsDatetime can be defined. This also ends up clearing out a lot of the setup.py dependencies (orthogonal to this discussion, but still).

Is returning a bint actually any simpler than re-raising? It isn't obvious whether True means "error state is True" or "everything is OK is True", whereas raising is unambiguous.

That said, I'll do it your way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah its ok, just don't like to muddle even more.

error = True

if error:
# Retaining this is a kludge because I haven't figured out how
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

setup.py Outdated
@@ -464,6 +464,8 @@ def pxd(name):
'pandas/_libs/src/period_helper.h',
'pandas/_libs/src/datetime.pxd']

np_dtime_strs_srcs = ['pandas/_libs/src/datetime/np_datetime.c',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these sources are also used else where, just list them for now directly (e.g. like period.pyx and others do it). This doesn't add anything

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. This will lead to flake8 warnings that I'm happy to ignore.

@@ -5356,317 +5074,4 @@ def shift_months(int64_t[:] dtindex, int months, object day=None):

#----------------------------------------------------------------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can take this comment out :<


# def _strptime_time(data_string, format="%a %b %d %H:%M:%S %Y"):
# return _strptime(data_string, format)[0]
from tslibs.strptime import array_strptime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this at all; this is called in exactly 1 place in the code simply call it there. then strptime is decoupled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great. I've been importing things back into tslib with #noqa to keep the namespace unchanged. Is that unnecessary more generally?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might not be fully flaking this type of thing (or maybe we are and have to use noqa). I would like to remove non-essential imports though. These are ll private namespaces.

@pep8speaks
Copy link

pep8speaks commented Sep 14, 2017

Hello @jbrockmendel! Thanks for updating the PR.

  • In the file setup.py, following are the PEP8 issues :

Line 487:80: E501 line too long (84 > 79 characters)
Line 488:80: E501 line too long (94 > 79 characters)

Comment last updated on September 25, 2017 at 00:13 Hours UTC


FUNCTIONS:
_getlang -- Figure out what language is being used for the locale
strptime -- Calculates the time struct represented by the passed-in string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you copy pasted this, but this does not really suite for the module docstring, as this is only for part of the file.

So I would either move this inline to where those functions/classes are defined (as it was in tslib), or update it to reflect the actual file contents

If I understand correctly array_strptime is the only 'public' (for pandas, as used outside this file) function? If so, I would focus on that in the module docstring

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a handful of style issues I'd like to circle back to (e.g. mixed_camel_Case), will add this to the list. For the time being I'm really eager to wrapping this up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really a style issue, but how you copy pasted things. If you want to clean-up later, I would just put this in the middle close of the file close to the relevant code for now (like it was in tslib)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand. Are you OK with this being addressed in a follow-up PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather you fix the docs now. just move these docs down to where they are used, and put array_strptime up front

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel I see you integrated them into the class/functions docstrings, which is good, but I would still include a comment where LocaleTime starts (the place where previously the comment was) to make it clear that this part is some kind of vendored code (it just comes from the standard library: https://github.com/python/cpython/blob/master/Lib/_strptime.py)

@jorisvandenbossche
Copy link
Member

There is some other code related to parsing strings in tslib as well. Does it make sense to move those as well?

@jorisvandenbossche
Copy link
Member

There is some other code related to parsing strings in tslib as well. Does it make sense to move those as well?

Ah, I see there is already #17363 :-)


cdef set _nat_strings = set(['NaT', 'nat', 'NAT', 'nan', 'NaN', 'NAN'])


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move both of these to util.pxd? to avoid repeating code

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_nat_strings and what else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checknull_with_nat, this seems like its repeating in lots of places.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

needs a rebase

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally ok with this. a couple of doc comments. if you can move checknull_with_nat (to avoid repeating code all over the place) to a .pxd great. if not, can do later. but pls make an issue in that case.


FUNCTIONS:
_getlang -- Figure out what language is being used for the locale
strptime -- Calculates the time struct represented by the passed-in string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather you fix the docs now. just move these docs down to where they are used, and put array_strptime up front


cdef set _nat_strings = set(['NaT', 'nat', 'NAT', 'nan', 'NaN', 'NAN'])


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checknull_with_nat, this seems like its repeating in lots of places.

@jbrockmendel
Copy link
Member Author

checknull_with_nat, this seems like its repeating in lots of places

Ultimately both checknull_with_nat and _nat_strings belong in the not-yet-introduced tslibs.nattype. First choice would be to live with this bit of redundancy until that is ready. But putting them in util.pxd (or datetime.pxd) wouldn't be the end of the world.

@jbrockmendel jbrockmendel mentioned this pull request Sep 23, 2017
59 tasks

cdef inline bint _checknull_with_nat(object val):
""" utility to check if a value is a nat or not """
return (val is None or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to the top and add a TODO to consolidate in future

return 1 + days_to_week + day_of_week


# def _strptime_time(data_string, format="%a %b %d %H:%M:%S %Y"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented

if is_coerce:
iresult[i] = NPY_NAT
continue
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth it having a tiny extension util.pyx that houses things like this (and checknull for example)
something that we can import anywhere that isn't a dependency itself

Copy link
Contributor

@jreback jreback Sep 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i bet inference.pyx can be this extension

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inference.pyx can be this extension

? inference.pyx depends on a bunch of stuff, including tslib.

The solution I've used locally is to put this into a tslibs.npy_dtime which has no dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, the name needs work though :>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, the name needs work though :>

OK, as long as it isn't another thing named datetime.

@jreback jreback merged commit 42195db into pandas-dev:master Sep 25, 2017
@jreback
Copy link
Contributor

jreback commented Sep 25, 2017

thanks!

@jreback jreback added this to the 0.21.0 milestone Sep 25, 2017
@jbrockmendel jbrockmendel deleted the strptime branch October 30, 2017 16:23
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants