-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Separate parsing functions out from tslib #17363
Conversation
Move parsing functions from _libs/src/inference and core.tools.datetimes
Codecov Report
@@ Coverage Diff @@
## master #17363 +/- ##
==========================================
- Coverage 91.25% 91.24% -0.02%
==========================================
Files 163 163
Lines 49808 49734 -74
==========================================
- Hits 45454 45378 -76
- Misses 4354 4356 +2
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls just move code and make minimal changes. you are adding a bunch of cython decorators which don't actually do anything for a python called function.
overall this looks ok, but its very hard to tell if you changed things (which it is clear you did as you had to change some calling code)
pandas/core/tools/datetimes.py
Outdated
@@ -726,6 +579,11 @@ def parse_time_string(arg, freq=None, dayfirst=None, yearfirst=None): | |||
------- | |||
datetime, datetime/dateutil.parser._result, str | |||
""" | |||
res = tslib.parse_time_string(arg, freq, dayfirst, yearfirst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wrapping belongs in tslib, will be fixed shortly.
@@ -1178,6 +1179,8 @@ class Period(_Period): | |||
value = str(value) | |||
value = value.upper() | |||
dt, _, reso = parse_time_string(value, freq) | |||
if dt is NAT_SENTINEL: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tslibs.parsing
does not have NaT
in the namespace, so it returns NAT_SENTINEL
in places where it otherwise would return NaT
. That should be wrapped in tslib
, will update.
pandas/_libs/tslibs/parsing.pyx
Outdated
@@ -0,0 +1,688 @@ | |||
#!/usr/bin/env python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we add executable pre-amble to any other .pyx files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove.
pandas/_libs/tslibs/parsing.pyx
Outdated
#!/usr/bin/env python | ||
# -*- coding: utf-8 -*- | ||
# cython: profile=False | ||
# cython: linetrace=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment on the purpose of this module
from numpy cimport int64_t, ndarray | ||
np.import_array() | ||
|
||
# Avoid import from outside _libs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separately (:>), consider add ing a compat for .pyx which can be imported (could also simply be a .pyx that is included) to provide things like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this to the Todo list as well
pandas/_libs/tslibs/parsing.pyx
Outdated
# This allows us to reference NaT without having to import it | ||
|
||
|
||
@cython.locals(date_string=object, freq=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is not much point in typing these (e.g. using cython locals) and is just noise.
if you really think things will be better via typing (and you are actually calling from another cdef function), then make this a cpdef (or use a cdef and make the def call the cdef)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove.
pandas/_libs/tslibs/parsing.pyx
Outdated
dt = du_parse(date_string, dayfirst=dayfirst, | ||
yearfirst=yearfirst, **kwargs) | ||
return dt | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line in between clauses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was copy-pasted, will change.
pandas/_libs/tslibs/parsing.pyx
Outdated
yearfirst=yearfirst) | ||
|
||
|
||
@cython.locals(date_string=object, freq=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/_libs/tslibs/parsing.pyx
Outdated
return parsed, parsed, reso | ||
|
||
|
||
@cython.returns(cython.bint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are not typical, again you don't need to type everything. if its a perf issue, pls demonstrate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing version is in tslib
, has this typing information, though not using the decorators. I'll revert from pure-python-mode to be identical to existing for now.
pandas/_libs/tslibs/parsing.pyx
Outdated
return True | ||
|
||
|
||
@cython.returns(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing implementation in tslib
:
cdef inline object _parse_dateabbr_string(object date_string, object default,
object freq):
I don't mind reverting, generally just like not having to wrap lines.
Remove cython decorators Move wrapping of parse_time_string to tslib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am generally ok with this, but left some comments. ping when ready for re-look.
pandas/_libs/tslib.pyx
Outdated
|
||
#---------------------------------------------------------------------- | ||
# Parsing | ||
# Wrap tslibs.parsing functions to return `NaT` instead of `NAT_SENTINEL` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what this comment means here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just removed.
pandas/_libs/tslib.pyx
Outdated
# Wrap tslibs.parsing functions to return `NaT` instead of `NAT_SENTINEL` | ||
|
||
|
||
def parse_time_string(arg, freq=None, dayfirst=None, yearfirst=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this here? it should be in tslib.parsing
, if you need NaT
inside the function there, simply import it (inside the function)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was trying to avoid runtime import; changed in just-pushed update.
pandas/_libs/tslibs/parsing.pyx
Outdated
|
||
# The canonical place for this appears to be in frequencies.pyx. | ||
@cython.returns(object) | ||
@cython.locals(source=object, default=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please please don't use any non-standard cython decorators that are NOT already present in the code base. This introduces too much overhead (when should I use them or not). If you want to introduce them, please do so in another PR that adds them everywhere (or at the very least an issue to keep track of where they should be).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just changed, will make a note of this going forward.
I may revisit this (in a separate PR(s)) after cython 0.27 becomes mainstream. It would be nice to have code be valid python where possible.
again a full run asv would be nice |
Reviewer request to import NaT in parse_time_string
I'll run the whole suite next; here are results from just
|
Full results:
|
so the benchmarks are telling u that something is amiss these are generally petty stable so run the ones that are out of whack and try to narrow it down |
Run again immediately:
Something is amiss, but that is orthogonal to the PR, which is just a refactor. |
it may be, but until these show stable results and we can tell whether something is changed, can't merge these. |
Setting the affinity seems to have made a difference:
Immediate re-run of the ones that showed a change
Of these, the only one that looks related to this PR is |
For each of |
Made this locally long ago; for some reason it is not getting reflected on GH when I push. It's a mystery.
needs a rebase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rebase as well
pandas/_libs/parsing.pyx
Outdated
@@ -0,0 +1,682 @@ | |||
# -*- coding: utf-8 -*- | |||
# cython: profile=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in tslibs/parsing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is it possible that ci is green given that all references in setup.py refer to tslibs/parsing.pyx ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the file is just duplicated in the diff, that's why it still works :-).
@jbrockmendel so this file can just be removed (if they are identical)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To get asv to run a while back I had to move tslibs/parsing.pyx to just parsing.pyx. This is an artifact that shouldn't have gotten pushed. Will remove.
pandas/_libs/parsing.pyx
Outdated
|
||
|
||
class DateParseError(ValueError): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an error that is raised to users or only used internally?
In the first case we should add this to pandas.errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be raised externally. Is currently in tslib
. I think there's been some effort to avoid importing non-cython modules into _libs.foo
pandas/_libs/tslib.pyx
Outdated
_format_is_iso, | ||
_DATEUTIL_LEXER_SPLIT, | ||
_guess_datetime_format, | ||
NAT_SENTINEL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this import needed? (it was not here before)
pandas/_libs/tslib.pyx
Outdated
parse_datetime_string, | ||
parse_time_string, | ||
_does_string_look_like_datetime, | ||
parse_datetime_string_with_reso) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also those ones. Do you just import them to keep them in the namespace of tslib as before? Are they actually used from here somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of these are just for backward compat in case some user somewhere is using them. I'll be happy to remove those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above. don't import these in tslib.pyx unless they are actually used, rather change to directly import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can u address these comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, just pushed. Retained the from tslibs.parsing import DateParseError # noqa
as it is referenced in a handful of places around pandas (and conceivably downstream), removed all the others.
pandas/_libs/parsing.pyx
Outdated
@@ -0,0 +1,682 @@ | |||
# -*- coding: utf-8 -*- | |||
# cython: profile=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the file is just duplicated in the diff, that's why it still works :-).
@jbrockmendel so this file can just be removed (if they are identical)
pandas/_libs/parsing.pyx
Outdated
return res | ||
|
||
|
||
def parse_datetime_string_with_reso(date_string, freq=None, dayfirst=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe not for this PR if we want to keep it 'moving only', but after moving it would make sense to me to combine this with the function above, as this is the only place where it is used
pandas/_libs/parsing.pyx
Outdated
return dt | ||
|
||
|
||
def parse_time_string(arg, freq=None, dayfirst=None, yearfirst=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know what the difference is between this parse_time_string
and the parse_datetime_string
above?
From a superficial look they seem very similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do not tackle that in this PR, would you like to open an issue for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added this to #17652
ace22e0
to
ff57861
Compare
pandas/_libs/src/inference.pyx
Outdated
try_parse_date_and_time, | ||
try_parse_year_month_day, | ||
try_parse_datetime_components) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you actually need to import these here, rather import them directly where they are used
pandas/_libs/tslib.pyx
Outdated
parse_datetime_string, | ||
parse_time_string, | ||
_does_string_look_like_datetime, | ||
parse_datetime_string_with_reso) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above. don't import these in tslib.pyx unless they are actually used, rather change to directly import
class DateParseError(ValueError): | ||
pass | ||
|
||
_nat_strings = set(['NaT', 'nat', 'NAT', 'nan', 'NaN', 'NAN']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_nat_strings should be in datetime.pxd (or maybe in util.pxd)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the recent build-based threads, I'd like to (at least for now) avoid introducing this dependency, especially for something this small. Eventually you're right we'll want to put these constants all in one place.
|
||
cdef set _not_datelike_strings = set(['a', 'A', 'm', 'M', 'p', 'P', 't', 'T']) | ||
|
||
NAT_SENTINEL = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
pandas/_libs/tslibs/parsing.pyx
Outdated
return ret, reso | ||
|
||
|
||
# The canonical place for this appears to be in frequencies.pyx. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you add this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a note to self, can be removed for now. Eventually this function will belong in tslibs.frequencies
.
pandas/_libs/tslibs/parsing.pyx
Outdated
|
||
|
||
#---------------------------------------------------------------------- | ||
# Miscellaneous functions moved from core.tools.datetimes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a useful comment (the miscellanous is fine)
setup.py
Outdated
@@ -485,6 +486,7 @@ def pxd(name): | |||
'sources': ['pandas/_libs/src/datetime/np_datetime.c', | |||
'pandas/_libs/src/datetime/np_datetime_strings.c', | |||
'pandas/_libs/src/period_helper.c']}, | |||
'_libs.tslibs.parsing': {'pyxfile': '_libs/tslibs/parsing'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont' see any reason that parsing should not depend on util, otherwise you are adding boilerplate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does depend on util, but specifying util here is redundant with the common_include
that gets pinned on later.
If we changed this entry to:
'_libs.tslibs.parsing': {'pyxfile': '_libs/tslibs/parsing', 'pxdfiles': ['_libs/src/util'], 'include': []},
we would get build errors (or possibly import-time) because util
requires "src/headers/stdint.h" and "src/numpy_helper.h". And numpy_helper requires "helper.h". I'd prefer to specify these all explicitly and not have the common_include
tacked on, but haven't found a way to make this work.
Update On another look, I don't see any reference to util.
Maybe you're thinking of the strptime
PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add util.pxd back into this in the interest of getting this merged; we can figure out what is and isn't necessary in the other threads.
GH won't let me respond inline to |
…libs-parsing Remove unused imports from tslib; test file to import directly
# cython: profile=False | ||
# cython: linetrace=False | ||
# distutils: define_macros=CYTHON_TRACE=0 | ||
# distutils: define_macros=CYTHON_TRACE_NOGIL=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose those were for debugging? (if set to True)
We don't keep those anywhere else (apart from the profile=False), so not sure we should do it here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll be happy to remove these if requested. I habitually keep them around so I don't have to remember what the options are. Say the word and they're gone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import comment and some future suggests. revise and ping on green.
|
||
cdef set _not_datelike_strings = set(['a', 'A', 'm', 'M', 'p', 'P', 't', 'T']) | ||
|
||
NAT_SENTINEL = object() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC we are now defining NAT_SENTINEL in multiple places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, no.
from pandas._libs.tslibs import parsing | ||
from pandas._libs.tslibs.parsing import ( # noqa | ||
parse_time_string, | ||
_format_is_iso, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can come back in future and de-privatize these in tslib.parsing (and change here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added to TODO list in #17652
pandas/core/tools/datetimes.py
Outdated
from pandas._libs.tslibs.parsing import ( # noqa | ||
parse_time_string, | ||
_format_is_iso, | ||
_DATEUTIL_LEXER_SPLIT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think _DATEUTIL_LEXER_SPLIT Is now private to tstlib.parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just pushed a commit that removes this import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments. ping when green.
@@ -66,6 +66,9 @@ from khash cimport ( | |||
kh_init_int64, kh_int64_t, | |||
kh_resize_int64, kh_get_int64) | |||
|
|||
from .tslibs.parsing import parse_datetime_string | |||
from .tslibs.parsing import DateParseError # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is DateParseError actually used anywhere here? it doesn't look like it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Retained it because it is used/imported in many places around pandas (and conceivably downstream). Will change on request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, ok let's fix that in a followup then.
from numpy cimport int64_t, ndarray | ||
np.import_array() | ||
|
||
# Avoid import from outside _libs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this to the Todo list as well
ping |
ok, let's rebase |
thanks! pls add the followups to the list. |
This is part 3 in an N part series of PRs to split
tslib
into thematically distinct modules. The others so far are #17274 and #17342.Moves parsing functions from _libs/src/inference and
core.tools.datetimes
.The
tslibs.parsing
module has no within-pandas dependencies. There are some dateutil workarounds that ideally can be upstreamed.git diff upstream/master -u -- "*.py" | flake8 --diff