Separate parsing functions out from tslib #17363

Merged Sep 26, 2017 (20 commits)

jbrockmendel
Member

This is part 3 in an N part series of PRs to split tslib into thematically distinct modules. The others so far are #17274 and #17342.

Moves parsing functions from _libs/src/inference and core.tools.datetimes.

The tslibs.parsing module has no within-pandas dependencies. There are some dateutil workarounds that ideally can be upstreamed.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Move parsing functions from _libs/src/inference and core.tools.datetimes
@gfyoung added the Clean and Internals labels Aug 28, 2017
codecov bot commented Aug 28, 2017

Codecov Report

Merging #17363 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17363      +/-   ##
==========================================
- Coverage   91.25%   91.24%   -0.02%     
==========================================
  Files         163      163              
  Lines       49808    49734      -74     
==========================================
- Hits        45454    45378      -76     
- Misses       4354     4356       +2
Flag Coverage Δ
#multiple 89.03% <100%> (-0.01%) ⬇️
#single 40.33% <54.54%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/io/date_converters.py 100% <100%> (ø) ⬆️
pandas/core/indexes/base.py 96.28% <100%> (ø) ⬆️
pandas/io/parsers.py 95.48% <100%> (ø) ⬆️
pandas/core/tools/datetimes.py 83.79% <100%> (-1.46%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e0fe5cc...f89d11e.

Contributor

@jreback left a comment

pls just move code and make minimal changes. you are adding a bunch of cython decorators which don't actually do anything for a python called function.

overall this looks ok, but it's very hard to tell if you changed things (which it is clear you did as you had to change some calling code)

@@ -726,6 +579,11 @@ def parse_time_string(arg, freq=None, dayfirst=None, yearfirst=None):
-------
datetime, datetime/dateutil.parser._result, str
"""
res = tslib.parse_time_string(arg, freq, dayfirst, yearfirst)
Contributor

why is this needed?

Member Author

This wrapping belongs in tslib, will be fixed shortly.

@@ -1178,6 +1179,8 @@ class Period(_Period):
value = str(value)
value = value.upper()
dt, _, reso = parse_time_string(value, freq)
if dt is NAT_SENTINEL:
Contributor

why this change?

Member Author

tslibs.parsing does not have NaT in the namespace, so it returns NAT_SENTINEL in places where it otherwise would return NaT. That should be wrapped in tslib, will update.

@@ -0,0 +1,688 @@
#!/usr/bin/env python
Contributor

I don't think we add executable pre-amble to any other .pyx files

Member Author

Will remove.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# cython: profile=False
# cython: linetrace=False
Contributor

can you add a comment on the purpose of this module

from numpy cimport int64_t, ndarray
np.import_array()

# Avoid import from outside _libs
Contributor

separately (:>), consider adding a compat for .pyx which can be imported (could also simply be a .pyx that is included) to provide things like this

Member Author

Good idea.

Contributor

add this to the Todo list as well

# This allows us to reference NaT without having to import it


@cython.locals(date_string=object, freq=object)
Contributor

there is not much point in typing these (e.g. using cython locals) and is just noise.

if you really think things will be better via typing (and you are actually calling from another cdef function), then make this a cpdef (or use a cdef and make the def call the cdef)

Member Author

Will remove.

dt = du_parse(date_string, dayfirst=dayfirst,
yearfirst=yearfirst, **kwargs)
return dt
try:
Contributor

blank line in between clauses

Member Author

This was copy-pasted, will change.

yearfirst=yearfirst)


@cython.locals(date_string=object, freq=object)
Contributor

same

return parsed, parsed, reso


@cython.returns(cython.bint)
Contributor

these are not typical, again you don't need to type everything. if it's a perf issue, pls demonstrate it.

Member Author

The existing version is in tslib, has this typing information, though not using the decorators. I'll revert from pure-python-mode to be identical to existing for now.

return True


@cython.returns(object)
Contributor

same

Member Author

The existing implementation in tslib:

cdef inline object _parse_dateabbr_string(object date_string, object default,
                                           object freq):

I don't mind reverting, generally just like not having to wrap lines.

Remove cython decorators

Move wrapping of parse_time_string to tslib
Contributor

@jreback left a comment

I am generally ok with this, but left some comments. ping when ready for re-look.


#----------------------------------------------------------------------
# Parsing
# Wrap tslibs.parsing functions to return `NaT` instead of `NAT_SENTINEL`
Contributor

not sure what this comment means here

Member Author

Just removed.

# Wrap tslibs.parsing functions to return `NaT` instead of `NAT_SENTINEL`


def parse_time_string(arg, freq=None, dayfirst=None, yearfirst=None):
Contributor

why is this here? it should be in tslib.parsing, if you need NaT inside the function there, simply import it (inside the function)

Member Author

Was trying to avoid runtime import; changed in just-pushed update.
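
The "import it inside the function" suggestion is the usual deferred-import pattern. A minimal sketch, assuming a toy function `parse_or_null` (hypothetical) and using `math.nan` as a stand-in for `NaT`:

```python
def parse_or_null(text):
    """Return int(text), or a null marker that is imported only when needed."""
    try:
        return int(text)
    except ValueError:
        # Deferred import: resolved when this branch runs, not when the
        # module is first imported, so there is no import-time dependency.
        # `math.nan` stands in for NaT here (assumption for this sketch).
        from math import nan
        return nan
```

The trade-off behind "trying to avoid runtime import" is that the import statement runs on every call that reaches it. After the first call it is only a lookup in the module cache, which is cheap but not free in a hot path.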


# The canonical place for this appears to be in frequencies.pyx.
@cython.returns(object)
@cython.locals(source=object, default=object)
Contributor

please please don't use any non-standard cython decorators that are NOT already present in the code base. This introduces too much overhead (when should I use them or not). If you want to introduce them, please do so in another PR that adds them everywhere (or at the very least an issue to keep track of where they should be).

Member Author

Just changed, will make a note of this going forward.

I may revisit this (in a separate PR(s)) after cython 0.27 becomes mainstream. It would be nice to have code be valid python where possible.

Contributor

jreback commented Sep 6, 2017

again a full run asv would be nice

@jbrockmendel
Member Author

I'll run the whole suite next; here are results from just -b timeseries

asv continuous -f 1.1 -E virtualenv master HEAD -b timeseries
[...]
       before           after         ratio
     [36dadd70]       [c52c7968]
+     19.2±0.08μs       22.6±0.1μs     1.17  timeseries.Offsets.time_custom_bday_incr
+           1.49s            1.64s     1.10  timeseries.Iteration.time_iter_periodindex
-      23.9±0.1μs       21.6±0.1μs     0.91  timeseries.Offsets.time_timeseries_year_incr
-     18.1±0.09μs      16.2±0.06μs     0.89  timeseries.Offsets.time_timeseries_year_apply
-         129±5μs        115±0.2μs     0.89  timeseries.DatetimeIndex.time_unique
-      46.2±0.8μs       40.9±0.2μs     0.88  timeseries.SemiMonthOffset.time_end_apply
-     7.23±0.05μs       6.00±0.1μs     0.83  timeseries.DatetimeIndex.time_timestamp_tzinfo_cons
-      2.02±0.2ms         1.67±0ms     0.82  timeseries.DatetimeIndex.time_add_timedelta

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

@jbrockmendel
Member Author

Full results:

asv continuous -f 1.1 -E virtualenv master HEAD
[...]

       before           after         ratio
     [36dadd70]       [c52c7968]
!           38.4s           failed      n/a  gil.nogil_datetime_fields.time_period_to_datetime
+         273±3ns         270±30ms 991429.63  indexing.MultiIndexing.time_is_monotonic
+        55.1±2μs       1.02±0.01s 18449.64  indexing.Int64Indexing.time_loc_scalar
+        484±20μs        299±300ms   618.36  indexing.Int64Indexing.time_ix_list_like
+     1.96±0.08ms       1.16±0.03s   590.47  indexing.Int64Indexing.time_loc_array
+         235±5μs         136±40ms   577.51  indexing.MultiIndexing.time_series_xs_mi_ix
+        660±30μs         254±10ms   384.79  indexing.Int64Indexing.time_loc_list_like
+      17.9±200ms            3.66s   204.48  join_merge.ConcatFrames.time_c_ordered_axis0
+       128±100ms            2.94s    22.98  join_merge.ConcatFrames.time_c_ordered_axis1
+        415±50ms            3.90s     9.39  gil.nogil_take1d_float64.time_nogil_take1d_float64
+        436±80ms            3.77s     8.65  gil.nogil_take1d_int64.time_nogil_take1d_int64
+      7.88±0.1ms       57.0±0.8ms     7.24  binary_ops.TimeseriesTZ.time_timestamp_ops_diff1
+       75.6±30ms          416±7ms     5.50  inference.to_numeric_downcast.time_downcast('datetime64', 'integer')
+      80.6±0.9ms         428±40ms     5.30  index_object.SetOperations.time_int64_union
+           124ms            606ms     4.87  packers.JSON.time_write_json_mixed_float_int_T
+           1.36s            6.56s     4.84  join_merge.ConcatPanels.time_c_ordered_axis0
+        526±20ms            2.48s     4.72  indexing.MultiIndexing.time_multiindex_large_get_loc_warm
+           1.62s            7.58s     4.68  join_merge.ConcatPanels.time_f_ordered_axis1
+           1.65s            7.34s     4.46  join_merge.ConcatPanels.time_f_ordered_axis2
+           1.67s            7.35s     4.40  join_merge.ConcatPanels.time_c_ordered_axis1
+        241±20ms         964±10ms     4.00  indexing.Int64Indexing.time_getitem_lists
+        236±10ms         910±10ms     3.85  indexing.Int64Indexing.time_getitem_array
+           2.28s            8.56s     3.75  join_merge.ConcatPanels.time_c_ordered_axis2
+          55.7ms            206ms     3.70  packers.JSON.time_write_json
+        245±10ms          904±4ms     3.69  indexing.Int64Indexing.time_getitem_list_like
+           124ms            456ms     3.69  packers.JSON.time_write_json_T
+         227±3ms         793±20ms     3.50  join_merge.Align.time_series_align_left_monotonic
+         342±8ms       1.17±0.03s     3.41  index_object.SetOperations.time_int64_symmetric_difference
+         119±1ms         404±10ms     3.40  inference.to_numeric_downcast.time_downcast('string-float', None)
+       125±0.7ms         400±10ms     3.19  inference.to_numeric_downcast.time_downcast('string-float', 'signed')
+        306±10ms         974±30ms     3.18  join_merge.Align.time_series_align_int64_index
+        355±10ms            1.10s     3.10  indexing.StringIndexing.time_get_value
+         363±6ms            1.09s     3.00  indexing.StringIndexing.time_getitem_label_slice
+         128±4ms          352±6ms     2.75  inference.to_numeric_downcast.time_downcast('string-float', 'unsigned')
+     1.54±0.04ms       4.19±0.2ms     2.72  indexing.Int64Indexing.time_ix_array
+       218±0.3ms          590±1ms     2.70  inference.to_numeric_downcast.time_downcast('string-nint', 'signed')
+         197±2ms         504±10ms     2.56  inference.to_numeric_downcast.time_downcast('string-nint', 'unsigned')
+       220±0.6ms         548±20ms     2.49  inference.to_numeric_downcast.time_downcast('string-int', 'signed')
+       204±0.5ms          492±4ms     2.41  inference.to_numeric_downcast.time_downcast('string-nint', None)
+          75.5ms            168ms     2.23  packers.JSON.time_write_json_mixed_float_int_str
+       236±0.2ms       504±0.01ms     2.14  inference.to_numeric_downcast.time_downcast('string-int', 'integer')
+       219±0.2ms         448±20ms     2.04  inference.to_numeric_downcast.time_downcast('string-nint', 'float')
+           1.77s            3.60s     2.04  join_merge.ConcatPanels.time_f_ordered_axis0
+       256±0.4ms       500±0.04ms     1.95  inference.to_numeric_downcast.time_downcast('string-nint', 'integer')
+         268±2ms            509ms     1.90  frame_methods.Dropna.time_count_level_axis0_mixed_dtypes_multi
+      54.3±0.5μs       83.8±0.1μs     1.54  indexing.Int64Indexing.time_ix_slice
+      33.6±0.4μs         50.6±3μs     1.51  indexing.Int64Indexing.time_iloc_list_like
+     7.17±0.07ms       10.7±0.5ms     1.49  timeseries.DatetimeIndex.time_add_offset_fast
+        271±80ms         403±30ms     1.49  index_object.SetOperations.time_int64_difference
+         278±8μs         376±30μs     1.36  indexing.MultiIndexing.time_frame_xs_mi_ix
+       457±0.4μs        602±0.4μs     1.32  groupby.GroupBySuite.time_mean('int', 100)
+           10.5s            13.7s     1.30  join_merge.MergeCategoricals.time_merge_cat
+           324ms            417ms     1.28  gil.nogil_read_csv.time_read_csv
+      73.0±0.7ms         91.7±6ms     1.26  gil.nogil_rolling_algos_fast.time_nogil_rolling_kurt
+        341±30ms         425±10ms     1.25  indexing.MultiIndexing.time_multiindex_get_indexer
+      26.1±200ms           32.0ms     1.22  join_merge.ConcatFrames.time_f_ordered_axis0
+        991±70μs      1.21±0.03ms     1.22  inference.to_numeric_downcast.time_downcast('datetime64', None)
+         474±4ms            579ms     1.22  frame_methods.Dropna.time_dropna_axis0_all_mixed_dtypes
+      23.0±0.2ms       27.0±0.3ms     1.18  packers.packers_read_hdf_store.time_packers_read_hdf_store
+     9.43±0.05ms       10.9±0.3ms     1.15  algorithms.Algorithms.time_factorize_string
+     3.52±0.02ms      4.05±0.03ms     1.15  timeseries.SeriesArithmetic.time_add_offset_delta
+        1.26±0ms       1.44±0.1ms     1.15  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('BusinessDay', 2)
+       131±0.9ms          148±1ms     1.13  inference.to_numeric_downcast.time_downcast('string-float', 'integer')
+     41.5±0.06μs       46.5±0.2μs     1.12  timeseries.SemiMonthOffset.time_begin_apply
+     4.51±0.05ms      5.03±0.02ms     1.12  algorithms.Algorithms.time_add_overflow_zero_scalar
+     3.97±0.03ms      4.40±0.01ms     1.11  io_bench.read_uint64_integers.time_read_uint64_neg_values
+     4.04±0.03ms       4.45±0.2ms     1.10  io_bench.read_uint64_integers.time_read_uint64_na_values
+      236±0.04ms         260±40ms     1.10  inference.to_numeric_downcast.time_downcast('string-int', 'unsigned')
-        81.2±2ms         73.8±2ms     0.91  frame_methods.Dropna.time_dropna_axis1_any
-         498±6ms        452±0.2ms     0.91  timeseries.ToDatetime.time_format_no_exact
-     19.4±0.06μs       17.5±0.2μs     0.90  timeseries.Offsets.time_custom_bday_apply_dt64
-      2.69±0.2ms      2.42±0.01ms     0.90  rolling.SeriesRolling.time_rolling_skew
-     1.71±0.03ms      1.54±0.01ms     0.90  reshape.melt_dataframe.time_melt_dataframe
-     1.17±0.03ms      1.05±0.02ms     0.90  replace.replace_replacena.time_replace_replacena
-      9.82±0.2ms      8.76±0.07ms     0.89  groupby.groupby_size.time_groupby_size
-      22.0±0.3ms       19.6±0.2ms     0.89  frame_methods.frame_fancy_lookup.time_frame_fancy_lookup_all
-     27.6±0.06μs      24.5±0.05μs     0.89  timeseries.Offsets.time_custom_bday_cal_incr
-     1.17±0.03ms      1.04±0.02ms     0.89  replace.replace_fillna.time_replace_fillna
-     6.36±0.01μs         5.63±0μs     0.89  timeseries.DatetimeIndex.time_timestamp_tzinfo_cons
-     1.34±0.01ms         1.18±0ms     0.88  frame_ctor.FrameConstructorDTIndexFromOffsets.time_frame_ctor('Day', 2)
-        437±10μs         384±20μs     0.88  frame_methods.FrameIsnull.time_isnull
-         367±2ms          323±1ms     0.88  frame_methods.Dropna.time_dropna_axis0_all
-      17.8±0.1μs      15.3±0.07μs     0.86  timeseries.Offsets.time_timeseries_year_apply
-      10.9±0.2μs      9.10±0.04μs     0.83  period.period_standard_indexing.time_get_loc
-           1.14s          939±2ms     0.82  reshape.reshape_unstack_large_single_dtype.time_unstack_with_mask
-        75.8±3ms         59.8±3ms     0.79  series_methods.series_isin_int64.time_series_isin_int64_large
-      67.4±0.3ms       52.2±0.2ms     0.77  parser_vb.read_csv_categorical.time_convert_post
-           146ms            112ms     0.77  gil.nogil_kth_smallest.time_nogil_kth_smallest
-           791ms        577±0.6ms     0.73  groupby.Groups.time_groupby_groups('int64_small')
-         433±4ms          303±3ms     0.70  timeseries.AsOfDataFrame.time_asof
-         448±9ms          308±6ms     0.69  timeseries.AsOfDataFrame.time_asof_nan
-      54.3±0.2ms      36.6±0.07ms     0.67  parser_vb.read_csv_categorical.time_convert_direct
-           1.48s            906ms     0.61  groupby.Groups.time_groupby_groups('int64_large')
-           1.57s            949ms     0.60  groupby.Groups.time_groupby_groups('object_small')
-        19.3±1ms       10.4±0.7ms     0.54  gil.nogil_read_csv.time_read_csv_object
-      31.0±0.3ms      15.6±0.09ms     0.51  parser_vb.read_csv2.time_comment
-           563ms            204ms     0.36  packers.JSON.time_write_json_mixed_float_int
-         422±4ms          128±2ms     0.30  panel_ctor.Constructors4.time_panel_from_dict_two_different_indexes
-           2.78s          758±1ms     0.27  index_object.Multi2.time_sortlevel_int64
-         873±9ms          227±4ms     0.26  reshape.reshape_pivot_time_series.time_reshape_pivot_time_series
-           793ms          204±4ms     0.26  frame_methods.frame_duplicated.time_frame_duplicated
-           1.14m            15.4s     0.23  gil.nogil_datetime_fields.time_datetime_field_year
-           391ms           85.8ms     0.22  packers.JSON.time_write_json_mixed_delta_int_tstamp
-           20.3s            3.64s     0.18  join_merge.MergeCategoricals.time_merge_object
-           3.85s          577±6ms     0.15  reshape.reshape_unstack_large_single_dtype.time_unstack_full_product
-           912ms          136±7ms     0.15  index_object.Multi1.time_duplicated
-       35.5±20ms         5.24±1ms     0.15  categoricals.Categoricals.time_union

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Contributor

jreback commented Sep 6, 2017

so the benchmarks are telling u that something is amiss

these are generally pretty stable

so run the ones that are out of whack and try to narrow it down

@jbrockmendel
Member Author

asv continuous -f 1.1 -E virtualenv master HEAD -b indexing.MultiIndexing.time_is_monotonic
[...]
       before           after         ratio
     [36dadd70]       [c52c7968]
+         206±6ms         329±40ms     1.60  indexing.MultiIndexing.time_is_monotonic

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Run again immediately:

asv continuous -f 1.1 -E virtualenv master HEAD -b indexing.MultiIndexing.time_is_monotonic
[...]
       before           after         ratio
     [36dadd70]       [c52c7968]
-        400±20ms         197±20ms     0.49  indexing.MultiIndexing.time_is_monotonic

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

> so the benchmarks are telling u that something is amiss

Something is amiss, but that is orthogonal to the PR, which is just a refactor.
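
The two back-to-back runs above disagree in direction, which is what makes them hard to act on. A minimal sketch of that check, assuming the `-f 1.1` threshold is applied symmetrically (flag when the ratio exceeds the factor or falls below its reciprocal); `classify` and `is_consistent` are hypothetical helpers, not part of asv:

```python
def classify(ratio, factor=1.1):
    """Label an after/before ratio against a symmetric threshold."""
    if ratio > factor:
        return "regression"
    if ratio < 1.0 / factor:
        return "improvement"
    return "unchanged"


def is_consistent(ratios, factor=1.1):
    """True only if every run lands in the same category."""
    return len({classify(r, factor) for r in ratios}) == 1


# Ratios reported above for indexing.MultiIndexing.time_is_monotonic
runs = [1.60, 0.49]
labels = [classify(r) for r in runs]
```

With these inputs the same benchmark is labelled a regression in one run and an improvement in the next, so neither result says much about the change under test.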

Contributor

jreback commented Sep 7, 2017

> Something is amiss, but that is orthogonal to the PR, which is just a refactor.

it may be, but until these show stable results and we can tell whether something is changed, can't merge these.

@jbrockmendel
Member Author

Setting the affinity seems to have made a difference:

taskset 04 asv continuous -f 1.1 -E virtualenv master HEAD -b timeseries
[...]
       before           after         ratio
     [36dadd70]       [c52c7968]
+         321±2ms         459±20ms     1.43  timeseries.AsOfDataFrame.time_asof_nan
+     19.0±0.09μs       24.1±0.1μs     1.27  timeseries.Offsets.time_timeseries_year_incr
+         317±2ms          375±5ms     1.18  timeseries.AsOfDataFrame.time_asof
+      77.0±0.4μs       85.1±0.2μs     1.11  timeseries.AsOfDataFrame.time_asof_single_early
-     27.1±0.06μs      24.4±0.09μs     0.90  timeseries.Offsets.time_custom_bday_cal_incr
-      19.5±0.1μs      16.9±0.09μs     0.87  timeseries.Offsets.time_custom_bday_apply_dt64
-         519±2ms        438±0.7ms     0.84  timeseries.ToDatetime.time_format_no_exact

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Immediate re-run of the ones that showed a change

taskset 04 asv continuous -f 1.1 -E virtualenv master HEAD -b timeseries.AsOfDataFrame -b timeseries.Offsets -b timeseries.ToDatetime
[...]
       before           after         ratio
     [36dadd70]       [c52c7968]
+         324±3ms          449±2ms     1.39  timeseries.AsOfDataFrame.time_asof_nan
+     15.6±0.07μs      19.5±0.04μs     1.25  timeseries.Offsets.time_custom_bday_apply
+     19.4±0.03μs      22.4±0.06μs     1.16  timeseries.Offsets.time_timeseries_day_apply
+      20.7±0.1μs      23.9±0.09μs     1.15  timeseries.Offsets.time_timeseries_day_incr
+     28.7±0.09μs       31.9±0.2μs     1.11  timeseries.Offsets.time_custom_bday_cal_incr_neg_n
-      28.2±0.1μs      25.4±0.09μs     0.90  timeseries.Offsets.time_custom_bday_cal_incr_n
-     30.6±0.09μs      27.5±0.03μs     0.90  timeseries.Offsets.time_custom_bday_decr
-         252±6μs          225±3μs     0.89  timeseries.Offsets.time_custom_bmonthbegin_decr_n

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
taskset 05 asv continuous -f 1.1 -E virtualenv master HEAD -b timeseries.AsOfDataFrame -b timeseries.Offsets -b timeseries.ToDatetime
[...]
       before           after         ratio
     [36dadd70]       [c52c7968]
+       220±0.6μs         246±30μs     1.12  timeseries.Offsets.time_custom_bmonthbegin_decr_n
+       212±0.1μs          236±1μs     1.11  timeseries.Offsets.time_custom_bmonthend_incr_n
+        2.99±0ms      3.30±0.01ms     1.10  timeseries.ToDatetime.time_iso8601
-     20.5±0.04μs      18.2±0.06μs     0.89  timeseries.Offsets.time_custom_bday_incr
-      33.7±0.1μs      28.8±0.06μs     0.86  timeseries.Offsets.time_custom_bday_cal_incr_neg_n

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

Of these, the only one that looks related to this PR is timeseries.ToDatetime.time_iso8601
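
For readers unfamiliar with the `taskset 04` and `taskset 05` prefixes: the argument is a hexadecimal CPU bitmask, where bit 0 is CPU 0. A small sketch that decodes such a mask (`mask_to_cpus` is a hypothetical helper written for this note):

```python
def mask_to_cpus(mask_text):
    """Decode a taskset-style hexadecimal affinity mask into CPU indices."""
    mask = int(mask_text, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]
```

So `04` pins the run to CPU 2 only, while `05` allows CPUs 0 and 2, which may be one reason the two pinned runs above still differ. On Linux, `os.sched_setaffinity` offers a comparable control from inside Python.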

@jbrockmendel
Member Author

For each of parsing, strptime, frequencies, timezones, and offsets, if and when these refactorings are merged I'll put together targeted tests and benchmarks for each module. Modularity FTW.

Made this locally long ago; for some reason it is not getting reflected on GH when I push. It's a mystery.

Contributor

jreback commented Sep 8, 2017

needs a rebase

Contributor

@jreback left a comment

rebase as well

@@ -0,0 +1,682 @@
# -*- coding: utf-8 -*-
# cython: profile=False
Contributor

this should be in tslibs/parsing

Member

how is it possible that ci is green given that all references in setup.py refer to tslibs/parsing.pyx ?

Member

Ah, the file is just duplicated in the diff, that's why it still works :-).
@jbrockmendel so this file can just be removed (if they are identical)

Member Author

To get asv to run a while back I had to move tslibs/parsing.pyx to just parsing.pyx. This is an artifact that shouldn't have gotten pushed. Will remove.



class DateParseError(ValueError):
pass
Member

Is this an error that is raised to users or only used internally?
In the first case we should add this to pandas.errors

Member Author

It can be raised externally. Is currently in tslib. I think there's been some effort to avoid importing non-cython modules into _libs.foo
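
Because `DateParseError` subclasses `ValueError`, code that already catches `ValueError` keeps working wherever the class ends up living. A minimal sketch, where `parse_year` and `safe_parse_year` are hypothetical toy functions and only the class definition mirrors the diff:

```python
class DateParseError(ValueError):
    pass


def parse_year(text):
    """Toy parser (hypothetical) that raises DateParseError on bad input."""
    if not text.isdigit():
        raise DateParseError("could not parse {!r}".format(text))
    return int(text)


def safe_parse_year(text):
    """Caller that knows nothing about DateParseError specifically."""
    try:
        return parse_year(text)
    except ValueError:  # also catches DateParseError
        return None
```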

_format_is_iso,
_DATEUTIL_LEXER_SPLIT,
_guess_datetime_format,
NAT_SENTINEL,
Member

is this import needed? (it was not here before)

parse_datetime_string,
parse_time_string,
_does_string_look_like_datetime,
parse_datetime_string_with_reso)
Member

also those ones. Do you just import them to keep them in the namespace of tslib as before? Are they actually used from here somewhere?

Member Author

Most of these are just for backward compat in case some user somewhere is using them. I'll be happy to remove those.

Contributor

same comment as above. don't import these in tslib.pyx unless they are actually used, rather change to directly import

Contributor

can u address these comments

Member Author

Sure, just pushed. Retained the from tslibs.parsing import DateParseError # noqa as it is referenced in a handful of places around pandas (and conceivably downstream), removed all the others.

return res


def parse_datetime_string_with_reso(date_string, freq=None, dayfirst=False,
Member

maybe not for this PR if we want to keep it 'moving only', but after moving it would make sense to me to combine this with the function above, as this is the only place where it is used

return dt


def parse_time_string(arg, freq=None, dayfirst=None, yearfirst=None):
Member

Do you know what the difference is between this parse_time_string and the parse_datetime_string above?
From a superficial look they seem very similar

Member Author

You're not wrong.

Member

If you do not tackle that in this PR, would you like to open an issue for it?

Member Author

Just added this to #17652

try_parse_date_and_time,
try_parse_year_month_day,
try_parse_datetime_components)

Contributor

I don't think you actually need to import these here, rather import them directly where they are used

parse_datetime_string,
parse_time_string,
_does_string_look_like_datetime,
parse_datetime_string_with_reso)
Contributor

Same comment as above: don't import these in tslib.pyx unless they are actually used; instead, import them directly where they are needed.

class DateParseError(ValueError):
pass

_nat_strings = set(['NaT', 'nat', 'NAT', 'nan', 'NaN', 'NAN'])
Contributor

_nat_strings should be in datetime.pxd (or maybe in util.pxd)

Member Author

Given the recent build-based threads, I'd like to (at least for now) avoid introducing this dependency, especially for something this small. You're right that eventually we'll want to put these constants all in one place.


cdef set _not_datelike_strings = set(['a', 'A', 'm', 'M', 'p', 'P', 't', 'T'])

NAT_SENTINEL = object()
Contributor

?

return ret, reso


# The canonical place for this appears to be in frequencies.pyx.
Contributor

did you add this comment?

Member Author

That was a note to self, can be removed for now. Eventually this function will belong in tslibs.frequencies.



#----------------------------------------------------------------------
# Miscellaneous functions moved from core.tools.datetimes
Contributor

not a useful comment (the miscellaneous is fine)

setup.py Outdated
@@ -485,6 +486,7 @@ def pxd(name):
'sources': ['pandas/_libs/src/datetime/np_datetime.c',
'pandas/_libs/src/datetime/np_datetime_strings.c',
'pandas/_libs/src/period_helper.c']},
'_libs.tslibs.parsing': {'pyxfile': '_libs/tslibs/parsing'},
Contributor

I don't see any reason that parsing should not depend on util; otherwise you are adding boilerplate.

Member Author

@jbrockmendel jbrockmendel Sep 17, 2017

It does depend on util, but specifying util here is redundant with the common_include that gets pinned on later.

If we changed this entry to:

'_libs.tslibs.parsing': {'pyxfile': '_libs/tslibs/parsing', 'pxdfiles': ['_libs/src/util'], 'include': []},

we would get build errors (or possibly import-time) because util requires "src/headers/stdint.h" and "src/numpy_helper.h". And numpy_helper requires "helper.h". I'd prefer to specify these all explicitly and not have the common_include tacked on, but haven't found a way to make this work.

Update: On another look, I don't see any reference to util. Maybe you're thinking of the strptime PR?
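
To illustrate the common_include point, here is a minimal sketch of how a per-extension entry might be resolved, assuming the build script falls back to a shared include list whenever an entry omits 'include'. The names resolve_include and COMMON_INCLUDE, and the directory values, are hypothetical and not copied from setup.py.

```python
# Hypothetical stand-in for the shared include directories that the build
# script attaches to extensions that do not specify their own.
COMMON_INCLUDE = ["pandas/_libs/src/klib", "pandas/_libs/src"]


def resolve_include(entry, common_include=COMMON_INCLUDE):
    """Return the include dirs for one ext_data-style entry.

    An entry with no 'include' key inherits the shared list; an explicit
    value (even an empty list) overrides it.
    """
    return entry.get("include", common_include)


implicit = {"pyxfile": "_libs/tslibs/parsing"}
explicit = {"pyxfile": "_libs/tslibs/parsing",
            "pxdfiles": ["_libs/src/util"],
            "include": []}

print(resolve_include(implicit))  # inherits the shared list
print(resolve_include(explicit))  # empty, so shared headers are not on the path
```

Under that assumption, listing util explicitly while also emptying 'include' would drop the directories that util's headers live in, which matches the build errors described above.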

Member Author

I'll add util.pxd back into this in the interest of getting this merged; we can figure out what is and isn't necessary in the other threads.

@jbrockmendel
Member Author

GH won't let me respond inline to the "?" comment following NAT_SENTINEL. It is a placeholder for NaT to avoid a circular import to/from tslib.
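
A minimal sketch of that placeholder pattern, assuming a low-level module that must signal "this is NaT" without importing the module that defines NaT. The names parse_or_sentinel, to_timestamp, and FakeNaT are hypothetical and only illustrate the idea; they are not the functions in this PR.

```python
from datetime import datetime

# Low-level side: a unique marker object, so this module never has to
# import the higher-level module where the real NaT lives.
NAT_SENTINEL = object()
_nat_strings = {"NaT", "nat", "NAT", "nan", "NaN", "NAN"}


def parse_or_sentinel(date_string):
    """Return NAT_SENTINEL for NaT-like strings, else a datetime."""
    if date_string in _nat_strings:
        return NAT_SENTINEL
    return datetime.strptime(date_string, "%Y-%m-%d")


# Higher-level side: owns the real missing-value object and swaps it in.
class FakeNaT:
    """Stand-in for NaT in this sketch."""


NaT = FakeNaT()


def to_timestamp(date_string):
    result = parse_or_sentinel(date_string)
    # Identity check: only the sentinel itself is translated to NaT.
    return NaT if result is NAT_SENTINEL else result


print(to_timestamp("nan") is NaT)
print(to_timestamp("2017-09-26"))
```

The dependency only points one way: the higher-level code imports the parser and the sentinel, and the parser imports nothing from above it.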

…libs-parsing

Remove unused imports from tslib; test file to import directly
# cython: profile=False
# cython: linetrace=False
# distutils: define_macros=CYTHON_TRACE=0
# distutils: define_macros=CYTHON_TRACE_NOGIL=0
Member

I suppose those were for debugging? (if set to True)
We don't keep those anywhere else (apart from the profile=False), so not sure we should do it here

Member Author

I'll be happy to remove these if requested. I habitually keep them around so I don't have to remember what the options are. Say the word and they're gone.

Contributor

@jreback jreback left a comment

Import comment and some future suggestions. Revise and ping on green.


cdef set _not_datelike_strings = set(['a', 'A', 'm', 'M', 'p', 'P', 't', 'T'])

NAT_SENTINEL = object()
Contributor

IIRC we are now defining NAT_SENTINEL in multiple places?

Member Author

I don't think so, no.

from pandas._libs.tslibs import parsing
from pandas._libs.tslibs.parsing import ( # noqa
parse_time_string,
_format_is_iso,
Contributor

We can come back in the future and de-privatize these in tslibs.parsing (and change them here).

Member Author

Just added to TODO list in #17652

from pandas._libs.tslibs.parsing import ( # noqa
parse_time_string,
_format_is_iso,
_DATEUTIL_LEXER_SPLIT,
Contributor

I think _DATEUTIL_LEXER_SPLIT is now private to tslibs.parsing.

Member Author

Just pushed a commit that removes this import.

@jreback jreback added this to the 0.21.0 milestone Sep 24, 2017
Contributor

@jreback jreback left a comment

minor comments. ping when green.

@@ -66,6 +66,9 @@ from khash cimport (
kh_init_int64, kh_int64_t,
kh_resize_int64, kh_get_int64)

from .tslibs.parsing import parse_datetime_string
from .tslibs.parsing import DateParseError # noqa
Contributor

is DateParseError actually used anywhere here? it doesn't look like it.

Member Author

No. Retained it because it is used/imported in many places around pandas (and conceivably downstream). Will change on request.

Contributor

hmm, ok let's fix that in a followup then.
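
A short sketch of why keeping the name importable matters, assuming downstream code catches the exception either as ValueError or by name. DateParseError is defined as in the diff (a ValueError subclass); the parse_year helper is hypothetical.

```python
class DateParseError(ValueError):
    pass


def parse_year(text):
    """Hypothetical helper: parse a four-digit year or raise DateParseError."""
    if len(text) != 4 or not text.isdigit():
        raise DateParseError("Could not parse {!r}".format(text))
    return int(text)


# Callers that catch ValueError keep working because of the subclassing...
caught_as_value_error = False
try:
    parse_year("20x7")
except ValueError as err:
    caught_as_value_error = isinstance(err, DateParseError)

# ...while callers that import DateParseError by name need the old import
# location to stay valid, which is the reason given for retaining it.
caught_by_name = False
try:
    parse_year("")
except DateParseError:
    caught_by_name = True
```

Removing the re-export would only affect the second kind of caller, which is why it can be handled separately in a follow-up.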

from numpy cimport int64_t, ndarray
np.import_array()

# Avoid import from outside _libs
Contributor

add this to the Todo list as well

@jbrockmendel jbrockmendel mentioned this pull request Sep 25, 2017
59 tasks
@jbrockmendel
Member Author

ping

@jreback
Contributor

jreback commented Sep 25, 2017

ok, let's rebase

@jreback jreback merged commit 7e87385 into pandas-dev:master Sep 26, 2017
@jreback
Contributor

jreback commented Sep 26, 2017

thanks! pls add the followups to the list.

@jbrockmendel jbrockmendel deleted the tslibs-parsing branch October 30, 2017 16:23
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Labels
Clean Internals Related to non-user accessible pandas implementation
4 participants