BUG: 2D ndarray of dtype 'object' is always copied upon construction #39272
Conversation
Force-pushed from 2ed6a20 to 40b0cea
You will have to measure performance if it changes in any way: run the asv suite and see which benchmarks are affected.

I am not sure how this change actually helps anything, e.g. how does sharing memory actually help here? (Because the second you do almost anything, these blocks will likely get split apart and copied again.)

So having a specific benchmark is really important here; sharing memory is not that interesting, as it's fleeting.
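For context, a minimal reproduction of the copy this PR targets (a sketch; whether the memory ends up shared depends on the pandas version):

```python
import numpy as np
import pandas as pd

arr = np.array([["a", "b"], ["c", "d"]], dtype=object)
df = pd.DataFrame(arr)

# Before this fix, construction always copied a 2D object ndarray, so this
# printed False even when no dtype inference occurred; with the fix it can
# print True.
print(np.shares_memory(df.values, arr))
```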
doc/source/whatsnew/v1.3.0.rst (outdated)

```diff
@@ -231,6 +231,7 @@ Datetimelike
 - Bug in :meth:`DatetimeIndex.intersection`, :meth:`DatetimeIndex.symmetric_difference`, :meth:`PeriodIndex.intersection`, :meth:`PeriodIndex.symmetric_difference` always returning object-dtype when operating with :class:`CategoricalIndex` (:issue:`38741`)
 - Bug in :meth:`Series.where` incorrectly casting ``datetime64`` values to ``int64`` (:issue:`37682`)
 - Bug in :class:`Categorical` incorrectly typecasting ``datetime`` object to ``Timestamp`` (:issue:`38878`)
+- Bug in :func:`pandas.core.internals.construction.init_ndarray` unnecessarily copying all object arrays after datetime inference
```
We don't add a whatsnew note for internal functions; rather, you want to note what a user would care about. So you can say about the same thing, but it's about the DataFrame constructor.
add the issue number
```python
    make_block(dvals_list[n], placement=[n], ndim=2)
    for n in range(len(dvals_list))
]
i = 0
```
Yeah, this is not great. We already do this type of dtype grouping inside the block manager constructors; it's much better to do it there. However, this might be non-trivial and/or might have a big performance impact.

So I would try to re-use one of the existing grouping routines here instead.
Could you point me to the module?
I don't think we have anything quite like this, in part because I wrote something similar in #33597.
What if you don't do the `make_block` calls here, but instead just construct the arrays, and then instead of calling `create_block_manager_from_blocks` at the end, call `create_block_manager_from_arrays`?
can you post one of the actual tracebacks?
Some of these were a missing ValueError in `_form_blocks`, but it still ended up copying everything; that's just how `create_block_manager_from_arrays` is implemented. I've implemented a new `_from_array` constructor, which falls back to `_form_blocks` in the case of a datetime conversion, but otherwise doesn't copy.
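To make the described control flow concrete, a rough sketch of that fallback (illustrative only: the real `_form_blocks` and `make_block` signatures differ, and this is not the PR's actual code):

```python
def _from_array(array, axes):
    # Sketch: wrap a single 2D array in one block without copying,
    # unless datetime inference forces a per-column conversion.
    if is_object_dtype(array.dtype):
        converted = [maybe_infer_to_datetimelike(col) for col in array]
        if any(col is not conv for col, conv in zip(array, converted)):
            # Inference happened: fall back to the general (copying) path.
            return _form_blocks(converted, axes)
    # No inference: a single block referencing the original memory.
    block = make_block(array, placement=slice(len(axes[0])), ndim=2)
    return BlockManager([block], axes)
```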
I tried to avoid adding a kwarg to the constructor, but it seemed like that would get messier.

Side note: after #38939 we can also avoid consolidating in this case, which would avoid making copies.
```python
)
else:
    block_values.extend(
        make_block(
```
what happens here if we have two consecutive datetime64tz columns?
```python
In [5]: block_values
Out[5]:
[DatetimeTZBlock: slice(0, 1, 1), 1 x 1, dtype: datetime64[ns, UTC],
 DatetimeTZBlock: slice(1, 2, 1), 1 x 1, dtype: datetime64[ns, UTC],
 DatetimeTZBlock: slice(2, 3, 1), 1 x 1, dtype: datetime64[ns, UTC]]
```

From what I can gather, the problem is that not all (but some) datetimelike types are …

Edit: ignore this comment.
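For reference, a quick way to see that tz-aware columns each keep their own block, since extension-array-backed blocks are not consolidated (this peeks at the private `_mgr` attribute, so exact behavior may vary across pandas versions):

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.date_range("2021", periods=2, tz="UTC"),
    "b": pd.date_range("2021", periods=2, tz="UTC"),
})
# Two datetime64[ns, UTC] columns stay in two separate DatetimeTZBlocks.
print(df._mgr.nblocks)  # 2
```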
I'm guessing the issue is trying to do …
Force-pushed from 560f894 to 322a677
Force-pushed from 57d8c6b to 1d95a25
I'm a core developer of https://github.com/biolab/orange3, focusing on UX. Orange's data objects are also based on numpy arrays. While you're right that most operations will split it apart anyway, in my use case it'll avoid an unnecessary copy operation.
On it. If none of the benchmarks change, I'll add one.
Force-pushed from cd2281e to 4ef2e7e
Let me know if you prefer to change … As the function is used elsewhere too, it makes a test fail:

```python
def test_constructor_series_copy(self, float_frame):
    series = float_frame._series
    df = DataFrame({"A": series["A"]})
    df["A"][:] = 5
    assert not (series["A"] == 5).all()  # False
```

So, when two columns share the same dtype, the data ends up copied (presumably during consolidation) and the original Series is no longer mutated:

```python
def test_constructor_series_copy(self, float_frame):
    series = float_frame._series
    df = DataFrame({"A": series["A"], "B": series["A"]})
    df["A"][:] = 5
    assert not (series["A"] == 5).all()  # True
```
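A direct way to check what these tests exercise, using NumPy's `np.shares_memory` (the names `s`, `df1`, `df2` are mine, and this reflects pre-copy-on-write behavior):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name="A")

df1 = pd.DataFrame({"A": s})
# One column: the block may simply reference the Series' buffer.
print(np.shares_memory(df1["A"].to_numpy(), s.to_numpy()))  # True

df2 = pd.DataFrame({"A": s, "B": s})
# Two float64 columns are consolidated into a single 2D block, which copies.
print(np.shares_memory(df2["A"].to_numpy(), s.to_numpy()))  # False
```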
pandas/core/internals/managers.py (outdated)

```python
names_idx = names
if names_idx.equals(axes[0]):
    names_indexer = np.arange(len(names_idx))
if len(arrays) == 1:
```
This for sure needs comments, and even better, please separate this out into a new function.
Better?
No, this looks almost like a groupby to me. I am puzzled why the axes are involved here at all.
That's how it was already written. It finds all variables of the same block type, and makes blocks out of them. If you pass in more arrays than names, only those arrays whose names were specified are blockified.

I could rewrite this into a groupby, I guess. It'll create more contiguous blocks than it does currently.

If you look at `create_block_manager_from_blocks`, you'll see that it allows passing a singular array (not block), and it'll create a contiguous block for it. I could do this in `create_block_manager_from_arrays`, but IIRC that made a lot of things fail, like `init_dict`, which uses the same constructor. Maybe #38939 will fix that.

I still think making a new constructor would be more elegant. It would allow removal of this weird bit in `create_block_manager_from_blocks` too:

```python
def create_block_manager_from_blocks(blocks, axes: List[Index]) -> BlockManager:
    try:
        if len(blocks) == 1 and not isinstance(blocks[0], Block):
            # if blocks[0] is of length 0, return empty blocks
            if not len(blocks[0]):
                blocks = []
            else:
                # It's OK if a single block is passed as values, its placement
                # is basically "all items", but if there're many, don't bother
                # converting, it's an error anyway.
                blocks = [
                    make_block(
                        values=blocks[0], placement=slice(0, len(axes[0])), ndim=2
                    )
                ]
        ...
```

It's unintuitive to pass something that's not a block to a method suffixed with `_from_blocks`.
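As a sketch of the "rewrite this into a groupby" idea mentioned above, here is one way to group arrays by dtype so each group becomes a candidate block (`group_arrays_by_dtype` is a hypothetical helper of mine, not pandas code):

```python
from itertools import groupby

import numpy as np

def group_arrays_by_dtype(arrays):
    """Yield (dtype, placement, stacked_values) for each dtype group."""
    indexed = sorted(enumerate(arrays), key=lambda t: str(t[1].dtype))
    for dtype, grp in groupby(indexed, key=lambda t: str(t[1].dtype)):
        grp = list(grp)
        placement = [i for i, _ in grp]
        # Stacking copies; a real implementation would avoid the copy
        # for single-array groups to preserve memory sharing.
        yield dtype, placement, np.vstack([a for _, a in grp])

arrays = [np.arange(3), np.arange(3.0), np.arange(3, 6)]
for dtype, placement, values in group_arrays_by_dtype(arrays):
    print(dtype, placement, values.shape)
```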
pandas/core/internals/managers.py (outdated)

```python
if len(arrays) == 1:
```
Add comments.
Force-pushed from 4ef2e7e to d2facee
```diff
@@ -1845,6 +1845,14 @@ def test_constructor_series_copy(self, float_frame):
 
         assert not (series["A"] == 5).all()
 
+    def test_object_array_does_not_copy(self):
+        a = np.array(["a", "b"], dtype="object")
```
add the issue number as a comment
```python
assert np.shares_memory(df.values, a)
df2 = DataFrame(b)
assert np.shares_memory(df2.values, b)
```
assert the frames are equal as well
Did you mean assert the numpy arrays are equal? They're not equal, but they do share memory:

```
> np.testing.assert_array_equal(df.values, a)
E AssertionError:
E Arrays are not equal
E
E (shapes (2, 1), (2,) mismatch)
E  x: array([['a'],
E        ['b']], dtype=object)
E  y: array(['a', 'b'], dtype=object)
```

(Note also I've split the test into two.)
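To illustrate the shape mismatch above: a 1D array becomes a single DataFrame column, so `df.values` is 2D, while a reshape produces a comparable 2D view without copying.

```python
import numpy as np

a = np.array(["a", "b"], dtype=object)
col = a.reshape(-1, 1)  # 2D view over the same buffer

np.testing.assert_array_equal(col, np.array([["a"], ["b"]], dtype=object))
print(np.shares_memory(a, col))  # True: reshaping does not copy
```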
Force-pushed from c0e144f to d271764
Force-pushed from 3cc977c to 99aca53
I finally figured out what was failing. It was the new ArrayManager; I just needed to add …

The way the new array data management is currently set up, it copies all incoming numpy arrays. Our use case for pandas relies on 'wrapping' numpy arrays without copying them. Please consider this when enforcing the new array data manager, though I'm assuming that won't be standard for a while. Until then, please accept this contribution, which fixes the 2D object array copy bug in the block manager. Note, I've also refactored that bit and made a …

(Also, not sure why it says pandas-dev.pandas failed; all the tests passed.)
IIRC there was some discussion about avoiding this copy; not sure what happened to that. Opening an issue about it might be worthwhile.
```python
# don't convert (and copy) the objects if no type inference occurs
if any(
    not is_dtype_equal(instance.dtype, array.dtype)
    for instance in maybe_datetime
```
Double-check me on this, but I think a more performant alternative would be

```python
obj_columns = [x for x in array]
maybe_datetime = [maybe_infer_to_datetimelike(x) for x in obj_columns]
if any(x is not y for x, y in zip(obj_columns, maybe_datetime)):
    [...]
```
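A quick sanity check of the identity-based approach: `maybe_infer_to_datetimelike` hands back the input object itself when no inference occurs, so `is not` detects a conversion without any dtype comparison. (This relies on a private pandas API whose location and exact return behavior may differ between versions; treat the demo below as an assumption.)

```python
from datetime import datetime

import numpy as np
from pandas.core.dtypes.cast import maybe_infer_to_datetimelike

strings = np.array(["a", "b"], dtype=object)
dates = np.array([datetime(2021, 1, 1), datetime(2021, 1, 2)], dtype=object)

print(maybe_infer_to_datetimelike(strings) is strings)  # True: unchanged
print(maybe_infer_to_datetimelike(dates) is dates)      # False: converted
```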
I'm running asv to see if there's a performance difference
Here's the full asv run comparing the last commit (your suggestion) and the second-to-last commit. Are there any of these in particular you'd like me to rerun?

```
       before            after          ratio
     [dc2ae20d]        [96dd1b9e]
     <2d-object-dont-copy> <2d-object-dont-copy~1>
+ 4.43±0.01ms 94.2±0.3ms 21.27 indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+ 59.0±3ms 75.8±0.4ms 1.28 inference.ToTimedeltaErrors.time_convert('coerce')
+ 45.8±5ms 58.5±5ms 1.28 gil.ParallelGroupbyMethods.time_loop(2, 'last')
+ 4.37±0.2ms 5.34±0.3ms 1.22 rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'max')
+ 5.31±0.1ms 6.45±0.5ms 1.21 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'sum')
+ 4.46±0.1ms 5.40±0.3ms 1.21 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'mean')
+ 5.38±0.2ms 6.51±0.6ms 1.21 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'max')
+ 147±9ms 178±1ms 1.21 inference.ToDatetimeISO8601.time_iso8601_tz_spaceformat
+ 72.1±5ms 86.7±1ms 1.20 inference.ToDatetimeCache.time_dup_string_tzoffset_dates(False)
+ 4.57±0.09ms 5.47±0.3ms 1.20 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'max')
+ 5.23±0.2ms 6.23±0.4ms 1.19 rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'mean')
+ 5.40±0.1ms 6.42±0.5ms 1.19 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'median')
+ 4.29±0.1ms 5.10±0.3ms 1.19 rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'mean')
+ 4.52±0.2ms 5.37±0.2ms 1.19 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'median')
+ 4.45±0.2ms 5.28±0.4ms 1.19 rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'mean')
+ 15.4±0.6μs 18.2±0.3μs 1.19 series_methods.NanOps.time_func('sum', 1000, 'boolean')
+ 4.39±0.1ms 5.20±0.2ms 1.19 rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'max')
+ 5.25±0.2ms 6.23±0.4ms 1.18 rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'median')
+ 5.20±0.2ms 6.16±0.3ms 1.18 rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'sum')
+ 4.37±0.2ms 5.17±0.3ms 1.18 rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'median')
+ 4.53±0.04ms 5.35±0.3ms 1.18 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'min')
+ 4.35±0.1ms 5.13±0.3ms 1.18 rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'sum')
+ 4.40±0.1ms 5.18±0.3ms 1.18 rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'median')
+ 5.25±0.2ms 6.17±0.2ms 1.17 rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'min')
+ 5.39±0.2ms 6.32±0.4ms 1.17 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'min')
+ 58.8±2ms 68.9±7ms 1.17 inference.ToTimedeltaErrors.time_convert('ignore')
+ 4.42±0.2ms 5.17±0.2ms 1.17 rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'min')
+ 5.33±0.1ms 6.23±0.2ms 1.17 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'min')
+ 5.38±0.3ms 6.27±0.4ms 1.17 rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'max')
+ 4.73±0.3ms 5.51±0.4ms 1.17 rolling.Engine.time_expanding_apply('Series', 'int', <function sum>, 'cython', 'max')
+ 5.20±0.1ms 6.06±0.3ms 1.16 rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'min')
+ 4.60±0.2ms 5.36±0.3ms 1.16 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'max')
+ 13.2±0.4μs 15.3±0.3μs 1.16 series_methods.NanOps.time_func('max', 1000, 'boolean')
+ 4.40±0.06ms 5.11±0.4ms 1.16 rolling.Engine.time_rolling_apply('Series', 'float', <function sum>, 'cython', 'sum')
+ 4.57±0.1ms 5.29±0.3ms 1.16 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'sum')
+ 4.67±0.09ms 5.41±0.2ms 1.16 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function sum>, 'cython', 'mean')
+ 5.39±0.1ms 6.23±0.5ms 1.16 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'mean')
+ 5.21±0.1ms 6.02±0.4ms 1.16 rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'max')
+ 4.59±0.03ms 5.29±0.2ms 1.15 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'median')
+ 5.31±0.1ms 6.12±0.3ms 1.15 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'mean')
+ 4.55±0.03ms 5.24±0.3ms 1.15 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'sum')
+ 5.37±0.1ms 6.19±0.4ms 1.15 rolling.Engine.time_rolling_apply('DataFrame', 'int', <function Engine.<lambda>>, 'cython', 'sum')
+ 16.4±0.7μs 18.7±1μs 1.14 series_methods.SearchSorted.time_searchsorted('uint8')
+ 5.26±0.1ms 5.99±0.3ms 1.14 rolling.Engine.time_rolling_apply('Series', 'float', <function Engine.<lambda>>, 'cython', 'median')
+ 4.64±0.1ms 5.29±0.3ms 1.14 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function sum>, 'cython', 'min')
+ 1.17±0.01ms 1.33±0.07ms 1.14 hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 500000)
+ 2.58±0.03μs 2.93±0.08μs 1.13 indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Float64Engine'>, <class 'numpy.float64'>), 'monotonic_incr')
+ 5.21±0.2ms 5.91±0.3ms 1.13 rolling.Engine.time_rolling_apply('Series', 'int', <function Engine.<lambda>>, 'cython', 'mean')
+ 4.28±0ms 4.84±0.5ms 1.13 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 10000, datetime.timezone(datetime.timedelta(seconds=3600)))
+ 4.51±0.2ms 5.10±0.2ms 1.13 rolling.Engine.time_rolling_apply('Series', 'int', <function sum>, 'cython', 'min')
+ 5.50±0.2ms 6.21±0.2ms 1.13 rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda>>, 'cython', 'max')
+ 16.5±0.4μs 18.6±1μs 1.13 series_methods.SearchSorted.time_searchsorted('int16')
+ 2.57±0.01μs 2.89±0.2μs 1.13 indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int64Engine'>, <class 'numpy.int64'>), 'monotonic_incr')
+ 452±0.8ms 508±50ms 1.12 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 1000000, datetime.timezone(datetime.timedelta(seconds=3600)))
+ 806±20μs 904±40μs 1.12 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ne>)
+ 3.20±0.04μs 3.59±0.2μs 1.12 tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'month', 'QS', 3)
+ 13.2±0.5μs 14.8±0.6μs 1.12 series_methods.NanOps.time_func('min', 1000, 'boolean')
+ 121±5μs 136±4μs 1.12 multiindex_object.GetLoc.time_large_get_loc
+ 47.6±0.2ms 53.0±1ms 1.11 rolling.TableMethod.time_apply('single')
+ 4.17±0.02μs 4.64±0.2μs 1.11 categoricals.CategoricalSlicing.time_getitem_scalar('non_monotonic')
+ 15.9±0.3ms 17.7±0.2ms 1.11 inference.ToTimedelta.time_convert_string_days
+ 741±20μs 824±20μs 1.11 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function eq>)
+ 732±20μs 813±20μs 1.11 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function le>)
+ 3.32±0.03μs 3.67±0.2μs 1.11 tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'year', 'B', 5)
+ 3.44±0.02μs 3.81±0.3μs 1.11 tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'start', 'year', 'B', 5)
+ 113±0.7μs 125±1μs 1.11 indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
+ 4.16±0.02μs 4.60±0.4μs 1.10 categoricals.CategoricalSlicing.time_getitem_scalar('monotonic_incr')
+ 13.0±0.2μs 14.3±0.2μs 1.10 series_methods.NanOps.time_func('sum', 1000, 'Int64')
+ 60.3±1ms 66.6±1ms 1.10 rolling.Groupby.time_rolling_int('mean')
+ 14.4±0.1μs 15.9±0.5μs 1.10 series_methods.NanOps.time_func('min', 1000, 'Int64')
+ 60.7±1ms 67.0±1ms 1.10 rolling.Groupby.time_rolling_int('max')
+ 61.5±1ms 67.9±1ms 1.10 rolling.Groupby.time_rolling_int('median')
+ 544±20ns 600±30ns 1.10 index_cached_properties.IndexCache.time_inferred_type('Int64Index')
+ 15.8±0.04μs 17.4±0.8μs 1.10 inference.MaybeConvertObjects.time_maybe_convert_objects
+ 437±0.7ms 482±40ms 1.10 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 1000000, None)
+ 60.5±2ms 66.6±1ms 1.10 rolling.Groupby.time_rolling_int('sum' (0))
+ 3.36±0.01μs 3.70±0.2μs 1.10 tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'end', 'year', 'B', 12)
+ 3.24±0.04μs 3.57±0.2μs 1.10 tslibs.fields.TimeGetStartEndField.time_get_start_end_field(0, 'start', 'quarter', 'QS', 5)
+ 142±2μs 157±1μs 1.10 indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+ 41.1±0.3ms 45.2±1ms 1.10 frame_methods.Isnull.time_isnull_obj
+ 3.47±0.02μs 3.82±0.2μs 1.10 tslibs.fields.TimeGetStartEndField.time_get_start_end_field(1, 'end', 'quarter', 'QS', 12)
- 8.68±0.5μs 7.89±0.02μs 0.91 tslibs.resolution.TimeResolution.time_get_resolution('h', 0, datetime.timezone(datetime.timedelta(seconds=3600)))
- 8.79±0.6μs 7.98±0.04μs 0.91 tslibs.resolution.TimeResolution.time_get_resolution('D', 1, datetime.timezone(datetime.timedelta(seconds=3600)))
- 18.0±1μs 16.3±0.05μs 0.91 tslibs.resolution.TimeResolution.time_get_resolution('ns', 1, tzlocal())
- 9.93±0.6μs 9.01±0.02μs 0.91 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 4000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 1.10±0.09s 998±5ms 0.91 arithmetic.OffsetArrayArithmetic.time_add_series_offset(<CustomBusinessMonthEnd> (0))
- 2.71±0.2μs 2.46±0μs 0.91 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('date', 0, None)
- 9.98±0.7μs 9.05±0.03μs 0.91 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 2.75±0.2μs 2.50±0.02μs 0.91 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, tzlocal())
- 2.72±0.2μs 2.46±0μs 0.91 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, None)
- 9.98±0.5μs 9.03±0.09μs 0.91 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 5000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 9.11±0.3μs 8.24±0.3μs 0.90 tslibs.timestamp.TimestampProperties.time_weekday_name(tzlocal(), None)
- 9.91±0.6μs 8.96±0.1μs 0.90 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 12000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 10.0±0.6μs 9.06±0.04μs 0.90 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 3000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 10.1±0.8μs 9.09±0.03μs 0.90 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 8000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 9.94±0.7μs 8.99±0.1μs 0.90 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 10000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 19.8±1μs 17.9±0.05μs 0.90 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(100, 2000, <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
- 1.11±0.09s 998±3ms 0.90 arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<CustomBusinessMonthEnd> (0))
- 9.91±0.6μs 8.95±0.08μs 0.90 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 7000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 9.99±0.5μs 9.01±0.09μs 0.90 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 2.58±0.2μs 2.32±0.01μs 0.90 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, datetime.timezone.utc)
- 9.97±0.7μs 8.96±0.06μs 0.90 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 11000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 29.3±0.3ms 26.2±0.5ms 0.90 rolling.Apply.time_rolling('Series', 3, 'int', <built-in function sum>, False)
- 10.1±0.7μs 9.03±0.06μs 0.89 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 9000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 2.60±0.2μs 2.32±0.01μs 0.89 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, datetime.timezone.utc)
- 10.0±0.7μs 8.98±0.04μs 0.89 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 8.65±0.2μs 7.73±0.1μs 0.89 tslibs.period.TimePeriodArrToDT64Arr.time_periodarray_to_dt64arr(1, 7000)
- 9.99±0.7μs 8.93±0.05μs 0.89 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 2011, datetime.timezone(datetime.timedelta(seconds=3600)))
- 26.7±2μs 23.8±0.2μs 0.89 tslibs.timestamp.TimestampOps.time_normalize(tzlocal())
- 2.77±0.2μs 2.47±0.02μs 0.89 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('time', 0, None)
- 6.29±0.6μs 5.60±0.08μs 0.89 tslibs.timedelta.TimedeltaConstructor.time_from_unit
- 11.6±0.8μs 10.3±0.1μs 0.89 tslibs.timestamp.TimestampOps.time_replace_None(datetime.timezone(datetime.timedelta(seconds=3600)))
- 10.1±0.6μs 8.95±0.06μs 0.89 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 10000, datetime.timezone(datetime.timedelta(seconds=3600)))
- 5.64±0.3μs 5.00±0.04μs 0.89 tslibs.timestamp.TimestampOps.time_normalize(None)
- 58.8±3μs 51.9±0.6μs 0.88 tslibs.timestamp.TimestampOps.time_normalize(tzfile('/usr/share/zoneinfo/Asia/Tokyo'))
- 3.19±0.3μs 2.80±0.02μs 0.88 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('date', 1, None)
- 9.72±1μs 8.54±0.05μs 0.88 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, datetime.timezone(datetime.timedelta(seconds=3600)))
- 2.85±0.2μs 2.50±0.01μs 0.88 tslibs.tslib.TimeIntsToPydatetime.time_ints_to_pydatetime('timestamp', 0, tzlocal())
- 1.27±0.03ms 1.11±0.06ms 0.87 arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function mul>)
- 15.5±2μs 13.6±0.2μs 0.87 tslibs.timedelta.TimedeltaConstructor.time_from_components
- 8.74±1μs 7.57±0.05μs 0.87 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 6000, <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
- 8.05±0.8μs 6.92±0.03μs 0.86 tslibs.timestamp.TimestampOps.time_replace_None(<DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>)
- 1.59±0.01ms 1.36±0.01ms 0.85 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(100, 4000, tzlocal())
- 6.61±0.6μs 5.65±0.08μs 0.85 tslibs.timedelta.TimedeltaConstructor.time_from_string
- 24.1±2μs 20.5±0.1μs 0.85 tslibs.timestamp.TimestampOps.time_normalize(datetime.timezone(datetime.timedelta(seconds=3600)))
- 11.7±0.05ms 9.84±0.06ms 0.84 indexing.NumericSeriesIndexing.time_loc_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
- 1.66±0.1ms 1.39±0.07ms 0.84 series_methods.NanOps.time_func('prod', 1000000, 'int8')
- 10.8±0.2μs 9.05±0.04μs 0.84 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(0, 4006, datetime.timezone(datetime.timedelta(seconds=3600)))
- 10.7±0.3μs 8.95±0.2μs 0.83 tslibs.period.TimeDT64ArrToPeriodArr.time_dt64arr_to_periodarr(1, 1011, datetime.timezone(datetime.timedelta(seconds=3600)))
- 1.76±0.2ms 1.46±0.3ms 0.83 index_cached_properties.IndexCache.time_engine('MultiIndex')
- 57.4±0.4ms 34.8±0.07ms 0.61 algos.isin.IsinWithArange.time_isin(<class 'numpy.object_'>, 8000, 0)
- 93.1±1ms 400±3μs 0.00 hash_functions.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 1000000)
```
Reran `time_getitem_list_like`; now it swung in the other direction, so it seems like it's just doing whatever it feels like.

```
       before            after          ratio
     [dc2ae20d]        [96dd1b9e]
     <2d-object-dont-copy> <2d-object-dont-copy~1>
- 59.7±0.08ms 3.17±0.02ms 0.05 indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
```
This is what it looks like now:

```python
if dtype is None and is_object_dtype(values.dtype):
    obj_columns = [x for x in values]
    maybe_datetime = [maybe_infer_to_datetimelike(x) for x in obj_columns]
    # don't convert (and copy) the objects if no type inference occurs
    if any(x is not y for x, y in zip(obj_columns, maybe_datetime)):
        dvals_list = [ensure_block_shape(dval, 2) for dval in maybe_datetime]
        block_values = [
            new_block(dvals_list[n], placement=n, ndim=2)
            for n in range(len(dvals_list))
        ]
    else:
        nb = new_block(values, placement=slice(len(columns)), ndim=2)
        block_values = [nb]
else:
    nb = new_block(values, placement=slice(len(columns)), ndim=2)
    block_values = [nb]
```

Still not too pretty because of the double else. We could avoid it by abstracting the boolean condition into a separate private function. However, without a walrus operator we'd need to call …
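For illustration, one shape that refactor could take with the walrus operator; `_maybe_converted` is a hypothetical helper name of mine, not the PR's code:

```python
def _maybe_converted(values):
    # Hypothetical helper: return the per-column datetimelike conversions
    # if any column actually changed, else None.
    cols = list(values)
    converted = [maybe_infer_to_datetimelike(x) for x in cols]
    if any(x is not y for x, y in zip(cols, converted)):
        return converted
    return None

if (
    dtype is None
    and is_object_dtype(values.dtype)
    and (converted := _maybe_converted(values)) is not None
):
    block_values = [
        new_block(ensure_block_shape(dval, 2), placement=n, ndim=2)
        for n, dval in enumerate(converted)
    ]
else:
    nb = new_block(values, placement=slice(len(columns)), ndim=2)
    block_values = [nb]
```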
I'm not bothered by this at all.
Thanks @irgolic for sticking with it! Very nice!
This change is aesthetically unpleasing; it indents too far. #26825 already suggested separating this part into a function. Please let me know what exactly I should abstract into a function and where to put it. Alternatively, amend this PR yourself.

Thanks for pandas. :)