- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 19.2k
Description
Two proposals:
Consolidate all inference to the Index constructor
- Retain Index(...)inferring the best container for the data passed
- Remove MultiIndex(data)returning anIndexwhen data is a list of length-1 tuples (xref API: Have MultiIndex consturctors always return a MI #17236)
Passing dtype=object disables inference
Index(..., dtype=object) disable all inference. So Index([1, 2], dtype=object) will give you an Index instead of Int64Index, and Index([(1, 'a'), (2, 'b')], dtype=object) an Index instead of MultiIndex, etc.
(original post follows)
Or how much magic should we have in the Index constructors? Currently we infer the index type from the data, which is often convenient, but sometime difficult to reason able behavior. e.g. hash_tuples currently doesn't work if your tuples all happen to be length 1, since it uses a MultiIndex internally.
Do we want to make our Index constructors more predictable? For reference, here are some examples:
>>> import pandas as pd
# 1.) Index -> MultiIndex
>>> pd.Index([(1, 2), (3, 4)])
MultiIndex(levels=[[1, 3], [2, 4]],
           labels=[[0, 1], [0, 1]])
>>> pd.Index([(1, 2), (3, 4)], tupleize_cols=False)
Index([(1, 2), (3, 4)], dtype='object')
# 2.) Index -> Int64Index
>>> pd.Index([1, 2, 3, 4, 5])
Int64Index([1, 2, 3, 4, 5], dtype='int64')
# 3.) Index -> RangeIndex
>>> pd.Index(range(1, 5))
RangeIndex(start=1, stop=5, step=1)
# 4.) Index -> DatetimeIndex
>>> pd.Index([pd.Timestamp('2017'), pd.Timestamp('2018')])
DatetimeIndex(['2017-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)
# 5.) Index -> IntervalIndex
>>> pd.Index([pd.Interval(3, 4), pd.Interval(4, 5)])
IntervalIndex([(3, 4], (4, 5]]
              closed='right',
              dtype='interval[int64]')
# 5.) MultiIndex -> Index
>>> pd.MultiIndex.from_tuples([(1,), (2,), (3,)])
Int64Index([1, 2, 3], dtype='int64')Of these, I think the first (Index -> MultiIndex if you have tuples) and the last (MultiIndex -> Index if you're tuples are all length 1) are undesirable. The Index -> MultiIndex one has the tupleize_cols keyword to control this behavior. In #17236 I add an analogous keyword to the MI constructor. The rest are probably fine, but I don't have any real reason for saying that [1, 2, 3] magically returning an Int64Index is ok, but [(1, 2), (3, 4)] returning a MI isn't (maybe the difference between a MI and Index is larger than the difference between an Int64Index and Index?). I believe that in either the RangeIndex or IntervalIndex someone (@shoyer?) had objections to overloading the Index constructor to return the specialized type.
So, what should we do about these? Leave them as is? Deprecate the type inference? My vote is for merging #17236 and leaving everything else as is. To me, it's not worth breaking API over.