DOC: Clarify 'public-ish' API for packages using pandas. #5460


Closed
jtratner opened this issue Nov 7, 2013 · 6 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@jtratner
Contributor

jtratner commented Nov 7, 2013

I've been contributing a little to GeoPandas and working on some custom pandas-based code that I want to share with others, and I realized that I wasn't sure what is and is not part of the public API for pandas.

Here's what I have in my head:

Definitely public:

  • anything in the pandas top-level namespace (except for modules imported into that namespace)
  • non-_ classmethods on NDFrame
  • DateOffsets (in that they have a set interface), but not necessarily direct use of get_offset and friends

Somewhat public:

  • Subset of Index methods:
    • Definitely union, intersection, difference, levels, labels, names (and the set methods for them)
    • Somewhat: get_indexer, get_indexer_non_unique, groupby, get_loc, slice_locs, equals, identical, values property
    • Not public: is_, is_unique, lexsort properties on MI
  • get_offset and DateOffset subclasses
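A minimal sketch of the Index methods listed above, to make the discussion concrete. The index contents are made up for illustration, and the sketch does not by itself say which of these calls pandas guarantees as public.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

# Set-like operations from the "definitely" group.
union = left.union(right)
intersection = left.intersection(right)
difference = left.difference(right)

# Label-to-position lookups from the "somewhat" group.
loc_b = left.get_loc("b")                      # position of a single label
positions = left.get_indexer(["c", "z", "a"])  # -1 marks a label that is absent
start, stop = left.slice_locs("a", "b")        # bounds for a label slice

# equals compares the elements of two indexes.
same = left.equals(pd.Index(["a", "b", "c"]))
```

Downstream packages tend to lean on exactly these lookups when aligning their own objects against a pandas index, which is why their stability matters.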

Nice to make public right now:

  • compat module (useful to provide this for modules that depend on pandas...at least in terms of guaranteeing that whatever's in the namespace now will go through a deprecation period).
  • test utils (otherwise have to reinvent the wheel)
  • cache_readonly decorator

Internal methods/properties that subclasses may rely on/manipulate:

  • __finalize__
  • _metadata property
  • _constructor, _constructor_sliced
  • _internal_names (maybe - weird behavior with __setattr__ if you don't do this)
  • _reset_cache, _update_inplace, possibly _maybe_update_cacher
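For readers unfamiliar with these hooks, here is a rough sketch of how a subclass might use `_metadata`, `_constructor`, and `_constructor_sliced`. The class names `TaggedFrame` and `TaggedSeries` and the `source` attribute are hypothetical, chosen only for illustration.

```python
import pandas as pd


class TaggedSeries(pd.Series):
    # Names listed in _metadata are treated as regular attributes and are
    # carried over to results by __finalize__.
    _metadata = ["source"]

    @property
    def _constructor(self):
        return TaggedSeries


class TaggedFrame(pd.DataFrame):
    _metadata = ["source"]

    @property
    def _constructor(self):
        # Results of the same dimension keep the subclass.
        return TaggedFrame

    @property
    def _constructor_sliced(self):
        # Results of lower dimension (a single column) use the Series subclass.
        return TaggedSeries


frame = TaggedFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
frame.source = "sensor-7"  # hypothetical metadata value

subset = frame[["x"]]  # still a TaggedFrame, with `source` carried over
column = frame["x"]    # a TaggedSeries
```

Because every name here starts with an underscore, subclass authors have no guarantee that these hooks keep working across releases, which is the gap this issue is about.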

Things we could consider making public in the future:

  • many of the is_* methods in core/common.

Am I missing anything here? Does this make sense?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@hayd
Contributor

hayd commented Feb 25, 2014

+1 on making those ones you mention public right now, and esp. things which subclasses may rely on.

@cpcloud
Member

cpcloud commented May 27, 2014

@shoyer want to continue the conversation here from #7243? Interested to hear what kind of indexes you want to create

@shoyer
Member

shoyer commented May 28, 2014

OK, let me give two concrete examples of new types of indices I'd like to be able to integrate with pandas:

  1. CellIndex: an index where individual values correspond to intervals between a start and a stop value (given by integers, floating point numbers or datetimes). Typically, these intervals would have a fixed size and be non-overlapping. The idea is to have a natural representation of the grid cells used in physical models. For example, all latitude intervals at 0.5 degree increments, or datetime intervals between 12:00Z and 11:59Z the following day (note that because of the offset, the latter cannot be represented with a PeriodIndex). Ideally, one could do a label-based look-up by a value, and it would automatically select the right bin. Note: It occurs to me now that these "cells" are somewhat similar to the proposed CategoricalBlock (WIP: categoricals as an internal CategoricalBlock GH5313 #7217).
  2. An index wrapper Index that adds new functionality. For example, in xray we have xray.Coordinate, which is essentially a generic wrapper for pandas.Index objects with a bit of new functionality:
    • It can store arbitrary metadata along with the index in an OrderedDict (the attrs attribute).
    • It's not necessary for all the values to be loaded into memory (e.g., from disk or over a network) until they are actually needed.
    • It also has a few other minor tweaks to make things more convenient for us, e.g., how it handles mathematical operations.
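The value-to-cell look-up described in the first example can be sketched with the standard library alone. `CellLookup` and its `get_loc` method are hypothetical names for illustration, not a pandas or xray API, and the sketch assumes half-open cells with strictly increasing edges.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


class CellLookup:
    """Hypothetical stand-in for the CellIndex idea.

    Cell i covers the half-open interval [edges[i], edges[i + 1]).
    """

    def __init__(self, edges):
        edges = list(edges)
        if len(edges) < 2 or any(a >= b for a, b in zip(edges, edges[1:])):
            raise ValueError("edges must be strictly increasing, with at least two values")
        self.edges = edges

    def get_loc(self, value):
        # Reject values outside the covered range instead of guessing a cell.
        if not (self.edges[0] <= value < self.edges[-1]):
            raise KeyError(value)
        # bisect_right counts the edges <= value; the cell starts at the last of them.
        return bisect_right(self.edges, value) - 1


# Latitude cells at 0.5 degree increments.
latitudes = CellLookup([0.0, 0.5, 1.0, 1.5])
lat_cell = latitudes.get_loc(0.25)

# Daily cells that start at 12:00 rather than midnight.
noon = datetime(2014, 5, 28, 12)
days = CellLookup([noon + timedelta(days=n) for n in range(4)])
day_cell = days.get_loc(datetime(2014, 5, 29, 3))
```

Here a timestamp at 03:00 on May 29 falls in the cell that began at 12:00 on May 28, which is the offset behaviour that a PeriodIndex cannot express.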

On a related note: I can't even make this second type of index convert properly when used as the argument of pandas.Index. Despite my Index wrapper implementing an __array__ method, pandas checks for explicit ndarray subclasses and otherwise falls back to converting via list in com._asarray_tuplesafe. This means I end up with a list of my 0-dimensional labeled arrays:

import xray
import pandas as pd
x = xray.Coordinate('x', ['a', 'b', 'c'])
print 'what I get:'
print pd.Index(x)
print 'what I want:'
print pd.Index(['a', 'b', 'c'])
what I get:
Index([<xray.Variable ()>\narray('a', \n      dtype='|S1')\nAttributes:\n    Empty, <xray.Variable ()>\narray('b', \n      dtype='|S1')\nAttributes:\n    Empty, <xray.Variable ()>\narray('c', \n      dtype='|S1')\nAttributes:\n    Empty], dtype='object')
what I want:
Index([u'a', u'b', u'c'], dtype='object')

(Of course, my Index wrapper is neither an actual pandas.Index nor numpy.ndarray subclass, although it is both ndarray-like and index-like.) Actually, I think I will submit a PR to fix this.
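A minimal sketch of the kind of ndarray-like wrapper being described. `LazyLabels` is a hypothetical class, not `xray.Coordinate`; it assumes only that NumPy calls `__array__` when asked to convert an object. The explicit `np.asarray` call keeps the example independent of how a given pandas version handles such objects.

```python
import numpy as np
import pandas as pd


class LazyLabels:
    """Hypothetical ndarray-like wrapper that loads its values on first use."""

    def __init__(self, loader, attrs=None):
        self._loader = loader   # callable returning the labels
        self._values = None     # nothing is loaded yet
        self.attrs = dict(attrs or {})

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when converting the object to an array.
        if self._values is None:
            self._values = np.asarray(self._loader())
        values = self._values if dtype is None else self._values.astype(dtype)
        return values.copy() if copy else values


labels = LazyLabels(lambda: ["a", "b", "c"], attrs={"units": "none"})

# Converting through __array__ yields the plain labels rather than a list
# of wrapped 0-dimensional objects.
index = pd.Index(np.asarray(labels))
```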

@jreback
Contributor

jreback commented May 29, 2014

@shoyer this sounds reasonable. I think you could put that in #7270. Then I think you can simply try adding an index that quacks like an Index. Please report any further issues and we'll see what needs modification.

shoyer added a commit to shoyer/pandas that referenced this issue May 29, 2014
…an Index

This allows custom ndarray-like objects which aren't actual ndarrays
to be smoothly cast to a pandas.Index:
pandas-dev#5460 (comment)
@shoyer
Member

shoyer commented May 30, 2014

Thanks for the quick merge with #7270!

That provides an immediate fix: my "index wrappers" (my case 2) can now be converted into real pandas.Index objects, but there's still no duck typing for indices. It looks like the place to add that would be pandas.core.index._ensure_index. I will investigate further to see if I can get that working.

shoyer added a commit to shoyer/pandas that referenced this issue Jun 15, 2014
It turns out that the ndarray-like arrays of dtype datetime64 were
not being properly cast to an Index, because -- due to a bug with
np.datetime64 -- calling np.asarray(x, dtype=object) if x is an
ndarray of type datetime64 results in an *integer* array.

This PR adds tests and a work around to pd.Index.__new__.

Related pandas-dev#5460
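The datetime64 behaviour mentioned in the commit message above can be reproduced with NumPy alone. This sketch shows only the symptom for nanosecond-resolution data and is not the workaround from the PR; the timestamp is made up for illustration.

```python
import datetime

import numpy as np

# Nanosecond resolution, the unit pandas uses for its datetime data.
stamps_ns = np.array(["2014-06-15T00:00:00"], dtype="datetime64[ns]")
as_objects_ns = np.asarray(stamps_ns, dtype=object)

# Microsecond resolution fits in datetime.datetime, so the cast keeps datetimes.
stamps_us = np.array(["2014-06-15T00:00:00"], dtype="datetime64[us]")
as_objects_us = np.asarray(stamps_us, dtype=object)

# The nanosecond cast yields integers (nanoseconds since the epoch).
is_int = isinstance(as_objects_ns[0], int)
is_datetime = isinstance(as_objects_us[0], datetime.datetime)
```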
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@mroeschke
Member

I think https://pandas.pydata.org/pandas-docs/stable/reference/index.html satisfies describing what is public. Closing, but happy to reopen if things need clarifying.
