
Support for multidimensional dtypes #3443

Closed
alexbw opened this issue Apr 24, 2013 · 30 comments
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions Enhancement

Comments

@alexbw

alexbw commented Apr 24, 2013

With 0.11 out, Pandas supports more dtypes than before, which is very useful to us science folks. However, some data is intrinsically multi-dimensional, high enough dimensional so that using labels on columns is impractical (for instance, images).

I understand DataFrames or Panels are usually the recommended panacea for this problem. This works if the datatype doesn't have any annotation. For instance, for each frame of a video, I have electrophysiology traces, timestamps, and environmental variables measured.

I have a working solution where I explicitly separate out the non-scalar data from the scalar data. I use Pandas exclusively for the scalar data, and then a dictionary of multi-D arrays for the array data.
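That split can be sketched roughly like this (a minimal sketch; all names and shapes are hypothetical, not from the issue):

```python
import numpy as np
import pandas as pd

n_frames = 5

# Scalar, per-frame measurements live in a DataFrame indexed by frame...
scalars = pd.DataFrame(
    {"velocity": np.random.random(n_frames),
     "temperature": 20 + np.random.random(n_frames)},
    index=pd.RangeIndex(n_frames, name="frame"),
)

# ...while the multi-dimensional streams live in a plain dict of arrays,
# keyed along the same frame axis.
arrays = {
    "images": np.zeros((n_frames, 64, 64), dtype="f4"),
    "traces": np.zeros((n_frames, 3, 3), dtype="f4"),
}

# pandas does the selection on the scalars; numpy fancy-indexing carries
# that selection over to the array data
fast = scalars[scalars["velocity"] > 0.5]
subset = arrays["images"][fast.index.to_numpy()]
```

The cost is that the two halves have to be kept in sync by hand.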

What is the work and overhead involved in supporting multi-D data types? I would love to keep my entire ecosystem in Pandas, as it's much faster and richer than just NumPy data wrangling.

See below for the code that I hope is possible to run, with fixes.

If you can point me to a place in the codebase where I can tinker, that would also be much appreciated.

import numpy as np
import pandas as pd
mydtype = np.dtype('(3,3)f4')
pd.Series(np.zeros(3,), dtype=mydtype)
Exception: Data must be 1-dimensional
@ghost

ghost commented Apr 24, 2013

My suspicion is that you're making this more complex rather than simpler.
However, there is the NDPanel if you want it, and you can do:

In [26]: import numpy as np
    ...: import pandas as pd
    ...: f = lambda: np.random.random((3, 3))
    ...: s = pd.Series([f() for i in range(10)], dtype='O')

In [27]: s.iloc[0]
Out[27]: 
array([[ 0.90986552,  0.30234529,  0.98927833],
       [ 0.40467537,  0.17912555,  0.06101674],
       [ 0.6623446 ,  0.69192764,  0.39398118]])

@jreback
Contributor

jreback commented Apr 24, 2013

@alexbw but keep in mind, this is really not efficient (in numpy or pandas), as this is now an object array, which cannot vectorize operations. You are much better off keeping your scalar data separate from your images. (I believe we went through an exercise in saving these to/from HDF a while ago.)
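The efficiency difference can be seen directly: a Series of matrices falls back to dtype=object, while stacking the same matrices into one contiguous array keeps reductions vectorized (a minimal sketch, not from the thread):

```python
import numpy as np
import pandas as pd

mats = [np.random.random((3, 3)) for _ in range(4)]

# Packing the matrices into a Series forces dtype=object: each element is
# a full ndarray, and pandas/numpy cannot vectorize across them.
s = pd.Series(mats)

# Stacking them into one contiguous float array keeps operations at C speed.
stacked = np.stack(mats)            # shape (4, 3, 3), dtype float64
means = stacked.mean(axis=(1, 2))   # one vectorized reduction over all matrices
```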

@alexbw
Author

alexbw commented Apr 24, 2013

Yep, your suggestions are now in production here, and it's working fine keeping scalars and higher-D arrays separate. Just checking in to see if any of the dtype improvements might make my use case a little more feasible. Doesn't seem so.


@jreback
Contributor

jreback commented Apr 24, 2013

I don't know if I pointed this out before, but this might work for you: http://pandas.pydata.org/pandas-docs/dev/dsintro.html#panel4d-experimental (again, the data doesn't have to be homogeneous in dtype, but it does have to be homogeneous in shape, which may or may not help)

@alexbw
Author

alexbw commented Apr 24, 2013

Each data stream (image, velocity, temperature) is homogeneous within itself, but they're all different sizes. That's the clincher, and it seems like it's not on the horizon to be supported here. Blaze seems to be going in this direction, supporting heterogeneous data shapes.


@jreback
Contributor

jreback commented Apr 24, 2013

Heterogeneous data shapes are non-trivial. Blaze does seem headed in that direction, but I'm not sure when that will happen.

@alexbw
Author

alexbw commented Apr 24, 2013

Ok. And here, by "non-trivial", do you mean that Pandas has no plans to support a feature like this?


@jreback
Contributor

jreback commented Apr 24, 2013

What I mean is that to be efficient about it you would have to have a structure that is essentially a dictionary of 'stuff', where the stuff could be heterogeneously shaped. This is really best done in a class, and it's application-specific. You can do it with a frame/panel/whatever as @y-p shows above, but it is not 'efficient', in that numpy holds onto the dtype as 'object'.

When I say efficient I mean that you can move operations down to a lower level (C level) in order to do things. I am not even sure Blaze will do this; it's really quite specific (they are about supporting chunks and operating on those, but those chunks are actually the same shapes, except for the one dim where the chunking occurs).

There is a tradeoff between code complexity, runtime efficiency, and generality. You basically have to choose where you are on that 3-d surface. Pandas has moderate complexity and generality and high runtime efficiency. I would say numpy is lower complexity, lower generality, with similar runtime efficiency. I would guess that Blaze is going to be more complex, higher efficiency (in cases of out-of-core datasets), and about the same generality as numpy (as they are aiming to replace numpy).

So even if someone had the urge to create what you are doing, they are going to have to create a new structure to hold it.

It comes down to what your bottlenecks are; maybe getting more specific will help.

@cpcloud
Member

cpcloud commented Apr 28, 2013

@jreback Just out of curiosity, what is the long-term goal of pandas in this vein? If Blaze is to replace numpy, will pandas diverge from numpy altogether, or will it use Blaze as the backend? I see talk about making Series independent of ndarray in the near future for pickle support and other reasons.

@jreback
Contributor

jreback commented Apr 28, 2013

I don't see pandas being incompatible with Blaze at all. My understanding (just from reading the blog) is that Blaze is supposed to be the next-gen numpy. I think their API will necessarily be very similar to what it is now, and thus be pretty transparent to pandas.

My concerns now are availability and especially compatibility of their product, as it has a fairly complicated build scheme.
In addition, they seem to want to incorporate index-like features (kind of like labelled arrays). If they do, great; I am sure pandas might use some of that infrastructure.

I think pandas fills a somewhat higher-level view of data right now (and will continue to do so).

As far as your specific comments, I have pushed a PR to decouple Series from ndarray (Index also needs this addressed). Thus pandas will be somewhat easier to modify in its backend without front-end (API) visibility (so this is a good thing).

Supporting arbitrary dshapes within pandas' existing objects IMHO is not that useful right now.

@jreback
Contributor

jreback commented Apr 28, 2013

@wesm chime in?

@alexbw
Author

alexbw commented Aug 15, 2013

Any chance at all of this seeing some love?

@cpcloud
Member

cpcloud commented Aug 15, 2013

I think something like a RelationalDataFrame or RelationalSomethingOrOther would be useful here. The idea would be to have a collection of NDFrames that share one or more common axes (Index objects). That way you could keep things separate without the complexity of nd-dtypes. Of course, you now have the complexity of an in-memory relational database.

In this case you could have an object where all objects share the "video frame axis" possibly more if needed.

Maybe it could have a query method similar to the 0.13 query method, except the namespace would be expanded to include all the objects on the RelationalThingaMaBob.
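A toy sketch of what such a container might look like (entirely hypothetical; RelationalFrames, its members, and its methods are made up for illustration, not a pandas API):

```python
import numpy as np
import pandas as pd

class RelationalFrames:
    """Hypothetical container: a bag of DataFrames sharing one common
    index, so a single row selection applies to every member."""

    def __init__(self, **frames):
        first = next(iter(frames.values())).index
        # every member must share the common axis
        assert all(f.index.equals(first) for f in frames.values())
        self.frames = frames

    def select(self, labels):
        # one selection, applied to every member frame
        return {name: f.loc[labels] for name, f in self.frames.items()}

idx = pd.RangeIndex(4, name="video_frame")
rel = RelationalFrames(
    scalars=pd.DataFrame({"velocity": np.arange(4.0)}, index=idx),
    ephys=pd.DataFrame(np.zeros((4, 3)), index=idx),
)
picked = rel.select([0, 2])  # every member sliced to frames 0 and 2
```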

@jreback
Contributor

jreback commented Aug 15, 2013

@alexbw this is pretty non-trivial, mainly because numpy doesn't support it (ATM), though Blaze is supposed to.

@cpcloud has a nice idea: essentially an object to hold DataFrames that has axes that are alignable (think of it as Panel-like), but you could have a mixture too, e.g. for each object only align on certain axes.

@cpcloud
Member

cpcloud commented Aug 15, 2013

I could maybe see this being implemented by a generalization of BlockManager where the blocks themselves are pandas objects.

@alexbw
Author

alexbw commented Aug 16, 2013

@cpcloud I really like this idea of a RelationalDataFrame. Then, hopefully, we'd be able to do HDFStore-type select operations. This would have incredible power for a lot of applications in the biological and physical sciences (my field).

@wesm
Member

wesm commented Aug 16, 2013

I have thought about the nested dtype problem and how pandas could offer a solution for that. It's tricky because it doesn't really fit with the DataFrame data model and implementation. In some sense what is needed is a more rigid table data structure that sits someplace in between NumPy structured arrays and DataFrame. I have actually been building something like this in recent months but I will not be able to release the source code for a while.

@cpcloud
Member

cpcloud commented Aug 16, 2013

@wesm torture!

@alexbw
Author

alexbw commented Aug 16, 2013

@wesm Looking forward to it, when it's ready.


@alexbw
Author

alexbw commented Oct 22, 2013

Any thoughts on this, @cpcloud ?

@shoyer
Member

shoyer commented Aug 15, 2014

@alexbw You should check out our project xray, which has a Dataset object that is basically @cpcloud's RelationalDataFrame -- a bunch of multi-dimensional labeled arrays aligned along any number of common axes.

Our goal is pandas-like structures for N-dimensional data, though I should note that our approach avoids heterogeneous arrays and nested dtypes (way too complex in my opinion). Instead, you would make a bunch of homogeneous arrays with different sizes and put them in a Dataset.
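A minimal sketch of that pattern, using the current API of the project (since renamed xarray); the variable names and shapes are hypothetical:

```python
import numpy as np
import xarray as xr  # the xray project was later renamed xarray

n = 4
ds = xr.Dataset(
    {
        # scalar-per-frame stream: 1-D along the shared "time" axis
        "velocity": ("time", np.random.random(n)),
        # multi-dimensional stream: extra dims, but the same "time" axis
        "images": (("time", "y", "x"), np.zeros((n, 8, 8))),
    },
    coords={"time": np.arange(n)},
)

# one indexing operation slices every variable along the shared axis
first_two = ds.isel(time=slice(0, 2))
```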

@tangobravo

I'm a little confused by this ticket, but I think it's the right one for my issue. I'd really like to have a column in my data frame that represents, say, a 2D position or an affine matrix (i.e. 2×2). I like Pandas for the nice joining and selection operations, but it seems weird to me that DataFrame is not able to simply wrap a numpy structured array and offer that stuff on top.

Obviously for low-dimensional stuff I could always split the elements into separate series, but then would need to join them back together again for certain uses. I've played with h5py which is able to represent the data how I'd like as a structured numpy array, but it's frustrating I can't just construct a pandas DataFrame from that directly.

It seems to me that all of the pandas-level operations don't need to care that the dtype is not a scalar, all of the indexing/slicing/joining etc just needs to treat them as "values" in the series but maybe I'm missing something fundamental. I haven't got very deep in pandas yet and am still reviewing the docs so I'd appreciate a pointer if I'm missing something obvious.

@shoyer
Member

shoyer commented Nov 24, 2014

The problem @alexbw ran into in the first post here is that numpy, as far as I can tell, is not good about maintaining distinctions between multi-dimensional arrays and structured dtypes; i.e., np.zeros(2, dtype=np.dtype('(2,2)f8')) produces the exact same array as np.zeros((2,2,2)).
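This collapse is easy to verify: numpy expands the sub-array part of the dtype into extra trailing dimensions as soon as the array is created:

```python
import numpy as np

a = np.zeros(2, dtype=np.dtype('(2,2)f8'))  # sub-array dtype...
b = np.zeros((2, 2, 2))                     # ...vs a plain 3-D shape

# the sub-array dtype does not survive: both arrays come out identical
assert a.shape == b.shape == (2, 2, 2)
assert a.dtype == b.dtype == np.float64
```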

@tangobravo Pandas actually does allow you to put some structured dtypes in a series and do (at least some) basic alignment/indexing. For example:

>>> x = np.zeros(2, dtype='float, (2, 2)float')
>>> y = pd.Series(x, index=['a', 'b'])
>>> y.loc['a']
(0.0, [[0.0, 0.0], [0.0, 0.0]])

That said, you'll quickly run into lots of issues -- for example, repr(y) gives an error. Unfortunately, it's not so easy for pandas to be agnostic about the dtype of an ndarray. There are lots of operations where structured dtypes could really throw things off (e.g., handling missing values). If you really want to work on this, I expect patches would be accepted, but I don't think it would be a good idea for the pandas maintainers to take on responsibility for ensuring things with structured dtypes don't break again.

So, I would suggest either (1) putting your sub-arrays in 1-d arrays with dtype=object (this works with pandas) or (2) trying a package like my project xray, which has its own n-dimensional Series- and DataFrame-like types.
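Option (1) can be sketched like this; building the object array explicitly, cell by cell, keeps numpy from trying to broadcast the sub-arrays:

```python
import numpy as np
import pandas as pd

# build a 1-D object array whose elements are 2x2 sub-arrays
cells = np.empty(3, dtype=object)
for i in range(3):
    cells[i] = np.eye(2) * i

# basic alignment/indexing then work as usual
s = pd.Series(cells, index=['a', 'b', 'c'])
sub = s.loc['b']
```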

@tangobravo

@shoyer Thanks for the reply, and thanks for the example actually getting a structured dtype into a Series. There is obviously more complication than I realised in supporting this directly. I've had a quick look into xray and that certainly seems like a good solution for adding a bit more structure to n-d data.

Also apologies if my post came across as harsh, I really appreciate all the work done on pandas and it's a huge help in my work even without n-d "columns"!

@shoyer
Member

shoyer commented Nov 24, 2014

@tangobravo You also might take a look at astropy, which has its own table type that apparently allows for multi-dimensional columns. But I haven't tested it myself.

@alexbw
Author

alexbw commented Dec 23, 2014

Just wanted to give a follow-up on how I've dealt with this. I had two problems

  1. Efficiently store and retrieve large, structured datasets. Some aspects of the dataset are scalar (e.g. velocity at some timepoint), others are inherently multi-dimensional (e.g. an image at some timepoint). All share the same time index.
  2. Manipulate large structured datasets in memory, for analysis and plotting purposes.

I ended up explicitly writing to an HDF5 file using h5py for issue 1. The code ended up being a lot tighter than I had expected. In hindsight, I should have ditched Pandas' to_hdf function early on when I realized my requirements were out of Pandas' scope. By being explicit about how my data is structured in the HDF5 file, I can also take better advantage of compression. My saved files are an order of magnitude smaller, and reading and writing is much faster as well.
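A minimal sketch of that h5py approach (the file layout, names, and chunk sizes here are hypothetical, not alexbw's actual code): chunking along the time axis lets single frames be read back cheaply, and compression is applied per dataset where it pays off:

```python
import os
import tempfile

import h5py
import numpy as np

images = np.random.randint(0, 255, size=(100, 64, 64)).astype("u1")
velocity = np.random.random(100)

path = os.path.join(tempfile.mkdtemp(), "session.h5")

with h5py.File(path, "w") as f:
    # one chunk per frame, gzip-compressed chunk by chunk
    f.create_dataset("images", data=images,
                     chunks=(1, 64, 64), compression="gzip")
    f.create_dataset("velocity", data=velocity)

with h5py.File(path, "r") as f:
    frame = f["images"][42]  # reads a single chunk, not the whole dataset
```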

For issue 2, I ended up just using a dictionary of arrays. It sounds primitive, but I really didn't end up needing Pandas' powerful pivoting, imputation and indexing features for this project. To get the convenience of the dot syntax (e.g. df.velocity as opposed to df['velocity'], which is a huge boon when working interactively in the IPython notebook), I cobbled together this class, which just exposes dictionary elements as dot-gettable properties.

class Bunch(dict):
    """A dict whose keys are also reachable as attributes (db.velocity)."""
    def __init__(self, *args, **kw):
        dict.__init__(self, kw)
        # point the instance __dict__ at the dict itself, so attribute
        # access and item access share the same storage
        self.__dict__ = self
        if len(args) > 0:
            assert len(args) == 1 and isinstance(args[0], dict), \
                "Can either pass in a dictionary, or keyword arguments"
            self.__dict__.update(args[0])

    # make pickling round-trip the shared dict/__dict__ storage
    def __getstate__(self):
        return self

    def __setstate__(self, state):
        self.update(state)
        self.__dict__ = self

I didn't write it, I took pieces from around the internet.

The biggest unfortunate thing right now is that I have to index the elements; I can't index the structure itself. So I cannot do df[index].images; I have to do df.images[index].

The former style comes in handy when you need to chop a dataset up whole-hog for train/test/validation splits.
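One hypothetical way to get the df[index]-style whole-structure slicing back, extending the Bunch idea above (SliceableBunch and its take method are made up for illustration):

```python
import numpy as np

class SliceableBunch(dict):
    """Dict of arrays with attribute access, plus whole-structure
    slicing: take(index) slices every member along its first axis."""

    def __getattr__(self, name):
        return self[name]

    def take(self, index):
        return SliceableBunch({k: v[index] for k, v in self.items()})

db = SliceableBunch({
    "images": np.zeros((10, 8, 8)),
    "velocity": np.arange(10.0),
})

# train/test splits now slice the whole structure at once:
# db.take(index).images rather than db.images[index]
train = db.take(np.arange(8))
```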

@alexbw
Author

alexbw commented Dec 23, 2014

Also, if nobody objects, I'll close this issue. I think my original issue is solved, in that Pandas will not support arbitrary dtypes in Series.

@shoyer
Member

shoyer commented Feb 5, 2015

@alexbw I agree, I think this issue can be considered resolved -- this is not going to happen easily in pandas itself, and is probably better left to third-party packages -- pandas does not need more scope creep. That said, I might leave it open if only so that something turns up when people search open GitHub issues for "multidimensional".

Thanks also for sharing your approach. I know I'm repeating myself in this issue, but I'd like to note again for the record that each of your problems is something that xray is designed to solve (though it also tries to do more). Its Dataset object acts like your Bunch (I recently added attribute-style access for variables) but it does have support for simultaneous indexing of all variables. It also supports direct output to multi-dimensional netCDF4 files with optional chunking/compression, similar to what you accomplished with h5py (netCDF4 is a subtype of HDF5 with particular metadata conventions).

@alexbw
Author

alexbw commented Feb 5, 2015

I will check out xray. I'm currently a fan of the thinness of the Bunch approach, but the lack of a global index is annoying. I am also enjoying the efficiency of hand-tuned HDF5 data structures; I can go way beyond what pandas can do out of the box by paying close attention to how data is written. I am excited to see if xray helps automate that process (it definitely doesn't have to be as manual as I currently do it).

@jreback
Contributor

jreback commented Oct 5, 2016

Closing this, but feel free to comment on specific uses here, for pandas 2.0 designs.

@jreback jreback closed this as completed Oct 5, 2016
@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Oct 5, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: No action, Someday Oct 5, 2016