Support for multidimensional dtypes #3443
Comments
My suspicion is that you're going for more complex instead of simplifying.
|
@alexbw but keep in mind, this is really not efficient (in numpy or pandas), as this is now an object array, which cannot vectorize things. You are much better off keeping your scalar data separate from your images. (I believe we went through an exercise in saving these to/from HDF a while ago). |
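To make the object-array point concrete, here is a minimal sketch. The image sizes and variable names are assumptions chosen purely for illustration; this is not code from the thread. It contrasts an object-dtype array that holds one ndarray per element with a single stacked array of homogeneous dtype.

```python
import numpy as np

# Hypothetical data: 100 "images" of 8x8 pixels (sizes are illustrative only).
rng = np.random.default_rng(0)
images = [rng.random((8, 8)) for _ in range(100)]

# Object array: each element is a reference to a separate ndarray.
obj_arr = np.empty(len(images), dtype=object)
for i, img in enumerate(images):
    obj_arr[i] = img

# Homogeneous array: one contiguous float64 block of shape (100, 8, 8).
stacked = np.stack(images)

# Per-image mean: the object array needs a Python-level loop ...
means_obj = np.array([img.mean() for img in obj_arr])
# ... while the stacked array performs the reduction in compiled code.
means_vec = stacked.mean(axis=(1, 2))

print(obj_arr.dtype, stacked.dtype, stacked.shape)
```

Both routes give the same numbers; the difference is only where the loop runs, which is what "not able to vectorize" refers to here.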
Yep, your suggestions are now in production here, and it's working fine.
|
I don't know if i pointed this out before, this might work for you: http://pandas.pydata.org/pandas-docs/dev/dsintro.html#panel4d-experimental (again the data doesn't have to be homogenous in dtype, but homogeneous in shape, which may or may not help) |
Each data stream (image, velocity, temperature) is homogeneous within On Wed, Apr 24, 2013 at 10:52 AM, jreback notifications@github.com wrote:
|
Heterogeneous data shapes are non-trivial. Blaze does seem headed in that direction, but I'm not sure when that will happen. |
Ok. And here, by "non-trivial", do you mean that Pandas has no plans to On Wed, Apr 24, 2013 at 11:03 AM, jreback notifications@github.com wrote:
|
What I mean is that to be efficient about it you would have to have a structure that is essentially a dictionary of 'stuff', where the stuff could be heterogeneously shaped. This is really best done in a class, and it is application-specific. You can do it with a frame/panel/whatever as @y-p shows above, but it is not 'efficient', in that numpy holds onto the dtype as 'object'. When I say efficient I mean that you can move operations down to a lower level (C level) in order to do things. I am not even sure Blaze will do this; it's really quite specific (they are about supporting chunks and operating on those, but those chunks actually have the same shape, except for the one dimension where the chunking occurs). There is a tradeoff between code complexity, runtime efficiency and generality. You basically have to choose where you are on that 3-d surface. Pandas has moderate complexity and generality and high runtime efficiency. I would say numpy has lower complexity and lower generality with a similar runtime efficiency. I would guess that Blaze is going to be more complex, with higher efficiency (in cases of out-of-core datasets) and about the same generality as numpy (as they are aiming to replace numpy). So even if someone had the urge to create what you are describing, they would have to create a new structure to hold it. It comes down to what your bottlenecks are; maybe getting more specific will help. |
@jreback Just out of curiosity, what is the long-term goal of pandas in this vein? If Blaze is to replace numpy, will pandas diverge from numpy altogether, or will it use Blaze as the backend? I see talk about making Series independent of ndarray in the near future for pickle support and other reasons. |
I don't see pandas as incompatible with Blaze at all. My understanding (just from reading the blog) is that Blaze is supposed to be the next-gen numpy. I think their API will necessarily be very similar to what it is now, and thus be pretty transparent to pandas. My concerns now are the availability and especially the compatibility of their product, as it has a fairly complicated build scheme. I think pandas fills a somewhat higher-level view of data right now (and will continue to do so). As far as your specific comments, I have pushed a PR to decouple Series from ndarray (Index also needs this addressed). Thus it will be somewhat easier to modify pandas' backend without front-end (API) visibility (so this is a good thing). Supporting arbitrary dshapes within pandas' existing objects is, IMHO, not that useful right now. |
@wesm chime in? |
Any chance at all of this seeing some love? |
I think something like a In this case you could have an object where all objects share the "video frame axis" possibly more if needed. maybe it could have a |
@alexbw this is pretty non-trivial, mainly because numpy doesn't support it (ATM), though Blaze is supposed to. @cpcloud has a nice idea, essentially an object to hold |
I could maybe see this being implemented by a generalization of |
@cpcloud I really like this idea of a |
I have thought about the nested dtype problem and how pandas could offer a solution for that. It's tricky because it doesn't really fit with the DataFrame data model and implementation. In some sense what is needed is a more rigid table data structure that sits someplace in between NumPy structured arrays and DataFrame. I have actually been building something like this in recent months but I will not be able to release the source code for a while. |
@wesm torture! |
@wesm Looking forward to it, when it's ready.
|
Any thoughts on this, @cpcloud ? |
@alexbw You should check out our project xray, which has a Our goal is pandas-like structures for N-dimensional data, though I should note that our approach avoids heterogeneous arrays and nested dtypes (way too complex in my opinion). Instead, you would make a bunch of homogeneous arrays with different sizes and put them in a |
I'm a little confused by this ticket, but I think it's the right one for my issue. I'd really like to have a column in my data frame that represents, say, a 2D position or an affine matrix (i.e. 2x2). I like Pandas for the nice joining and selection operations, but it seems weird to me that DataFrame is not able to simply wrap a numpy structured array and offer that stuff on top. Obviously for low-dimensional stuff I could always split the elements into separate series, but then I would need to join them back together again for certain uses. I've played with h5py, which is able to represent the data how I'd like as a structured numpy array, but it's frustrating that I can't just construct a pandas DataFrame from that directly. It seems to me that the pandas-level operations don't need to care that the dtype is not a scalar; all of the indexing/slicing/joining etc. just needs to treat them as "values" in the series, but maybe I'm missing something fundamental. I haven't got very deep into pandas yet and am still reviewing the docs, so I'd appreciate a pointer if I'm missing something obvious. |
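For the low-dimensional case mentioned above, the split-and-rejoin workaround might look like the following sketch. The column names (`pos_x`, `pos_y`, `label`) and the data are hypothetical; the point is only that the components live in ordinary scalar columns and are reassembled into an `(n, 2)` array when needed.

```python
import numpy as np
import pandas as pd

# Hypothetical 2D positions, one (x, y) pair per row.
positions = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])

# Split the components into ordinary scalar columns.
df = pd.DataFrame({
    "label": ["a", "b", "c"],
    "pos_x": positions[:, 0],
    "pos_y": positions[:, 1],
})

# Selection works as usual on the scalar columns ...
subset = df[df["pos_x"] > 1.0]

# ... and the components can be reassembled into an (n, 2) array when needed.
subset_positions = subset[["pos_x", "pos_y"]].to_numpy()
print(subset_positions)
```

This keeps pandas' joining and selection available, at the cost of the extra reassembly step the comment describes.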
The problem @alexbw ran into in the first post here is that numpy (as far as I can tell) is not good about maintaining distinctions between multi-dimensional arrays and structured dtypes, i.e., @tangobravo Pandas actually does allow you to put some structured dtypes in a series and do (at least some) basic alignment/indexing. For example:
That said, you'll quickly run into lots of issues -- for example, So, I would suggest either (1) putting your sub arrays in 1-d arrays with dtype=object (this works with pandas) or (2) trying a package like my project xray which has its own n-dimensional series and dataframe like types. |
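Option (1) can be sketched as follows. This is an illustrative example with made-up matrices and index labels, not code from the thread. The object array is filled element by element so that numpy does not try to merge the equal-shaped matrices into one 3-D array.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 matrices to store, one per row.
mats = [np.eye(2), 2 * np.eye(2), np.array([[0.0, -1.0], [1.0, 0.0]])]

# Build a 1-d object array explicitly, one matrix per element.
col = np.empty(len(mats), dtype=object)
for i, m in enumerate(mats):
    col[i] = m

s = pd.Series(col, index=["identity", "scale", "rotate"], name="affine")

# Label-based selection treats each matrix as an opaque value.
picked = s.loc[["scale", "rotate"]]

# Stack back into a regular ndarray when numeric work is needed.
stacked = np.stack(picked.to_list())
print(s.dtype, stacked.shape)
```

Indexing and alignment work because pandas only moves object references around; any arithmetic on the matrices still has to happen after stacking them back into a numeric array.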
@shoyer Thanks for the reply, and thanks for the example of actually getting a structured dtype into a Series. There is obviously more complication than I realised in supporting this directly. I've had a quick look into xray and that certainly seems like a good solution for adding a bit more structure to n-d data. Also, apologies if my post came across as harsh; I really appreciate all the work done on pandas and it's a huge help in my work even without n-d "columns"! |
@tangobravo You also might take a look at astropy, which has its own table type that apparently allows for multi-dimensional columns. But I haven't tested it myself. |
Just wanted to give a follow-up on how I've dealt with this. I had two problems
I ended up explicitly writing to an HDF5 file using h5py for issue 1. The code ended up being a lot tighter than I had expected. In hindsight, I should have ditched Pandas' For issue 2, I ended up just using a dictionary of arrays. It sounds primitive, but I really didn't end up needing Pandas' powerful pivoting, imputation and indexing features for this project. To get the convenience of the dot-syntax (e.g. df.velocity as opposed to df['velocity'], which is a huge boon when working interactively in the IPython notebook), I cobbled together this class, which just exposes dictionary elements as dot-gettable properties.
I didn't write it, I took pieces from around the internet. The biggest unfortunate thing right now is that I have to index the elements, I can't index the structure itself. So, I cannot do
The former style comes in handy when you need to chop a dataset up whole-hog for train/test/validation splits. |
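The class from the comment above did not survive in this copy of the thread. As a rough sketch of the same idea (not the author's code), a dict subclass can expose its keys as attributes, and a small helper can apply one index to every array at once, which is the whole-structure indexing described as missing. The names `ArrayBunch` and `take` are hypothetical, and the sketch assumes every array shares its first axis.

```python
import numpy as np


class ArrayBunch(dict):
    """Hypothetical dict of arrays that all share their first axis."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def take(self, index):
        """Apply the same first-axis index to every array; return a new ArrayBunch."""
        return ArrayBunch({key: value[index] for key, value in self.items()})


# Illustrative data: 10 frames of a scalar stream and a 4x4 image stream.
data = ArrayBunch(
    velocity=np.arange(10.0),
    images=np.zeros((10, 4, 4)),
)

# Chop the whole structure for a train/held-out split.
train = data.take(slice(0, 8))
held_out = data.take(slice(8, None))
print(train.velocity.shape, train.images.shape)
```

Using a method rather than overriding `__getitem__` keeps ordinary key lookup (`data['velocity']`) unambiguous; that is a design choice of this sketch, not something taken from the thread.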
Also, if nobody objects, I'll close this issue. I think my original issue is solved, in that Pandas will not support arbitrary dtypes in Series. |
@alexbw I agree, I think this issue can be considered resolved -- this is not going to happen easily in pandas itself, and is probably better left to third party packages -- pandas does not need more scope creep. That said, I might leave it open if only so that something turns up when people search open GitHub issues for "multidimensional". Thanks also for sharing your approach. I know I'm repeating myself in this issue, but I'd like to note again for the record that each of your problems is something that xray is designed to solve (though it also tries to do more). Its |
I will check out Xray. I'm currently a fan of the thinness of the Bunch
|
closing this, but can comment on specific uses here, for pandas 2.0 designs. |
With 0.11 out, Pandas supports more dtypes than before, which is very useful to us science folks. However, some data is intrinsically multi-dimensional, with high enough dimensionality that using labels on columns is impractical (for instance, images).
I understand DataFrames or Panels are usually the recommended panacea for this problem. This works if the datatype doesn't have any annotation. For instance, for each frame of a video, I have electrophysiology traces, timestamps, and environmental variables measured.
I have a working solution where I explicitly separate out the non-scalar data from the scalar data. I use Pandas exclusively for the scalar data, and then a dictionary of multi-D arrays for the array data.
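A minimal sketch of that split, assuming hypothetical stream names and sizes (the original code is not included in this copy of the thread): scalar per-frame values go in a DataFrame, array-valued streams go in a plain dict whose arrays use the frame number as their first axis, and a boolean mask computed in pandas is reused on the arrays.

```python
import numpy as np
import pandas as pd

n_frames = 5  # hypothetical number of video frames

# Scalar measurements per frame live in a DataFrame.
scalars = pd.DataFrame({
    "timestamp": np.arange(n_frames) * 0.033,
    "temperature": [20.1, 20.3, 20.2, 20.6, 20.4],
})

# Array-valued measurements live in a plain dict; first axis = frame number.
arrays = {
    "image": np.zeros((n_frames, 4, 4)),
    "ephys": np.ones((n_frames, 100)),
}

# Select frames with pandas, then reuse the boolean mask on the arrays.
mask = (scalars["temperature"] > 20.2).to_numpy()
selected_scalars = scalars[mask]
selected_arrays = {name: arr[mask] for name, arr in arrays.items()}
print(len(selected_scalars), selected_arrays["image"].shape)
```

The two containers stay in sync only as long as every selection is applied to both, which is the bookkeeping burden the question is trying to avoid.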
What is the work and overhead involved in supporting multi-D data types? I would love to keep my entire ecosystem in Pandas, as it's much faster and richer than just NumPy data wrangling.
See below for the code that I hope is possible to run, with fixes.
If you can point me to a place in the codebase where I can tinker, that would also be much appreciated.