Question : Exploration through Fsharp + DataFrame types #3692
Comments
@JAYGEM,
@JAYGEM - Check out the discussion at https://github.com/dotnet/corefx/issues/26845. We are working on a prototype of a .NET DataFrame type in corefxlab: https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data, and some of the PRs happening in that space are here: dotnet/corefxlab#2656. We would really appreciate any feedback/contributions/etc. in this space. If you'd like to check it out, please let us know if you find it useful.
Yup, we're looking for feedback in this space. Please feel free to comment on the original issue or on the PRs. If you deal with pandas/dataframes/data science every day, it would be super helpful to get your input on how the DataFrame type is shaping up.
My apologies if this is the wrong thread, but I'm not sure whether extending the CoreFX discussion is appropriate for this, and maybe it adds some perspective here. The Arrow data mentioned in the long thread does indeed try to work around problems in data presentation. The larger problem in the scientific community, as far as I understand, is object storage, new data, and how to stream it effectively when the formats are, well, what they are. So in that sense array and time series data and streaming of data are the way to go. I would also like to draw a bit of attention to Zarr, for example https://medium.com/pangeo/continuously-extending-zarr-datasets-c54fbad3967d
As an example, the new ESA datahub works around this a bit: although the files are roughly 100 MiB chunks of netCDF (with the data organized as HDF5 inside them, I think), they have an OData API that allows slicing inside those files to retrieve specific dimensions over given time ranges. The dimensions are vectors of values in binary, and usually some other vector is needed to make sense of the data (e.g. points in time, coordinates). It looks to me like some people are coming across a similar kind of problem when using Orleans: new data is generated, it needs to be stored/appended, and hot data needs to be separated from cold data (though occasionally one fetches cold data too). Then some processing handles the stream data, and there are considerations about using AI too. Also interesting might be https://medium.com/pangeo/step-by-step-guide-to-building-a-big-data-portal-e262af1c2977 .
I've been working on a DataFrame library for F# using the DLR:
Thanks everyone for the discussion and feedback. Since the last comments on this thread, we've introduced the Microsoft.Data.Analysis library, which brought DataFrames to .NET. We plan on making improvements to the library and are tracking feedback and progress in issue #6144. For samples on using the library, check out the sample notebook. Closing this issue.
Will ML.NET have an API for data exploration?
By data exploration I mean statistics, selections, and filters on dataframe-like objects.