Question : Exploration through Fsharp + DataFrame types #3692

Closed
ykafia opened this issue May 9, 2019 · 6 comments
Labels
P3 Doc bugs, questions, minor issues, etc. question Further information is requested


@ykafia

ykafia commented May 9, 2019

Will ML.NET have an API for data exploration?

By data exploration I mean statistics, selections, and filters on DataFrame-like objects.

@glebuk
Contributor

glebuk commented May 9, 2019

@JAYGEM,
ML.NET is a general machine learning framework, similar to Scikit-Learn, not a data manipulation library like Pandas. Its inputs are a file, SQL, an IDataView, or an IEnumerable<T>. It is designed from the ground up to work on streaming data. As a result, it's not really designed to be equivalent to Pandas in Python. There are other options in C# for this, such as LINQ and perhaps some third-party libraries such as Deedle (I have not used the latter, so I have no opinion on it).
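
To illustrate the streaming point, here is a minimal sketch in Python (standard library only, chosen purely for illustration). It is not ML.NET, LINQ, or Pandas code; the `Row` type, the `read_rows` and `streaming_mean` helpers, and the sample data are all hypothetical. It shows a filter plus an aggregate computed in a single pass over a lazily produced sequence, without ever building an in-memory table.

```python
from typing import Iterable, Iterator, NamedTuple


class Row(NamedTuple):
    """Hypothetical record type, standing in for one row of input."""
    city: str
    temperature: float


def read_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Lazily parse 'city,temperature' lines, yielding one row at a time."""
    for line in lines:
        city, temperature = line.strip().split(",")
        yield Row(city, float(temperature))


def streaming_mean(rows: Iterable[Row]) -> float:
    """Compute the mean temperature in one pass, keeping only running totals."""
    count = 0
    total = 0.0
    for row in rows:
        count += 1
        total += row.temperature
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count


# Made-up sample input; in practice this could be an open file object.
lines = ["Oslo,4.0", "Cairo,30.0", "Lima,20.0", "Oslo,6.0"]

# The filter is a generator expression, so rows are consumed as they are parsed.
oslo_only = (row for row in read_rows(lines) if row.city == "Oslo")
oslo_mean = streaming_mean(oslo_only)
print(oslo_mean)
```

A DataFrame-style library typically holds all columns in memory so that rows can be revisited, sorted, or joined freely; the single-pass style above trades that flexibility for a small, constant memory footprint.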

@glebuk added the "answered" and "question" (Further information is requested) labels May 9, 2019
@eerhardt
Member

eerhardt commented May 9, 2019

@JAYGEM - Check out the discussion at https://github.com/dotnet/corefx/issues/26845. We are working on a prototype of a .NET DataFrame type in corefxlab:

https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data and some of the PRs happening in that space:

dotnet/corefxlab#2656
dotnet/corefxlab#2660

We would really appreciate any feedback/contributions/etc in this space. If you'd like to check it out, please let us know if you find it useful.
cc @pgovind

@pgovind

pgovind commented May 10, 2019

Yup, we're looking for feedback in this space. Please feel free to comment on the original issue or on the PRs. If you deal with pandas/dataframes/data science every day, your input on how the DataFrame type is shaping up would be super helpful too.

@veikkoeeva
Contributor

veikkoeeva commented May 18, 2019

My apologies if this is the wrong thread, but I'm not sure whether extending the CoreFX discussion is appropriate for this, and maybe there is some perspective to add here. The Arrow data format mentioned in the long thread indeed tries to work around problems in data presentation. The larger problem in the scientific community, as far as I understand, is object storage, new data, and how to stream it effectively when the formats are, well, what they are.

So in that sense array and time series data and streaming of data are the way to go. I would like to draw a bit of attention to Zarr too, as described at https://medium.com/pangeo/continuously-extending-zarr-datasets-c54fbad3967d

The Pangeo Project has been exploring the analysis of climate data in the cloud. Our preferred format for storing data in the cloud is Zarr, due to its favorable interaction with object storage. Our first Zarr cloud datasets were static, but many real operational datasets need to be continuously updated, for example, extended in time. In this post, we will show how we can play with Zarr to append to an existing archive as new data becomes available.
The problem with live data

Earth observation data which originates from e.g. satellite-based remote sensing is produced continuously, usually with a latency that depends on the amount of processing that is required to generate something useful for the end user. When storing this kind of data, we obviously don’t want to create a new archive from scratch each time new data is produced, but instead append the new data to the same archive. If this is big data, we might not even want to stage the whole dataset on our local hard drive before uploading it to the cloud, but rather directly stream it there. The nice thing about Zarr is that the simplicity of its store file structure allows us to hack around and address this kind of issue. Recent improvements to Xarray will also ease this process.
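
The append idea in the quoted passage can be sketched with a toy chunked store, written in Python with the standard library only. This is not the real Zarr format or API: the `meta.json` and `chunk.N.json` file names and the `create_store`, `append`, and `read_all` helpers are hypothetical, only loosely modeled on the idea of a metadata file plus independent chunk files. The point it tries to show is that appending touches only the last partial chunk and the metadata, so already written chunks never need to be re-uploaded.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Sequence


def create_store(root: Path, chunk_len: int) -> None:
    """Create an empty store: a directory holding one small metadata file."""
    root.mkdir(parents=True, exist_ok=True)
    meta = {"length": 0, "chunk_len": chunk_len}
    (root / "meta.json").write_text(json.dumps(meta))


def append(root: Path, values: Sequence[float]) -> None:
    """Append values by extending the last chunk or starting new chunk files."""
    meta_path = root / "meta.json"
    meta = json.loads(meta_path.read_text())
    chunk_len = meta["chunk_len"]
    for value in values:
        # Full chunks are never reopened; only the trailing chunk can change.
        index = meta["length"] // chunk_len
        chunk_path = root / f"chunk.{index}.json"
        chunk = json.loads(chunk_path.read_text()) if chunk_path.exists() else []
        chunk.append(value)
        chunk_path.write_text(json.dumps(chunk))
        meta["length"] += 1
    # Update the recorded length last, once the chunk data is on disk.
    meta_path.write_text(json.dumps(meta))


def read_all(root: Path) -> List[float]:
    """Read every chunk back in order and concatenate the values."""
    meta = json.loads((root / "meta.json").read_text())
    n_chunks = -(-meta["length"] // meta["chunk_len"])  # ceiling division
    values: List[float] = []
    for index in range(n_chunks):
        values.extend(json.loads((root / f"chunk.{index}.json").read_text()))
    return values


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "series"
    create_store(store, chunk_len=3)
    append(store, [1.0, 2.0, 3.0, 4.0])  # initial archive
    append(store, [5.0])                 # new data arrives later
    result = read_all(store)
    chunk_files = sorted(p.name for p in store.glob("chunk.*.json"))
    print(result, chunk_files)
```

In a real object store each chunk file would be a separate object, which is why this kind of layout interacts well with cloud storage: an append becomes a few small writes rather than a rewrite of the whole archive.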

As an example, the new ESA datahub works around this a bit: although the files are roughly 100 MiB chunks of netCDF (with the data organized as HDF5 inside them, I think), it has an OData API that allows slicing inside those files to retrieve specific dimensions over given time ranges. The dimensions are vectors of values in binary; usually some other vector is needed to make sense of the data (e.g. points in time, coordinates).

It looks to me like some people run into a similar kind of problem when using Orleans: new data is generated, it needs to be stored/appended, and hot data needs to be separated from cold data (though occasionally one fetches cold data too). Then there is some processing that handles the stream data, and considerations about using AI too.

Also interesting might be https://medium.com/pangeo/step-by-step-guide-to-building-a-big-data-portal-e262af1c2977 .

@allisterb

I've been working on a DataFrame library for F# using the DLR:
https://notebooks.azure.com/allisterb/projects/sylvester/html/Sylvester.DataFrame.ipynb

@gvashishtha gvashishtha added the P3 Doc bugs, questions, minor issues, etc. label Jan 9, 2020
@luisquintanilla
Contributor

Thanks everyone for the discussion and feedback. Since the last comments on this thread, we've introduced the Microsoft.Data.Analysis library which brought DataFrames to .NET.

We plan on making improvements to the library and are tracking feedback and progress in issue #6144.

For samples on using the library, check out the sample notebook.

Closing this issue.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 21, 2022

9 participants