[feature request] Add numpy.genfromtxt() #442
Comments
BTW, found this for reference: https://github.com/Kriyszig/magpie

Its CSV reader signature: from_csv(string path, int indexDepth = 1, int columnDepth = 1, int[] columns = [], char sep = ',')
For Mir to catch up with numpy, being able to easily read CSV data is a must to attract data scientists. In numpy/pandas, it's just a one-liner.
mir-ion can deserialize Mir and common 2D arrays from Binary Ion, Text Ion, JSON, and MsgPack. YAML support is coming soon. We just need to add CSV there.
If we have this function implemented I will do the rest. The API will look like:

auto matrix = csvText.deserializeCsv!(Slice!(2, double*));
I don't really have a good sense of how to use …
Am I correct that the …
The fixed version should look like:

auto matrix = csvText.deserializeCsv!(Slice!(double*, 2));
Mir always splits deserialization logic into two stages: first, data to binary Ion; second, binary Ion to value. The idea is that the binary-Ion-to-value stage is fast and unified across all kinds of formats.
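As a rough sketch of those two stages (my example, reusing the Csv wrapper and serde that appear later in this thread; treat the exact call shape as an assumption rather than the final API):

unittest
{
    import mir.csv : Csv;
    import mir.ion.conv : serde;
    import mir.ndslice : Slice;

    // Stage 1: the format-specific text is wrapped so it can be streamed as binary Ion.
    auto ion = "1,2\n3,4\n".Csv;
    // Stage 2: the unified binary-Ion-to-value step produces the typed result.
    auto matrix = ion.serde!(Slice!(double*, 2));
    assert(matrix == [[1.0, 2.0], [3.0, 4.0]]);
}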
Assume row-major notation. Then CSV should have the following options for conversion to Ion:
These four kinds of conversion allow converting CSV to Ion on the fly. Also, a simple transposition with full memory allocation would allow four other conversions:
We can define the CSV algebraic as follows:

module mir.algebraic_alias.csv;

import mir.algebraic;

/++
CSV tagged algebraic alias.
+/
alias CsvAlgebraic = Algebraic!Csv_;

/++
Definition union for $(LREF CsvAlgebraic).
+/
union Csv_
{
    /// Used for an empty CSV scalar, like the one between two separators: `,,`
    typeof(null) null_;
    /// Used for false, true, False, True, and friends. Follows YAML conversion.
    bool boolean;
    ///
    long integer;
    ///
    double float_;
    ///
    immutable(char)[] string;
}
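As a quick usage sketch (mine, not from the thread), such an algebraic can be collapsed to a plain double with mir.algebraic's visit, for example:

/// Collapse a CsvAlgebraic cell to double; nulls and unconverted strings become NaN.
double toDouble(CsvAlgebraic cell)
{
    import mir.algebraic : visit;
    return cell.visit!(
        (double x) => x,
        (long x) => cast(double) x,
        (bool x) => x ? 1.0 : 0.0,
        (_) => double.nan); // typeof(null) and string fall back to NaN
}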
I don't know if this applies or not, but one thing that is very useful in R's read.table [1] is … Not sure what other features would be useful, but that's one that sticks out.

[1] https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table
Also check … I think we need to handle invalid NaN values (including empty entries, i.e. …).

For reference, the numpy.genfromtxt signature:

numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=" !#$%&'()*+, -./:;<=>?@[\]^{|}~", replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes', *, ndmin=0, like=None)

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
E.g. the user can plug in their own date-string-to-int/float converter function.
@mw66 I can't see a reason why we would even need to provide a per-column converter callback while we have the power of …

Maybe we could share our experiences with each other. Let's do the following: please provide a CSV/TSV data sample and an example, in any programming language, of how you handle it. Then I will provide an example of how we can handle the same data in Mir. Once we do that, it will be easy to figure out a good API.
The reason is simple: an ndslice's element type is homogeneous (it's just double/float), but the input can be mixed with date strings, e.g.:
You need the converter at import time, to convert the date string into a double, not afterwards (otherwise, how would you store strings in an ndslice mixed with floats?).
https://github.com/libmir/mir-ion/blob/master/source/mir/csv.d#L518

So, ndslice can have mixed types for different columns? (Sorry if I'm missing something, I'm a newbie to Mir.)
No, ndslice can't. However, you aren't limited to having an ndslice matrix of double. Five other options in Mir can do it:
import mir.timestamp: Timestamp;
import mir.serde: serdeKeys;

// Column-oriented layout: each CSV column becomes a typed array field,
// matched to the CSV header name via @serdeKeys.
struct Data
{
    @serdeKeys("Date")
    Timestamp[] date;

    @serdeKeys("Open")
    double[] open;

    ...
}
And the callback for column conversion can be implemented however you want.
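For instance, a rough sketch of the intended use (my assumption, not verified against the current mir.csv API; it presumes the CSV is exposed column-wise so that serde can map header names onto the @serdeKeys fields):

// Sketch only: assumes a header row ("Date,Open,...") is mapped onto the
// @serdeKeys-annotated fields above; the exact Csv options needed for this are assumed.
auto data = csvText.Csv.serde!Data;
// data.date and data.open are then fully typed columns:
// data.date[i] is a Timestamp and data.open[i] is a double.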
I would want to do something like this:

auto text = "1,2\n3,4\n5,#N/A\n";
auto matrix = //TODO
matrix.should == [[1.0, 2], [3.0, 4], [5.0, double.nan]];

I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array.

With respect to column converters/classes, I think that is just a scripting-language way to enforce types: it tells the function how to process the text from the CSV. For a scripting language it is good to have when you need it, but if you have a type system then you would usually just tell it that the data is a particular type (provided that type is supported).

I think one difficulty is that this involves a lot of different functionality that doesn't have the best documentation or examples. Even if it is incredibly powerful, more work on that front may help reduce the burden on future users.
@jmh530

/// Converting NA to NaN
unittest
{
    import mir.csv;
    import mir.algebraic: Nullable, visit;
    import mir.ion.conv: serde;
    import mir.ndslice: Slice, map, slice;
    import mir.ser.text: serializeText;
    import mir.test: should;

    auto text = "1,2\n3,4\n5,#N/A\n";

    auto matrix = text
        .Csv
        .serde!(Slice!(Nullable!double*, 2))
        .map!(visit!((double x) => x, (_) => double.nan))
        .slice;

    matrix.serializeText.should == q{[[1.0,2.0],[3.0,4.0],[5.0,nan]]};
}
BTW, here is the Python code; the converter is a dictionary keyed by column index/name:
Without converters in D, how then do you handle the date string in my example above? (The other columns are floats.)
Then it's a fixed format, but in real-life data you may encounter all kinds of formats, e.g.:
You'd better allow the user to plug in their own converters for their own data. And this is just for dates; the data may contain all kinds of different strings that the user wants to convert to numbers in their own way.
It is allowed now. Please check the conversionFinalizer example.
So you pass the columnIndex and columnName out to let the user branch in their function and dispatch on different columns? This looks ugly and may incur code duplication, e.g. for two different CSVs with col-1 and col-2 swapped. In Python you just pass in a dict (two one-liners):
With conversionFinalizer in D:
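A rough sketch of the conversionFinalizer style being discussed; the callback's parameter list (cell text, parsed scalar, column index, column name) is assumed from the surrounding comments rather than taken from the mir.csv API, and parseVolume is a hypothetical user helper:

auto csv = Csv(text);
csv.conversionFinalizer = (unquotedString, scalar, columnIndex, columnName)
{
    import std.datetime.date : Date;
    // User-side dispatch: each column gets its own conversion rule.
    if (columnName == "Date")
        return CsvAlgebraic(cast(double) Date.fromISOExtString(unquotedString).dayOfGregorianCal);
    if (columnName == "Volume")
        return CsvAlgebraic(parseVolume(unquotedString)); // hypothetical helper
    return scalar;
};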
That is too verbose. Why not use the Python dictionary format and let Mir do such branching in the library?
It is less verbose in Python because it is a scripting language. If you think it can be less verbose in D, please give an example, but it should be a full-featured solution like the current one.
It has nothing to do with Python being a scripting language. It's about the API's function interface, and who (the library or the user) is responsible for dispatching the column converters. An example in D can just follow the Python API:
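For comparison, a hypothetical sketch of the dictionary-style API being proposed; none of these names exist in mir.csv today (CellConverter, readCsvMatrix, and the helpers are all assumptions), and CsvAlgebraic is the algebraic proposed earlier in this thread:

// Converters keyed by column name; dispatching would happen inside the library.
alias CellConverter = CsvAlgebraic function(const(char)[] cell);

// Hypothetical user helpers standing in for real parsing logic.
double parseDateAsDouble(const(char)[] cell) { return 0; }
long parseVolume(const(char)[] cell) { return 0; }

unittest
{
    auto csvText = "Date,Open,Volume\n2022-09-23,150.0,\"1,234\"\n";
    CellConverter[string] converters = [
        "Date":   (const(char)[] cell) => CsvAlgebraic(parseDateAsDouble(cell)),
        "Volume": (const(char)[] cell) => CsvAlgebraic(parseVolume(cell)),
    ];
    // Hypothetical library entry point that applies the converters per column:
    // auto matrix = readCsvMatrix(csvText, converters);
}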
So why can't this code be implemented in the D library?
It can be added, but that isn't a generalised solution. We could do an additional overload like that. Note that it can be just a wrapper around the verbose solution.
I don't understand what you mean by "that isn't a generalised solution", can you elaborate? I think passing the dictionary is as general as the current conversionFinalizer.
conversionFinalizer provides much more context for the user.
OK. But in my experience (in a data-scientist role), I've never found I need that context. People are happy with the current converters-dictionary solution that is in both numpy.genfromtxt() and pandas.read_csv(), and I've never heard other people complain that it's too restricted. The converters dict is succinct; it also makes users coming from numpy and pandas feel at home and makes the transition to Mir easier.
I think it's good to have all these tools laid out, but it could be good to add a readCSV or equivalent function to Mir in order to handle common use cases.
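A minimal sketch of what such a convenience function could look like, reusing the Csv-plus-serde pipeline shown above; the name readCsv is an assumption, and real code would want error handling:

import mir.csv : Csv;
import mir.ion.conv : serde;
import mir.ndslice : Slice;
import std.file : readText;

// Hypothetical convenience wrapper for the common case: CSV file -> dense matrix of double.
Slice!(double*, 2) readCsv(string path)
{
    return path.readText.Csv.serde!(Slice!(double*, 2));
}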
https://forum.dlang.org/thread/recgifvrlcambwwkoobx@forum.dlang.org
Hi,
I'm just wondering what the best way is to read a CSV data file into a Mir (2D array) ndslice, especially if it can parse dates into int/float.
I searched a bit, but couldn't find any example.
Thanks.
So, can we add a numpy.genfromtxt() equivalent?
https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html
Especially with column converters?