
[feature request] Add numpy.genfromtxt() #442

Closed
mw66 opened this issue Sep 21, 2022 · 32 comments

Comments

@mw66

mw66 commented Sep 21, 2022

https://forum.dlang.org/thread/recgifvrlcambwwkoobx@forum.dlang.org

Hi,

I'm just wondering: what is the best way to read a CSV data file into a Mir (2-D) ndslice? Especially if it can parse dates into int/float.

I searched a bit, but can't find any example.

Thanks.

So, can we add an equivalent of numpy.genfromtxt()?

https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html

Especially with column converters?

@mw66
Author

mw66 commented Sep 21, 2022

BTW, found this for reference:

https://github.com/Kriyszig/magpie

from_csv(string path, int indexDepth = 1, int columnDepth = 1, int[] columns = [], char sep = ',')

@mw66
Author

mw66 commented Sep 21, 2022

For Mir to catch up with numpy, being able to easily read CSV data is a must to attract data scientists.

In numpy/pandas, it's just a one-liner.

@9il
Member

9il commented Sep 22, 2022

mir-ion can deserialize Mir and common 2D arrays from Binary Ion, Text Ion, JSON, MsgPack. Plus YAML support is coming soon.

We just need to add CSV there.

void serializeFromCsv(S)(scope ref S serializer, scope const(char)[] csv, /+ other params +/)
{
    // On the fly:
    //  - parse the CSV
    //  - recognise cell patterns: integers, floats, timestamps, true, false, null; all others are strings
    //  - serializer.putValue(value)
}

If we have this function implemented, I will do the rest. The API will look like:

auto matrix = csvText.deserializeCsv!(Slice!(2, double*));

@jmh530
Contributor

jmh530 commented Sep 22, 2022

I don't really have a good sense of how to use mir.ion. I see that it references a cookbook, but it would be good to have a simple example written in D in the README.md.

@jmh530
Contributor

jmh530 commented Sep 22, 2022

auto matrix = deserializeCsv!(Slice!(2, double*));

Am I correct that the deserializeCsv would handle the transformation of what is read from a .csv file into the Slice type? It wouldn't handle the reading of the file itself.

@9il
Member

9il commented Sep 23, 2022

The fixed version should look like:

auto matrix = csvText.deserializeCsv!(Slice!(2, double*));

deserializeCsv will call serializeFromCsv to serialize the CSV to a binary Ion DOM (it is super fast), and then it will call deserializeValue, which is already implemented.

Mir always splits deserialization logic into two stages: first, data to binary Ion; second, binary Ion to value.

The idea is that binary Ion to value works fast and is unified across all supported formats.
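This two-stage split can be sketched in Python, with nested lists standing in for the binary Ion DOM (a rough analogy for illustration only; the function names here are made up, not Mir's API):

```python
import csv
import io

# Stage 1: format-specific front end -- CSV text to a generic DOM.
# (In Mir this is serializeFromCsv producing binary Ion; plain nested
# Python lists stand in for the Ion DOM in this sketch.)
def csv_to_dom(text):
    return list(csv.reader(io.StringIO(text)))

# Stage 2: format-agnostic back end -- DOM to the requested value type.
# (In Mir this is deserializeValue, shared across JSON, MsgPack, etc.)
def dom_to_matrix(dom):
    return [[float(cell) for cell in row] for row in dom]

matrix = dom_to_matrix(csv_to_dom("1,2\n3,4\n"))
```

Only stage 1 needs to be written per format; stage 2 is reused unchanged.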

@9il
Member

9il commented Sep 23, 2022

Assume row-major notation.

Then CSV should have the following options to be converted to Ion:

  • matrix
  • an array of records with inner keys in the first row
  • a record of arrays with outer keys in the first column
  • a record of records with inner keys in the first row and outer keys in the first column

These four conversion kinds allow converting CSV to Ion on the fly.

Also, a simple transposition with full memory allocation will allow four more conversions:

  • transposed matrix
  • an array of records with inner keys in the first column
  • a record of arrays with outer keys in the first row
  • a record of records with inner keys in the first column and outer keys in the first row
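For illustration, the first few conversion kinds map onto Python's stdlib csv module like this (an analogy for the shapes only, not Mir code):

```python
import csv
import io

text = "Date,Open\n2021-01-21,133.8\n2021-01-22,134.25\n"

# matrix: every row as a plain list
matrix = list(csv.reader(io.StringIO(text)))

# an array of records with inner keys taken from the first row
records = list(csv.DictReader(io.StringIO(text)))

# a record of arrays with outer keys taken from the first column
record_of_arrays = {row[0]: row[1:] for row in matrix[1:]}
```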

We can define the CSV algebraic as follows:

module mir.algebraic_alias.csv;

import mir.algebraic;

/++
CSV tagged algebraic alias.
+/
alias CsvAlgebraic = Algebraic!Csv_;

/++
Definition union for $(LREF CsvAlgebraic).
+/
union Csv_
{
    /// Used for an empty CSV scalar like the one between two separators: `,,`
    typeof(null) null_;
    /// Used for false, true, False, True, and friends. Follows YAML conversion
    bool boolean;
    ///
    long integer;
    ///
    double float_;
    ///
    immutable(char)[] string;
}

@9il
Member

9il commented Sep 23, 2022

@jmh530
Contributor

jmh530 commented Sep 23, 2022

I don't know if this applies or not, but one thing that is very useful in R's read.csv [1] function is the ability to identify certain strings as representing NAs. For instance, if you set it up so that "NA", "N/A", or "#N/A" (you could even use -999) are treated as NAs, then the entire column will be the floating-point type even if it originally contained some strings. That way you don't need to read the column as a string and process it later.

Not sure what other features would be useful, but that's one that sticks out.

[1] https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table
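A minimal stdlib-only Python sketch of that behaviour (the NA-string set is an assumption; R's read.csv exposes it as the na.strings argument):

```python
import csv
import io
import math

# Assumed set of NA-like markers, per the examples above.
NA_STRINGS = {"NA", "N/A", "#N/A", ""}

def read_csv_with_na(text, na_strings=NA_STRINGS):
    """Parse numeric CSV text, mapping NA-like cells to float('nan')."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append([math.nan if cell.strip() in na_strings else float(cell)
                     for cell in record])
    return rows

matrix = read_csv_with_na("1,2\n3,4\n5,#N/A\n")
```

The whole result stays floating point; no string post-processing pass is needed.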

@mw66
Author

mw66 commented Sep 23, 2022

Also check numpy.genfromtxt() and pandas.read_csv().

I think handling invalid/NaN values (including empty entries, i.e. ,,) and allowing the user to pass in column converter callbacks are the two most important features.

numpy.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=" !#$%&'()*+, -./:;<=>?@[\]^{|}~", replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes', *, ndmin=0, like=None)

https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

@mw66
Author

mw66 commented Sep 23, 2022

column converter callbacks

E.g. the user can plug in his own date-string-to-int/float converter function.
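For concreteness, here is a stdlib-only Python sketch of such a converter dict, mirroring the converters= parameter of numpy.genfromtxt (the function name and fallback-to-float behaviour are assumptions for this sketch):

```python
import csv
import io
from datetime import datetime

def parse_with_converters(text, converters):
    """Apply per-column converter callbacks; other cells fall back to float().

    `converters` maps a column index to a callable, like the
    converters= dict of numpy.genfromtxt.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append([converters.get(i, float)(cell)
                     for i, cell in enumerate(record)])
    return rows

# user-supplied converter: date string -> POSIX timestamp (a double)
str2date = lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timestamp()

rows = parse_with_converters(
    "2021-01-21 09:30:00,133.8,134.43,133.59,134.0,9166695\n",
    {0: str2date})
```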

@9il
Member

9il commented Sep 24, 2022

@mw66 I can't see a reason why we would even need to provide an in-column converter callback when we have the power of mir.algebraic, mir.ndslice, and mir.functional. It may be confusing, but the experience of people coming from scripting languages and my own are so different that it is hard to see why they want to do something one way when there is a 'common' way to do it.

Maybe we could share our experiences with each other.

Let's do the following. Please provide a CSV/TSV data sample and an example in any programming language of how you handle it. Then I will provide an example of how we can handle the data in Mir. When we do so it would be easy to figure out a good API.

@9il
Member

9il commented Sep 24, 2022

@mw66 I have updated the draft with callback support. Please check the first unittest.

@mw66
Author

mw66 commented Sep 24, 2022

why we may even need to provide an in-column converter callback

The reason is simple: the ndslice's element type is homogeneous; it's just double/float. But the input can be mixed with date strings, e.g.:

Date,Open,High,Low,Close,Volume
2021-01-21 09:30:00,133.8,134.43,133.59,134.0,9166695
2021-01-21 09:35:00,134.25,135.0,134.19,134.5,4632863

You need the converter at import time to convert the date string into a double, not afterwards (otherwise, how do you store strings in an ndslice mixed with floats?).

@mw66
Author

mw66 commented Sep 24, 2022

https://github.com/libmir/mir-ion/blob/master/source/mir/csv.d#L518

So, an ndslice can have mixed types for different columns? (Sorry if I'm missing something; I'm a newbie to Mir.)

@9il
Member

9il commented Sep 24, 2022

So, an ndslice can have mixed types for different columns? (Sorry if I'm missing something; I'm a newbie to Mir.)

No, ndslice can't. However, you aren't limited to an ndslice matrix of double. Five other options in Mir can do it:

  • WIP Series!(Timestamp*, double*, 2) support: mir.series (mir-algorithm) can store an index of one type and an ndslice data matrix of another type. mir-csv will be able to load the index from the first column.
  • Implemented Slice!(CsvAlgebraic*, 2) support: mir.algebraic.Algebraic (mir-core), including mir.algebraic_alias.csv (mir-ion), can store values of different types. So you could load an ndslice matrix of algebraic types and then process it.
  • Implemented Tuple!(Timestamp[], double[], double[], double[], double[], double[]) support: mir.functional.Tuple (mir-core, latest release), which is used in the example, is a kind of static array of different predefined types.
  • A struct of arrays:
import mir.timestamp: Timestamp;
import mir.serde: serdeKeys;
struct Data
{
    @serdeKeys("Date")
    Timestamp[] date;

    @serdeKeys("Open")
    double[] open;
    ...
}
  • Associative arrays of columns: CsvAlgebraic[][string]

And the callback for column conversion is implemented as you wanted.
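The struct-of-arrays option can be sketched in Python as a columnar load (the "Date" column name and its timestamp format are assumptions based on the sample data above; this mirrors the serdeKeys-annotated Data struct, not Mir's actual API):

```python
import csv
import io
from datetime import datetime

def struct_of_arrays(text):
    """Columnar CSV load: one typed array per column (sketch only)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    cols = {name: [] for name in header}
    for record in reader:
        for name, cell in zip(header, record):
            cols[name].append(cell)
    # Type the columns: the Date column as timestamps, the rest as doubles.
    cols["Date"] = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S")
                    for s in cols["Date"]]
    for name in header[1:]:
        cols[name] = [float(s) for s in cols[name]]
    return cols

data = struct_of_arrays(
    "Date,Open,High\n2021-01-21 09:30:00,133.8,134.43\n")
```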

@jmh530
Contributor

jmh530 commented Sep 24, 2022

I would want to do something like this:

auto text = "1,2\n3,4\n5,#N/A\n";
auto matrix = //TODO
matrix.should == [[1.0, 2], [3.0, 4], [5.0, double.nan]];

I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array.

With respect to column converters/classes, I think that is just a scripting-language way to try to enforce types: it tells the function how to process the text from the CSV. For a scripting language it is good to have when you need it, but if you have a type system then usually you would just say that a column is a particular type (provided that type is supported).

I think one difficulty is that this involves a lot of different functionality that doesn't have the best documentation or examples. Even if it is incredibly powerful, more work on that front may help reduce the burden on future users.

@9il
Member

9il commented Sep 24, 2022

I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array.

mir.ion.conv.serde!(mir.algebraic.Nullable!double[][])(text.Csv)

@9il
Member

9il commented Sep 24, 2022

I saw something with nulls in your UT, but I didn't think you could mix nulls with other data in an array.
mir.ion.conv.serde!(mir.algebraic.Nullable!double[][])(text.Csv)

@jmh530
The full example:

/// Converting NA to NaN
unittest
{
    import mir.csv;
    import mir.algebraic: Nullable, visit;
    import mir.ion.conv: serde;
    import mir.ndslice: Slice, map, slice;
    import mir.ser.text: serializeText;
    import mir.test: should;

    auto text = "1,2\n3,4\n5,#N/A\n";
    auto matrix = text
        .Csv
        .serde!(Slice!(Nullable!double*, 2))
        .map!(visit!((double x) => x, (_) => double.nan))
        .slice;

    matrix.serializeText.should == q{[[1.0,2.0],[3.0,4.0],[5.0,nan]]};
}

@mw66
Author

mw66 commented Sep 24, 2022

why we may even need to provide an in-column converter callback

The reason is simple: the ndslice's element type is homogeneous; it's just double/float. But the input can be mixed with date strings, e.g.:

Date,Open,High,Low,Close,Volume
2021-01-21 09:30:00,133.8,134.43,133.59,134.0,9166695
2021-01-21 09:35:00,134.25,135.0,134.19,134.5,4632863

You need the converter at import time to convert the date string into a double, not afterwards (otherwise, how do you store strings in an ndslice mixed with floats?).

BTW, here is the Python code, the converter is a dictionary keyed by column index/name:

import numpy as np
from datetime import datetime

str2date = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

data = np.genfromtxt('data.csv',dtype=None,names=True, delimiter=',', converters = {0: str2date})

@mw66
Author

mw66 commented Sep 24, 2022

For a scripting language it is good to have it when you need it, but if you have a type system then usually you would just tell it that it is a particular type (provided that type is supported).

Without converters in D, then how do you handle the date string in my above example? (The other columns are floats)

@9il
Member

9il commented Sep 24, 2022

Without converters in D, then how do you handle the date string in my above example?

mir.csv recognises that it is a timestamp. It does this for all input fields: it parses numbers, booleans, and timestamps.

@mw66
Author

mw66 commented Sep 24, 2022

Without converters in D, then how do you handle the date string in my above example?

mir.csv recognises that it is a timestamp. It does this for all input fields: it parses numbers, booleans, and timestamps.

Then it's a fixed format, but in real-life data you may encounter all kinds of formats, e.g.:

2022/09/24
2022.09.24
2022-09-24
...
24/09/2022
...
24-sep-2022
24-sept-2022
...
09/24/2022
09 24 2022
09/24/22
...
September/24/2022
...
The list can go on and on.

You'd better allow the user to plug in his own converters for his own data.

And this is just for dates; the data may contain all kinds of different strings that the user wants to convert to numbers in his own way.
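One stdlib-only way a user converter can cope with several date layouts is to try a list of formats in turn (the format list here is an illustrative subset of the layouts above; order matters for ambiguous layouts such as 24/09 vs. 09/24):

```python
from datetime import datetime

# Illustrative subset of the layouts listed above, tried in order.
FORMATS = ["%Y/%m/%d", "%Y.%m.%d", "%Y-%m-%d",
           "%d/%m/%Y", "%d-%b-%Y", "%m/%d/%Y"]

def parse_date(s):
    """Try each known format until one matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass
    raise ValueError("unrecognised date: " + repr(s))
```

A per-column converter dict lets the user plug exactly this kind of function in for exactly the columns that need it.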

@9il
Member

9il commented Sep 24, 2022

You'd better allow the user to plug in his own converters for his own data.

It is allowed now. Please check the conversionFinalizer.

@mw66
Author

mw66 commented Sep 24, 2022

You'd better allow the user to plug in his own converters for his own data.

It is allowed now. Please check the conversionFinalizer.

conversionFinalizer : (
            unquotedString,
            scalar,
            columnIndex,
            columnName)

So you pass the columnIndex and columnName out to let the user branch in his function and dispatch to different columns? This looks clunky and may incur code duplication, e.g. with two different CSVs where col-1 and col-2 are swapped:

In Python, you pass in a dict; it's two one-liners:

data1 = np.genfromtxt("data1.csv", ..., converters = {1: cvtr1, 2: cvtr2})
data2 = np.genfromtxt("data2.csv", ..., converters = {2: cvtr1, 1: cvtr2})

With conversionFinalizer in D:

conversionFinalizer1 (...) {
  switch (columnIndex) {
    case 1: return cvtr1(unquotedString);
    case 2: return cvtr2(unquotedString);
    default: assert(0);
  }
}

conversionFinalizer2 (...) {
  switch (columnIndex) {
    case 1: return cvtr2(unquotedString);
    case 2: return cvtr1(unquotedString);
    default: assert(0);
  }
}

too verbose.

Why not use the Python dictionary format and let Mir do such branching inside the library?
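The dict-based overload could indeed be a thin wrapper over the finalizer callback. A Python sketch of the idea (all names hypothetical; the real conversionFinalizer signature is in mir-ion):

```python
def finalizer_from_converters(converters, default=lambda s, scalar: scalar):
    """Build one conversionFinalizer-style per-cell callback out of a
    {column_index_or_name: converter} dict, so the dispatch lives in
    the library instead of in user code."""
    def finalizer(unquoted_string, scalar, column_index, column_name):
        cvtr = converters.get(column_index) or converters.get(column_name)
        if cvtr is not None:
            return cvtr(unquoted_string)
        # No converter registered: keep the library's parsed scalar.
        return default(unquoted_string, scalar)
    return finalizer

# The two-files example above collapses back to two one-liners:
finalizer1 = finalizer_from_converters({1: float, 2: str.strip})
finalizer2 = finalizer_from_converters({2: float, 1: str.strip})
```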

@9il
Member

9il commented Sep 24, 2022

It is less verbose in Python because it is a scripting language. If you think it will be less verbose in D, please give an example. But it should be a full-featured solution like the current one.

@mw66
Author

mw66 commented Sep 24, 2022

It has nothing to do with Python being a scripting language. It's about the API's function interface and about who (the library or the user) is responsible for dispatching the column converters.

An example in D can just follow the Python API:

double cvtr1(string str) {return ...;}
double cvtr2(string str) {return ...;}

data1 = mir.genfromtxt("data1.csv", ...,  [1: cvtr1, 2: cvtr2]);  // D does not have named args yet, so just use positional args
data2 = mir.genfromtxt("data2.csv", ...,  [2: cvtr1, 1: cvtr2]);  // pass in a D associative array

So why can't this code be implemented in the D library?

@9il
Member

9il commented Sep 24, 2022

It can be added, but that isn't a generalised solution. We could do an additional overload like that. Note that it can be just a wrapper around the verbose solution.

@mw66
Author

mw66 commented Sep 24, 2022

  1. I don't understand what you mean by "that isn't a generalised solution"; can you elaborate? I think passing the dictionary is as general as the current conversionFinalizer.

  2. Please do the overload. As long as the dispatching is inside the library code, the user's calling code will be tidy and succinct.

@9il
Member

9il commented Sep 24, 2022

  • I don't understand what you mean by "that isn't generalised solution", can you elaborate? I think passing the dictionary is as general as the current conversionFinalizer.

conversionFinalizer provides much more context for the user.

@mw66
Author

mw66 commented Sep 24, 2022

  • I don't understand what you mean by "that isn't generalised solution", can you elaborate? I think passing the dictionary is as general as the current conversionFinalizer.

conversionFinalizer provides much more context for user.

OK. But in my experience (in a data scientist role), I've never found that I needed that context. People are happy with the current converters-dictionary solution in both numpy.genfromtxt() and pandas.read_csv(), and I have never heard anyone complain that it's too restrictive.

The converters dict is succinct, and it also makes users coming from numpy & pandas feel at home, easing the transition to Mir.

@jmh530
Contributor

jmh530 commented Sep 24, 2022 via email
