Added fast C based direct CSV-to-matrix functionality with options #23

anshuman23 · 2018-07-17T21:16:50Z

@josevalim, this function is very fast: I compared it's execution time with that of Pandas' read_csv for the MNIST test dataset (test.csv) (source) and Pandas took 2.5492329597473145 seconds while Tensorflex's load_csv_as_matrix/2 took 1.711494 seconds.

Usage is simple, and gives the user the option of passing in if header is not present (defaults to header being present), and also the ability of choosing the delimiter (defaults to ','). I have added two sample CSV files in the test folder around which I will write tests soon, but for now I am using here to showcase the usage. Also, the test.csv file I use in the example code below belongs to the MNIST Kaggle CSV data (source), which contains 28000 rows and 784 columns. It is comma delimited and also contains a header. From that, I created a custom file without the header present which I will refer to as test_without_header.csv in the examples below. I have not put the file on Github since it is relatively large in size.

Code usage:

iex(1)> mat = Tensorflex.load_csv_as_matrix("test.csv")
%Tensorflex.Matrix{
  data: #Reference<0.4024686574.590479361.258459>,
  ncols: 784,
  nrows: 28000
}

iex(2)> Tensorflex.matrix_pos mat, 5,97
80.0

iex(3)> Tensorflex.matrix_pos mat, 5,96
13.0

On a visual inspection of the very large test.csv file, one can see that the values in these particular positions are correct. Now I will show usage for the same file but without header:

iex(1)> n = Tensorflex.load_csv_as_matrix("test/test_without_header.csv", header: :false)    
%Tensorflex.Matrix{
  data: #Reference<0.4024686574.590479364.257078>,
  ncols: 784,
  nrows: 28000
}

iex(2)> Tensorflex.matrix_pos n,5,97
80.0
iex(3)> Tensorflex.matrix_pos n,5,96
13.0

Next for the sample CSV files sample2.csv and sample1.csv, I'll show that the function does the job. Here, it is also reasonable to use the Tensorflex.matrix_to_lists/1 function for the matrix because the data is not high dimensional. It is not advisable to do that for the above mentioned MNIST files:

iex(1)> m1 = Tensorflex.load_csv_as_matrix("test/sample1.csv", header: :false) 
%Tensorflex.Matrix{
  data: #Reference<0.3878093040.3013214209.247502>,
  ncols: 5,
  nrows: 3
}
iex(2)> Tensorflex.matrix_to_lists m1
[
  [1.0, 2.0, 3.0, 4.0, 5.0],
  [6.0, 7.0, 8.0, 9.0, 10.0],
  [11.0, 12.0, 13.0, 14.0, 15.0]
]

iex(3)> m2 = Tensorflex.load_csv_as_matrix("test/sample2.csv", header: :true, delimiter: "-")
%Tensorflex.Matrix{
  data: #Reference<0.4024686574.590479361.258952>,
  ncols: 4,
  nrows: 3
}

iex(4)> Tensorflex.matrix_to_lists m2
[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

Also incorrect usage will raise:

iex(1)> n = Tensorflex.load_csv_as_matrix("../keras-mnist/testh.csv", header: :no_header, delimiter: ",")
** (ArgumentError) header indicator atom must be either :true or :false 
    (tensorflex) lib/tensorflex.ex:122: Tensorflex.load_csv_as_matrix/2

josevalim · 2018-07-17T21:20:34Z

c_src/Tensorflex.c

+  unsigned int header_atom_len;
+  enif_get_atom_length(env, argv[1], &header_atom_len, ERL_NIF_LATIN1);
+  char* header_atom = (char*)enif_alloc(header_atom_len + 1);
+  enif_get_atom(env, argv[1], header_atom, header_atom_len + 1, ERL_NIF_LATIN1);


We should avoid creating atoms from user input such as files. We should return binaries here.

I didn't quite understand what you meant. The above code is just for obtaining the header atom. The user specifies whether the header is present or not as :true or :false in Elixir and I am just reading that value in C here. Do you mean that instead of an atom, I should ask the user to send a string as argument?

The path to the file is specified in arg[0] which is read as a binary in C

@anshuman23 oh, I thought you were reading from the file. This is basically the conversion of true and false to C values? Sorry the confusion then. :)

Yes this is just that.
Not at all, thanks for your inputs! :)

PS: I'll brainstorm a bit on what to do next and give you an update via email soon

josevalim · 2018-07-17T21:21:06Z

👍

Added fast direct CSV-to-matrix functionality

bcd323d

josevalim reviewed Jul 17, 2018

View reviewed changes

anshuman23 merged commit eabb7cd into master Jul 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Added fast C based direct CSV-to-matrix functionality with options #23

Added fast C based direct CSV-to-matrix functionality with options #23

Uh oh!

anshuman23 commented Jul 17, 2018

Uh oh!

josevalim Jul 17, 2018

Uh oh!

anshuman23 Jul 17, 2018

Uh oh!

anshuman23 Jul 17, 2018

Uh oh!

josevalim Jul 17, 2018

Uh oh!

anshuman23 Jul 17, 2018

Uh oh!

josevalim commented Jul 17, 2018

Uh oh!

Uh oh!

Added fast C based direct CSV-to-matrix functionality with options #23

Added fast C based direct CSV-to-matrix functionality with options #23

Uh oh!

Conversation

anshuman23 commented Jul 17, 2018

Uh oh!

josevalim Jul 17, 2018

Choose a reason for hiding this comment

Uh oh!

anshuman23 Jul 17, 2018

Choose a reason for hiding this comment

Uh oh!

anshuman23 Jul 17, 2018

Choose a reason for hiding this comment

Uh oh!

josevalim Jul 17, 2018

Choose a reason for hiding this comment

Uh oh!

anshuman23 Jul 17, 2018

Choose a reason for hiding this comment

Uh oh!

josevalim commented Jul 17, 2018

Uh oh!

Uh oh!