Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added fast C based direct CSV-to-matrix functionality with options #23

Merged
merged 1 commit into from
Jul 18, 2018

Conversation

anshuman23
Copy link
Owner

@josevalim, this function is very fast: I compared it's execution time with that of Pandas' read_csv for the MNIST test dataset (test.csv) (source) and Pandas took 2.5492329597473145 seconds while Tensorflex's load_csv_as_matrix/2 took 1.711494 seconds.

Usage is simple, and gives the user the option of passing in if header is not present (defaults to header being present), and also the ability of choosing the delimiter (defaults to ','). I have added two sample CSV files in the test folder around which I will write tests soon, but for now I am using here to showcase the usage. Also, the test.csv file I use in the example code below belongs to the MNIST Kaggle CSV data (source), which contains 28000 rows and 784 columns. It is comma delimited and also contains a header. From that, I created a custom file without the header present which I will refer to as test_without_header.csv in the examples below. I have not put the file on Github since it is relatively large in size.

Code usage:

iex(1)> mat = Tensorflex.load_csv_as_matrix("test.csv")
%Tensorflex.Matrix{
  data: #Reference<0.4024686574.590479361.258459>,
  ncols: 784,
  nrows: 28000
}

iex(2)> Tensorflex.matrix_pos mat, 5,97
80.0

iex(3)> Tensorflex.matrix_pos mat, 5,96
13.0

On a visual inspection of the very large test.csv file, one can see that the values in these particular positions are correct. Now I will show usage for the same file but without header:

iex(1)> n = Tensorflex.load_csv_as_matrix("test/test_without_header.csv", header: :false)    
%Tensorflex.Matrix{
  data: #Reference<0.4024686574.590479364.257078>,
  ncols: 784,
  nrows: 28000
}

iex(2)> Tensorflex.matrix_pos n,5,97
80.0
iex(3)> Tensorflex.matrix_pos n,5,96
13.0

Next for the sample CSV files sample2.csv and sample1.csv, I'll show that the function does the job. Here, it is also reasonable to use the Tensorflex.matrix_to_lists/1 function for the matrix because the data is not high dimensional. It is not advisable to do that for the above mentioned MNIST files:

iex(1)> m1 = Tensorflex.load_csv_as_matrix("test/sample1.csv", header: :false) 
%Tensorflex.Matrix{
  data: #Reference<0.3878093040.3013214209.247502>,
  ncols: 5,
  nrows: 3
}
iex(2)> Tensorflex.matrix_to_lists m1
[
  [1.0, 2.0, 3.0, 4.0, 5.0],
  [6.0, 7.0, 8.0, 9.0, 10.0],
  [11.0, 12.0, 13.0, 14.0, 15.0]
]

iex(3)> m2 = Tensorflex.load_csv_as_matrix("test/sample2.csv", header: :true, delimiter: "-")
%Tensorflex.Matrix{
  data: #Reference<0.4024686574.590479361.258952>,
  ncols: 4,
  nrows: 3
}

iex(4)> Tensorflex.matrix_to_lists m2
[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

Also incorrect usage will raise:

iex(1)> n = Tensorflex.load_csv_as_matrix("../keras-mnist/testh.csv", header: :no_header, delimiter: ",")
** (ArgumentError) header indicator atom must be either :true or :false 
    (tensorflex) lib/tensorflex.ex:122: Tensorflex.load_csv_as_matrix/2

unsigned int header_atom_len;
enif_get_atom_length(env, argv[1], &header_atom_len, ERL_NIF_LATIN1);
char* header_atom = (char*)enif_alloc(header_atom_len + 1);
enif_get_atom(env, argv[1], header_atom, header_atom_len + 1, ERL_NIF_LATIN1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid creating atoms from user input such as files. We should return binaries here.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't quite understand what you meant. The above code is just for obtaining the header atom. The user specifies whether the header is present or not as :true or :false in Elixir and I am just reading that value in C here. Do you mean that instead of an atom, I should ask the user to send a string as argument?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path to the file is specified in arg[0] which is read as a binary in C

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anshuman23 oh, I thought you were reading from the file. This is basically the conversion of true and false to C values? Sorry the confusion then. :)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is just that.
Not at all, thanks for your inputs! :)

PS: I'll brainstorm a bit on what to do next and give you an update via email soon

@josevalim
Copy link
Contributor

👍

@anshuman23 anshuman23 merged commit eabb7cd into master Jul 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants