[SYSTEMML-153] Read file with extension 'csv' not require mtd file #66
Did we actually arrive at a consensus on this JIRA? |
Tried this out on Spark and it works great. Having the ability to automatically infer a CSV format for a ".csv" file extension is smart, and avoids confusion for the new user. This doesn't affect any existing options such as supplying a Also, I ran this using our standalone distribution, and received a |
Great catch @dusenberrymw . I had assumed commons-io was already added to the libs for standalone since it is such a common library, but I was mistaken. I'll update the standalone assembly to include it. @mboehm7 I don't believe there was a consensus. My main concern with this issue is that I would prefer not to intimidate new users by requiring them to do things like |
As a new user of SystemML, I vote for the idea. When I first started using SystemML, accidental errors in metadata were the most frustrating part. It took me a while to search through the documentation for my mistakes, only to find that the error was a typo in an mtd file. To build a simple sample I had to write 3 mtds manually:
echo '{
"data_type": "frame",
"format": "csv",
"sep": ",",
"header": false,
"na.strings": [ "NA", "null", "NULL", "NaN" ]
}' | hadoop fs -put - my.data.file.csv.mtd
echo '1,2,2,1,1' | hadoop fs -put - file-type.csv
echo '{"rows": 1, "cols" : 5, "format":"csv"}' | hadoop fs -put - file-type.csv.mtd
printf "0.7\n0.3" | hadoop fs -put - split-perc.csv
echo '{"rows": 2, "cols": 1, "format": "csv"}' | hadoop fs -put - split-perc.csv.mtd |
I think the first one is not avoidable though. |
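For the second and third cases above, the metadata only restates what the CSV file already says. As a rough sketch (not part of SystemML; `csv_mtd` is a hypothetical helper name, and the assumption is that a plain numeric CSV has no header row), the `rows`, `cols`, and `format` fields could be derived from the data instead of typed by hand:

```python
import csv
import io
import json


def csv_mtd(csv_text, sep=","):
    """Build a minimal metadata dict (rows, cols, format) from CSV text.

    Hypothetical helper: assumes no header row and a rectangular file.
    """
    # Skip blank lines, which csv.reader yields as empty lists.
    rows = [r for r in csv.reader(io.StringIO(csv_text), delimiter=sep) if r]
    if not rows:
        raise ValueError("empty CSV input")
    cols = len(rows[0])
    if any(len(r) != cols for r in rows):
        raise ValueError("rows have differing column counts")
    return {"rows": len(rows), "cols": cols, "format": "csv"}


# The two small files from the comment above.
print(json.dumps(csv_mtd("1,2,2,1,1\n")))
print(json.dumps(csv_mtd("0.7\n0.3")))
```

The printed JSON could then be piped to `hadoop fs -put -` exactly as in the hand-written commands, which removes the chance of a typo in the row or column counts.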
@ethanyxu Thank you for your comments. I had a very similar experience. Your second and third examples are exactly the kinds of situations where I don't want to supply metadata. |
Since no consensus could be reached and the MLContext API allows data input without what I would consider burdensome mandatory metadata requirements, I'll close this PR. |
Add construction of federated object with `federated` Adds a new function `federated` which takes two parameters `addresses` and `ranges`. Closes #66.
If a read statement reads a file with a `.csv` extension and no format parameter is specified, such as `read("m.csv")`, the file is treated as a CSV file, so a metadata file is optional.
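The rule described here can be sketched in a few lines. This is only an illustration of the decision order, not the code in this PR: `infer_format` is a hypothetical name, and the exact, case-sensitive match on `.csv` is an assumption.

```python
import os


def infer_format(path, explicit_format=None):
    """Return the format to use for a read, or None if metadata is needed.

    Hypothetical sketch: an explicit format always wins; otherwise a
    ".csv" extension implies CSV; anything else falls back to the
    .mtd metadata file (signalled here by returning None).
    """
    if explicit_format is not None:
        return explicit_format
    _, ext = os.path.splitext(path)
    if ext == ".csv":  # assumption: exact, case-sensitive match
        return "csv"
    return None


print(infer_format("m.csv"))
print(infer_format("m.csv", explicit_format="text"))
print(infer_format("m.data"))
```

Checking the explicit parameter first keeps existing scripts unchanged, which matches the observation earlier in the thread that existing options are not affected.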