Add `text` argument to fread #2753
Codecov Report
@@ Coverage Diff @@
## master #2753 +/- ##
==========================================
+ Coverage 90.75% 90.78% +0.02%
==========================================
Files 61 61
Lines 11731 11745 +14
==========================================
+ Hits 10647 10663 +16
+ Misses 1084 1082 -2
R/fread.R
Outdated
@@ -1,5 +1,5 @@
-fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose",FALSE),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64","integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, index=NULL, showProgress=interactive(), data.table=getOption("datatable.fread.datatable",TRUE), nThread=getDTthreads(), logical01=getOption("datatable.logical01", FALSE), autostart=NA)
+fread <- function(input="",file,text=NULL,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose",FALSE),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64","integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, index=NULL, showProgress=interactive(), data.table=getOption("datatable.fread.datatable",TRUE), nThread=getDTthreads(), logical01=getOption("datatable.logical01", FALSE), autostart=NA)
On first glance, I'm thinking `text=` should have no default (consistent with `file=`) and then tested with `missing(text)` rather than `=NULL`?
OK: I vaguely remember a preference for `=NULL` over `missing` for new arguments from another PR, but will change.
Was this what you were remembering: #2424 (comment)? That's a good point about wrapping fread. Maybe I should think again.
Yes, that's the one. I think we could change `file` to `file=NULL` and test `is.null(file)` without harm.
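The point about wrapping can be sketched in base R. The functions below are hypothetical stand-ins written for this illustration (`source_missing`, `source_null` and `forward` are not part of data.table); they only show how a `missing()` test and an `is.null()` test react when a wrapper forwards `NULL` to mean "not supplied".

```r
# Hypothetical stand-ins for fread's argument handling; not the real fread.
# Variant 1: arguments without defaults, detected with missing().
source_missing <- function(input = "", file, text) {
  if (!missing(text)) "text" else if (!missing(file)) "file" else "input"
}

# Variant 2: arguments defaulting to NULL, detected with is.null().
source_null <- function(input = "", file = NULL, text = NULL) {
  if (!is.null(text)) "text" else if (!is.null(file)) "file" else "input"
}

# A wrapper that forwards its arguments, using NULL to mean "not supplied".
forward <- function(reader, file = NULL, text = NULL) {
  do.call(reader, list(file = file, text = text))
}

res_null    <- forward(source_null, file = "a.csv")     # NULL text is ignored
res_missing <- forward(source_missing, file = "a.csv")  # explicit NULL counts as supplied
print(c(null_variant = res_null, missing_variant = res_missing))
```

With the `missing()` variant, a wrapper has to omit the argument from the call entirely, which usually means building the call conditionally; with `NULL` defaults it can forward everything unconditionally.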
Agree |
Can't specify a root use-case myself -- simply noticed it as an open issue. Though my hunch (and that's all it is) is that for a lot of "production" or "non-interactive" settings, there may be a need to rely on But it's a minor enhancement-- and maybe the |
My 2¢ regarding #1423: Clearly not all features that are suggested ought to be implemented: some are too hard, some are too much of a corner case, some would unnecessarily encumber the API, some cannot be implemented without sacrificing performance of the core use-case. However, I feel that the proper place for making such an argument is on the issue itself, giving the original author a better ability to defend the utility of their request. Preferably this should be done before the FR is implemented, so as not to cause any grievance to the author of the PR.
So, is #1423 a reasonable feature? I can foresee some uses for it: for example, if I have a text file that I need to pre-process somehow. Say, there are comments on some lines that need to be stripped. Or it's a SQL table dump and I need to strip away "insert into mytbl values(". Or some lines contain junk and I want to remove them. In fact, if I have a problematic file, why not fread it with sep='' first, correct the problems, and then fread the corrected DT once again?
@HughParsonage The only thing I'd suggest is to allow character vector as an explicit
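A minimal sketch of the pre-processing use case described above, assuming a data.table version that includes the `text` argument from this PR. The sample lines are made up for illustration; in practice they would typically come from `readLines()` on the problematic file.

```r
library(data.table)

# Made-up raw lines standing in for the contents of a file with stray comments.
raw <- c(
  "# exported by a hypothetical tool",
  "id,value",
  "1,10",
  "# a stray comment line",
  "2,20"
)

# Drop the comment lines, then hand the remaining lines to fread as data.
clean <- raw[!grepl("^#", raw)]
dt <- fread(text = clean)
print(dt)
```

Each element of the character vector is treated as one line of input, so no temporary file has to be managed by the caller.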
I was trying to say that this should be allowed:
whereas this shouldn't:
And the reason why the second version shouldn't be allowed is that there is another pending FR #536 which would use this same syntax to interpret it as "parse files 'foo', 'bar' and 'baz' and return them as a list of (Perhaps this is already how you've implemented it, in which case we'd only need a small test to verify that that's the case).
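The original snippets are not preserved in this comment, so the following is only one plausible reading of the distinction, assuming a data.table version that includes the `text` argument: a multi-element character vector is accepted when named explicitly as `text=`, but not when passed positionally as `input`. The error check uses `tryCatch` rather than assuming a particular message.

```r
library(data.table)

# A multi-element character vector passed explicitly as text=:
# each element is read as one line of data.
ok <- fread(text = c("A,B", "1,2", "3,4"))

# The same kind of vector passed positionally (as input=) is not treated as data.
# Record whether the call errors, without relying on the wording of the message.
positional_errors <- tryCatch({
  fread(c("foo", "bar", "baz"))
  FALSE
}, error = function(e) TRUE)

print(ok)
print(positional_errors)
```

Keeping the positional form an error leaves room for #536 to give it a different meaning later.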
I've added a test to ensure that |
NEWS.md
Outdated
@@ -49,6 +49,7 @@ These options are meant for temporary use to aid your migration, [#2652](https:/
* `skip=` and `nrow=` are more reliable and are no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267).
* Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., [#1139](https://github.com/Rdatatable/data.table/issues/1139) and [zUMIs/19](https://github.com/sdparekh/zUMIs/issues/19). Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set TEMPDIR to `/dev/shm`; see `?tempdir`.
* Detecting whether a very long input string is a file name or data is now much faster, [#2531](https://github.com/Rdatatable/data.table/issues/2531). Many thanks to @javrucebo for the detailed report, benchmarks and suggestions.
+* New argument `text` for an inputs which is known to be a literal string of the data.
New argument `text` for an input which is known to be a literal string. This can also accept a character vector representing a multi-line input.
In light of #2531, it may be useful to add a `text` argument for specialized cases where `input` is known in advance to be large and not a file/system command. This may have the side-effect of reducing the timings of the test suite, if the `text` argument is used for small examples in tests.
Closes #1423: I've provided a way for `fread` to accept a multi-length `text` argument, as in `read.table`, as described in that issue page.