Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add text argument to fread #2753

Merged
merged 17 commits into from
Aug 23, 2018
Merged

Add text argument to fread #2753

merged 17 commits into from
Aug 23, 2018

Conversation

HughParsonage
Copy link
Member

@HughParsonage HughParsonage commented Apr 16, 2018

In light of #2531, it may be useful to add a text argument for specialized cases where input is known in advance to be large and not a file/system command.

This may have the side-effect of reducing the timings of the test suite, if the text argument is used for small examples in tests.

Closes #1423: I've provided a way for fread to accept a multi-length text argument, as in read.table, as described in that issue page.

@codecov-io
Copy link

codecov-io commented Apr 16, 2018

Codecov Report

Merging #2753 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2753      +/-   ##
==========================================
+ Coverage   90.75%   90.78%   +0.02%     
==========================================
  Files          61       61              
  Lines       11731    11745      +14     
==========================================
+ Hits        10647    10663      +16     
+ Misses       1084     1082       -2
Impacted Files Coverage Δ
R/fread.R 92.51% <100%> (+2.07%) ⬆️
R/setops.R 84% <0%> (+0.24%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90d3736...4ac3733. Read the comment docs.

R/fread.R Outdated
@@ -1,5 +1,5 @@

fread <- function(input="",file,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose",FALSE),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64","integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, index=NULL, showProgress=interactive(), data.table=getOption("datatable.fread.datatable",TRUE), nThread=getDTthreads(), logical01=getOption("datatable.logical01", FALSE), autostart=NA)
fread <- function(input="",file,text=NULL,sep="auto",sep2="auto",dec=".",quote="\"",nrows=Inf,header="auto",na.strings=getOption("datatable.na.strings","NA"),stringsAsFactors=FALSE,verbose=getOption("datatable.verbose",FALSE),skip="__auto__",select=NULL,drop=NULL,colClasses=NULL,integer64=getOption("datatable.integer64","integer64"), col.names, check.names=FALSE, encoding="unknown", strip.white=TRUE, fill=FALSE, blank.lines.skip=FALSE, key=NULL, index=NULL, showProgress=interactive(), data.table=getOption("datatable.fread.datatable",TRUE), nThread=getDTthreads(), logical01=getOption("datatable.logical01", FALSE), autostart=NA)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On first glance, I'm thinking text= should have no default (consistent with file=) and then tested with missing(text) rather than =NULL?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK: I vaguely remember a preference for =NULL over missing for new arguments from another PR, but will change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this what you were remembering : #2424 (comment). That's a good point about wrapping fread. Maybe I should think again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the one. I think we could change file to file=NULL and test is.null(file) without harm.

@mattdowle
Copy link
Member

mattdowle commented Apr 17, 2018

Agree text= should be added, but for a different reason. Since #2531 is fixed, I doubt there's a speed issue and I'd be surprised if it speeds up test.data.table(). Also a bit hesitant over paste0() of length>1 text= character vector (as per #1423) ... where and why is the root use-case there? If it's connections, they could be supported directly rather than opening the door to a workaround we will have to support in future. If it's large text= input then the paste0() will add a very large single string to the global character cache.
The compelling reason for text= is security, so that folks can provide text input directly but also state that it should never be passed to the system's shell to be run there. Adding an option() to turn off fread sending input= to the shell might also be welcome perhaps. Production profiles could set that option to save needing to check or change any fread calls.

@HughParsonage
Copy link
Member Author

Can't specify a root use-case myself -- simply noticed it as an open issue. Though my hunch (and that's all it is) is that for a lot of "production" or "non-interactive" settings, there may be a need to rely on fread correctly interpreting input as text. Indeed, it may be useful to add a system_input argument when input must be interpreted as a system command so that a greater range of system commands could be used.

But it's a minor enhancement-- and maybe the paste0 hack would attract more pain than gain, so perhaps close for now.

@st-pasha
Copy link
Contributor

st-pasha commented Apr 18, 2018

My 2¢ regarding #1423:

Clearly not all features that are suggested ought to be implemented: some are too hard, some are too much of a corner case, some would unnecessary encumber the API, some cannot be implemented without sacrificing performance of the core use-case.

However I feel that a proper place for making such an argument is on the issue itself, giving the original author a better ability to defend the utility of their request. Preferably this should be done before the FR is implemented, so as not to cause any grievance to the author of the PR.

So, is #1423 a reasonable feature? I can foresee some uses for it: for example if I have a text file that I need to pre-process somehow. Say, there are comments on some lines that need to be stripped. Or it's a SQL table dump and I need to strip away "insert into mytbl values(". Or some lines contain junk and I want to remove them. In fact, if I have a problematic file, why not fread it with sep='' first, correct the problems, and then fread the corrected DT once again?

@HughParsonage The only thing I'd suggest is to allow character vector as an explicit text= argument only. This is because another possible interpretation of a character vector is that it is a list of files that we need to read (see #586).

@HughParsonage
Copy link
Member Author

@st-pasha Sorry I don't understand this:

The only thing I'd suggest is to allow character vector as an explicit text= argument only. This is because another possible interpretation of a character vector is that it is a list of files that we need to read (see #586).

@st-pasha
Copy link
Contributor

st-pasha commented Apr 18, 2018

I was trying to say that this should be allowed:

fread(text = c("foo", "bar", "baz"))

whereas this shouldn't:

fread(c("foo", "bar", "baz"))

And the reason why second version shouldn't be allowed because there is another pending FR #536 which would use this same syntax to interpret it as "parse files 'foo', 'bar' and 'baz' and return them as a list of data.tables".

(Perhaps this is already how you've implemented it, in which case we'd only need a small test to verify that that's the case).

@HughParsonage
Copy link
Member Author

I've added a test to ensure that fread(c("foo", "bar", "baz")) errors as expected.

NEWS.md Outdated
@@ -49,6 +49,7 @@ These options are meant for temporary use to aid your migration, [#2652](https:/
* `skip=` and `nrow=` are more reliable and are no longer affected by invalid lines outside the range specified. Thanks to Ziyad Saeed and Kyle Chung for reporting, [#1267](https://github.com/Rdatatable/data.table/issues/1267).
* Ram disk (`/dev/shm`) is no longer used for the output of system command input. Although faster when it worked, it was causing too many device full errors; e.g., [#1139](https://github.com/Rdatatable/data.table/issues/1139) and [zUMIs/19](https://github.com/sdparekh/zUMIs/issues/19). Thanks to Kyle Chung for reporting. Standard `tempdir()` is now used. If you wish to use ram disk, set TEMPDIR to `/dev/shm`; see `?tempdir`.
* Detecting whether a very long input string is a file name or data is now much faster, [#2531](https://github.com/Rdatatable/data.table/issues/2531). Many thanks to @javrucebo for the detailed report, benchmarks and suggestions.
* New argument `text` for an inputs which is known to be a literal string of the data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New argument text for an input which is known to be a literal string. This can also accept a character vector representing a multi-line input.

@mattdowle mattdowle added this to the v1.11.2 milestone Apr 29, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.11.6 Aug 22, 2018
@mattdowle mattdowle merged commit 574dabb into master Aug 23, 2018
@mattdowle mattdowle deleted the fread-text-arg branch August 23, 2018 01:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants