Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mention options in importing vign #3578

Merged
merged 2 commits into from
May 31, 2019
Merged

mention options in importing vign #3578

merged 2 commits into from
May 31, 2019

Conversation

jangorecki
Copy link
Member

No description provided.

## Avoid of package options

Common practice is to provide customization of various options globally for a package using `options` function. `data.table` is no exception here. Use of options of your dependency package should be avoided inside your package, and use of options by end user should be used with extra care. The reason for that is because those options works globally, for your package, other packages, and user's code.
Consider the case when `data.table` is imported by `pkgX`, where `pkgX` compute a join. Then an end user sets `options(datatable.nomatch=NULL)`, as a result join performed by `pkgX` is now an inner join, not outer join. Options are generally safe when you work just with `data.table`, or the package you are developing will be an internal package that will work just with `data.table`. Remember that global options should be well documented.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A safe alternative is to set/reset options when needed, but this is strictly dominated for data.table because all options could be set explicitly in the function calls (I think?)

Discovered a case where it appears impossible to avoid using options() here:

https://twitter.com/michael_chirico/status/1126737388393230336

@codecov
Copy link

codecov bot commented May 22, 2019

Codecov Report

Merging #3578 into master will increase coverage by 0.6%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #3578     +/-   ##
=========================================
+ Coverage   97.58%   98.19%   +0.6%     
=========================================
  Files          66       66             
  Lines       12695    12922    +227     
=========================================
+ Hits        12389    12689    +300     
+ Misses        306      233     -73
Impacted Files Coverage Δ
R/fcast.R 99.4% <0%> (-0.01%) ⬇️
src/rbindlist.c 100% <0%> (ø) ⬆️
src/froll.c 100% <0%> (ø) ⬆️
src/freadR.c 100% <0%> (ø) ⬆️
src/uniqlist.c 100% <0%> (ø) ⬆️
src/assign.c 100% <0%> (ø) ⬆️
src/wrappers.c 100% <0%> (ø) ⬆️
R/xts.R 100% <0%> (ø) ⬆️
src/init.c 100% <0%> (ø) ⬆️
src/vecseq.c 100% <0%> (ø) ⬆️
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c68e95e...49485a8. Read the comment docs.

@mattdowle
Copy link
Member

mattdowle commented May 22, 2019

I fear it's not sufficient to just document this. There should be something more robust. I've been aware of this issue for some time and have not created any new options that affect behavior, only to manage migration temporarily and stated in news that those options will eventually be removed (e.g. datatable.old.unique.by.key). We only have 1 long-standing option which changes the results: datatable.nomatch. If a user changes that option for their own usage (because they prefer the nomatch=0L default), it could affect packages behavior. Even packages that didn't set any datatable options and were completely compliant with the text in this PR.

datatable.naturaljoin is now the 2nd one. Can we find an acceptable way for natural join to be explicit so we don't need the new option?

Here are all the options as of now in dev : https://github.com/Rdatatable/data.table/blob/master/R/onLoad.R#L44. Other than the those two (datatable.nomatch and datatable.naturaljoin) they all affect printing, optimization, or are temporary for migration.

This does mean we should try to remove datatable.nomatch too, or find a way for code inside packages to ignore its value.

@mattdowle
Copy link
Member

The thing with on=.NATURAL, is the user needs to know what a natural join is. That name comes from SQL. A lot of them don't seem to know SQL and find SQL difficult. Also .NATURAL is a bit long to type.
How about on=.NAMES ?

@MichaelChirico
Copy link
Member

suggestion of .NATURAL was to make the cognitive burden of new symbols hopefully minimal... you're right that to blank slate users "natural" is probably a big ?

I would suggest .SHARED might come across easier...

@jangorecki
Copy link
Member Author

jangorecki commented May 23, 2019

Agree about options risky behaviour. There is a way to detect if call was made from package topenv(parent.frame(1)) (applied), so we could potentially use this. On the other hand it introduce some inconsistency which can make some confusion, thus probably best would be to raise message when such behaviour is detected. Filled #3585.

There are many potential options already mentioned...
.COMMON
.NATURAL
.NAMES
.SHARED
Why I think .NATURAL is the best is because it is already used in some existing standard. It happens that this standard, SQL, is generally the most well recognised standard for data manipulation. So as Michael said we can avoid introducing new symbols. At least those who know SQL will pick it up easily.

@mattdowle
Copy link
Member

mattdowle commented May 23, 2019

Yes but to me, what is "natural" is X["ip576"] using the first column of X's key, like super-charged rownames. Anything else doesn't feel natural.
If I see X[.(id="ip576")] I know it's going to ignore the id= part and return the same as X["ip576"]. But if I see X[.(id="ip576"), on=.NAMES] then it feels natural that it's going to use the column names to join (id to id) in this case.

@mattdowle
Copy link
Member

mattdowle commented May 23, 2019

Blue sky idea: passing a DT to on= isn't used for anything iirc. How about X[on=Y, ...] doing what X[Y, on=.NATURAL] does now in dev. When on= is passed a data.table, it wouldn't be allowed to use i= too.
In the example I gave in previous comment, it would be X[on=.(id="ip576")].

@MichaelChirico
Copy link
Member

Blue sky idea

it's neat and could work, but are you proposing a replacement to the new feature? The fundamental use case of "X[Y] might reasonably be expected to do a natural join" remains...

@mattdowle
Copy link
Member

but are you proposing a replacement to the new feature?

just accessing the new feature with a different api, to avoid the new symbol and new option, and currently I'm quite liking it too even if it didn't avoid the new symbol and option.

The fundamental use case of "X[Y] might reasonably be expected to do a natural join

Fundamental to data.table is that X[Y] is a join using X's key like super-charged rownames, and that data.table has no rownames because key(X) replaces rownames. Even if newcomers do expect a natural join from X[Y], I can't see that changing in data.table because we like and use it a lot. There is an error message when X has no key and that suggests on=.

@mattdowle mattdowle added this to the 1.12.4 milestone May 31, 2019
@mattdowle mattdowle merged commit 06a5ed6 into master May 31, 2019
@mattdowle mattdowle deleted the vign-ops branch May 31, 2019 02:17
@mattdowle mattdowle mentioned this pull request May 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants