cbindlist, mergelist #4370

Open
wants to merge 121 commits into base: master
Changes from 101 commits
3e72c8d
cbindlist
jangorecki Apr 10, 2020
a915832
add cbind by reference, timing
jangorecki Apr 10, 2020
05dd562
R prototype of mergelist
jangorecki Apr 10, 2020
cba5bc1
wording
jangorecki Apr 10, 2020
1edf4d3
use lower overhead funs
jangorecki Apr 10, 2020
36bbd25
stick to int32 for now, correct R_alloc
jangorecki Apr 16, 2020
7d51dd6
bmerge C refactor for codecov and one loop for speed
jangorecki Apr 16, 2020
0437da5
address revealed codecov gaps
jangorecki Apr 16, 2020
e287213
refactor vecseq for codecov
jangorecki Apr 16, 2020
5dc07bd
seqexp helper, some alloccol export on C
jangorecki Apr 17, 2020
a4d124e
bmerge codecov, types handled in R bmerge already
jangorecki Apr 17, 2020
40d3bfe
better comment seqexp
jangorecki Apr 17, 2020
beffe39
bmerge mult=error #655
jangorecki Apr 17, 2020
4e211a1
multiple new C utils
jangorecki Apr 17, 2020
fbddcd6
swap if branches
jangorecki Apr 17, 2020
01b2f9d
explain new C utils
jangorecki Apr 17, 2020
c8e070b
comments mostly
jangorecki Apr 17, 2020
3004748
reduce conflicts to PR #4386
jangorecki Apr 18, 2020
cf73fcf
comment C code
jangorecki Apr 19, 2020
b64c0c3
address multiple matches during update-on-join #3747
jangorecki Apr 19, 2020
348d5b7
Revert "address multiple matches during update-on-join #3747"
jangorecki Apr 19, 2020
df0c11a
merge.dt has temporarily mult arg, for testing
jangorecki Apr 24, 2020
5793508
minor changes to cbindlist c
jangorecki Apr 24, 2020
6017eac
dev mergelist, for single pair now
jangorecki Apr 24, 2020
f88e0de
add quiet option to cc()
jangorecki Apr 25, 2020
2387f09
mergelist tests
jangorecki Apr 25, 2020
5ae7d4d
add check for names to perhaps.dt
jangorecki Apr 25, 2020
d0b2af8
rm mult from merge.dt method
jangorecki Apr 25, 2020
7e51189
rework, clean, polish multer, fix right and full joins
jangorecki Apr 25, 2020
ea77bce
make full join symmetric
jangorecki Apr 26, 2020
06a1ae8
mergepair inner function to loop on
jangorecki Apr 26, 2020
a942940
extra check for symmetric
jangorecki Apr 26, 2020
dc5f263
mergelist manual
jangorecki Apr 26, 2020
bc17057
ensure no df-dt passed where list expected
jangorecki Apr 26, 2020
db30e44
comments and manual
jangorecki Apr 26, 2020
0dd82c3
handle 0 cols tables
jangorecki Apr 26, 2020
9fe7f55
more tests
jangorecki Apr 26, 2020
113f688
more tests and debugging
jangorecki Apr 26, 2020
9bcb814
move more logic closer to bmerge, simplify mergepair
jangorecki Apr 26, 2020
a7f39c9
more tests
jangorecki Apr 26, 2020
b1f39a6
revert not used changes
jangorecki Apr 26, 2020
29bd438
reduce not needed checks, cleanup
jangorecki Apr 26, 2020
ca0d76a
copy arg behavior, manual, no tests yet
jangorecki Apr 26, 2020
9ac7a89
cbindlist manual, export both
jangorecki Apr 27, 2020
384396b
cleanup processing bmerge to dtmatch
jangorecki Apr 28, 2020
11974f0
test function match order for easier preview
jangorecki Apr 28, 2020
de48d2d
vecseq gets short-circuit
jangorecki Apr 28, 2020
66e7d53
batch test allow browser
jangorecki Apr 28, 2020
25d0633
big cleanup
jangorecki Apr 29, 2020
fee063b
remove unneeded stuff, reduce diff
jangorecki Apr 29, 2020
84d7146
more cleanup, minor manual fixes
jangorecki Apr 29, 2020
d78d136
add proper test scripts
jangorecki Apr 29, 2020
2b1795b
Merge branch 'master' into cbind-merge-list
jangorecki Apr 29, 2020
dabb55c
comment out not used code for coverage
jangorecki Apr 29, 2020
3ca7d4a
more tests, some nocopy opts
jangorecki Apr 30, 2020
e4b14e6
rename sql test script, should fix codecov
jangorecki Apr 30, 2020
b1fce17
simplify dtmatch inner branch
jangorecki Apr 30, 2020
50f9e89
more precise copy, now copy only T or F
jangorecki Apr 30, 2020
d43be04
unused arg not yet in api, wording
jangorecki Apr 30, 2020
4580dd4
comments and refer issues
jangorecki Apr 30, 2020
5d0e991
codecov
jangorecki Apr 30, 2020
03aa427
hasindex coverage
jangorecki Apr 30, 2020
b15ab93
codecov gap
jangorecki Apr 30, 2020
17d2fa8
tests for join using key, cols argument
jangorecki Apr 30, 2020
492d3b5
fix missing import forderv
jangorecki Apr 30, 2020
a5c4a26
more tests, improve missing on handling
jangorecki May 1, 2020
426e187
more tests for order of inner and full join for long keys
jangorecki May 1, 2020
c8ded9c
new allow.cartesian option, #4383, #914
jangorecki May 3, 2020
674bff8
reduce diff, improve codecov
jangorecki May 3, 2020
0a483c2
reduce diff, comments
jangorecki May 3, 2020
db3249a
need more DT, not lists, mergelist 3+ tbls
jangorecki May 3, 2020
a573286
proper escape heavy check
jangorecki May 3, 2020
78123b0
unit tests
jangorecki May 4, 2020
9273212
more tests, address overalloc failure
jangorecki May 6, 2020
c5df010
mergelist and cbindlist retain index
jangorecki May 8, 2020
64d4f5e
manual, examples
jangorecki May 8, 2020
211da09
fix manual
jangorecki May 8, 2020
4275e8c
minor clarify in manual
jangorecki May 8, 2020
102de68
retain keys, right outer join for snowflake schema joins
jangorecki May 8, 2020
4487360
duplicates in cbindlist
jangorecki May 8, 2020
2923719
recycling in cbindlist
jangorecki May 9, 2020
658410b
escape 0 input in copyCols
jangorecki May 9, 2020
b708507
empty input handling
jangorecki May 9, 2020
1b9f913
closing cbindlist
jangorecki May 9, 2020
d70cd41
vectorized _on_ and _join.many_ arg
jangorecki May 10, 2020
a5179b7
rename dtmatch to dtmerge
jangorecki May 10, 2020
c86c9ad
vectorized args: how, mult
jangorecki May 10, 2020
b89b6f8
full join, reduce overhead for mult=error
jangorecki May 10, 2020
249e09a
mult default value dynamic
jangorecki May 11, 2020
1fa9b40
fix manual
jangorecki May 11, 2020
f889b0e
add "see details" to Rd
MichaelChirico May 11, 2020
f35a555
mention shared on in arg description
MichaelChirico May 11, 2020
52d4f9f
amend feedback from Michael
jangorecki May 11, 2020
2884b29
semi and anti joins will not reorder x columns
jangorecki May 11, 2020
3f3f9de
Merge branch 'master' into cbind-merge-list
jangorecki Dec 9, 2023
6df88bc
spelling, thx to @jan-glx
jangorecki Dec 9, 2023
060610b
check all new funs used and add comments
jangorecki Dec 9, 2023
db58b6c
bugfix, sort=T needed for now
jangorecki Dec 9, 2023
53b9b0d
Merge branch 'master' into cbind-merge-list
MichaelChirico Feb 19, 2024
c6add42
Update NEWS.md
MichaelChirico Feb 19, 2024
ec1973f
Merge branch 'master' into cbind-merge-list
MichaelChirico Feb 19, 2024
d26265f
Merge branch 'master' into cbind-merge-list
MichaelChirico Aug 28, 2024
115b1eb
NEWS placement
MichaelChirico Aug 28, 2024
e2ae4d0
numbering
MichaelChirico Aug 28, 2024
9a1d7db
ascArg->order
MichaelChirico Aug 28, 2024
3ead046
Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-mer…
MichaelChirico Aug 28, 2024
d579af4
attempt to restore from master
MichaelChirico Aug 28, 2024
9a51230
Update to stopf() error style
MichaelChirico Aug 28, 2024
1b363ad
Need isFrame for now
MichaelChirico Aug 28, 2024
e9387d2
More quality checks: any(!x)->!all(x); use vapply_1{b,c,i}
MichaelChirico Aug 28, 2024
b30437b
really restore from master
MichaelChirico Aug 28, 2024
6b9aa6c
try to PROTECT() before duplicate()
MichaelChirico Aug 28, 2024
71bb8b1
update error message in test
MichaelChirico Aug 28, 2024
40191d7
appease the rchk gods
MichaelChirico Aug 29, 2024
3758316
extraneous space
MichaelChirico Aug 29, 2024
e4e5d8c
missing ';'
MichaelChirico Aug 29, 2024
338711a
use catf
MichaelChirico Aug 29, 2024
008abef
simplify perhapsDataTableR
MichaelChirico Aug 29, 2024
854d35e
move sqlite.Rraw.manual into other.Rraw
MichaelChirico Aug 29, 2024
c975c14
simplify for loop
MichaelChirico Aug 29, 2024
5952dd8
Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-mer…
MichaelChirico Aug 29, 2024
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -56,6 +56,8 @@ export(nafill)
export(setnafill)
export(.Last.updated)
export(fcoalesce)
+export(cbindlist)
+export(mergelist)
export(substitute2)
#export(DT) # mtcars |> DT(i,j,by) #4872 #5472

2 changes: 2 additions & 0 deletions NEWS.md
@@ -16,6 +16,8 @@

2. `cedta()` now returns `FALSE` if `.datatable.aware = FALSE` is set in the calling environment, [#5654](https://github.com/Rdatatable/data.table/issues/5654).

+3. (add example here?) New functions `cbindlist` and `mergelist` have been implemented and exported. Works like `cbind`/`merge` but takes `list` of data.tables on input. `merge` happens in `Reduce` fashion. Supports `how` (_left_, _inner_, _full_, _right_, _semi_, _anti_, _cross_) joins and `mult` argument, closes [#599](https://github.com/Rdatatable/data.table/issues/599) and [#2576](https://github.com/Rdatatable/data.table/issues/2576).
Review comment (Member):

self-citation here?

(add example here?)

also please address.


## NOTES

1. `transform` method for data.table sped up substantially when creating new columns on large tables. Thanks to @OfekShilon for the report and PR. The implemented solution was proposed by @ColeMiller1.
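
The NEWS entry above says that `merge` happens in `Reduce` fashion over a list of tables. As a rough illustration of that idea only, here is a base R sketch on plain data.frames. It does not call the PR's `mergelist` or `cbindlist`; the helper names `merge_reduce` and `cbind_reduce` and the sample data are hypothetical, and a left join is assumed for the example.

```r
# Base R sketch of a "Reduce-fashion" merge over a list of tables.
# Uses base merge() on data.frames, NOT the mergelist() added by this PR.
tables <- list(
  data.frame(id = c(1L, 2L, 3L), a = c("x", "y", "z")),
  data.frame(id = c(1L, 2L, 4L), b = c(10, 20, 40)),
  data.frame(id = c(2L, 3L, 1L), c = c(TRUE, FALSE, TRUE))
)

# Hypothetical helper: left-join each next table onto the accumulated
# result, one pair at a time.
merge_reduce <- function(l, by) {
  Reduce(function(x, y) merge(x, y, by = by, all.x = TRUE), l)
}

res <- merge_reduce(tables, by = "id")

# Hypothetical helper: column-wise bind of equally long tables,
# the cbind analogue for a list input.
cbind_reduce <- function(l) do.call(cbind, l)
wide <- cbind_reduce(list(data.frame(p = 1:2), data.frame(q = 3:4)))
```

Each step joins only two tables, so the result of one pair becomes the left side of the next pair.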
26 changes: 18 additions & 8 deletions R/data.table.R
@@ -206,7 +206,7 @@ replace_dot_alias = function(e) {
}
return(x)
}
-if (!mult %chin% c("first","last","all")) stopf("mult argument can only be 'first', 'last' or 'all'")
+if (!mult %chin% c("first","last","all","error")) stop("mult argument can only be 'first', 'last', 'all' or 'error'")
Review comment (Member):

can this be moved to an antecedent PR as well?

missingroll = missing(roll)
if (length(roll)!=1L || is.na(roll)) stopf("roll must be a single TRUE, FALSE, positive/negative integer/double including +Inf and -Inf or 'nearest'")
if (is.character(roll)) {
@@ -505,6 +505,7 @@ replace_dot_alias = function(e) {
}
i = .shallow(i, retain.key = TRUE)
ans = bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult, ops, verbose=verbose)
+if (mult=="error") mult="all"
xo = ans$xo ## to make it available for further use.
# temp fix for issue spotted by Jan, test #1653.1. TODO: avoid this
# 'setorder', as there's another 'setorder' in generating 'irows' below...
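
The two hunks above accept `mult="error"` and, once `bmerge` has returned, treat it as `"all"`. A plausible reading is that the multiple-match check itself happens inside `bmerge`. The sketch below shows that reading in base R for a single key; `pick_matches` is a hypothetical helper, not part of data.table, and the error wording is invented for illustration.

```r
# Hypothetical helper: resolve the matches of one key in a vector of keys,
# following the four `mult` values accepted after this change.
pick_matches <- function(key, x_keys, mult = c("all", "first", "last", "error")) {
  mult <- match.arg(mult)
  hits <- which(x_keys == key)
  if (mult == "error") {
    # Assumed semantics: more than one match is an error ...
    if (length(hits) > 1L) stop("multiple matches found for key: ", key)
    # ... otherwise there is at most one match, so it behaves like "all".
    return(hits)
  }
  switch(mult,
    all   = hits,
    first = head(hits, 1L),
    last  = tail(hits, 1L)
  )
}

x_keys <- c(1L, 2L, 2L)
pick_matches(2L, x_keys)            # both positions of key 2
pick_matches(2L, x_keys, "first")   # only the first position
pick_matches(1L, x_keys, "error")   # unique match, no error raised
```

Under this reading, once the check has passed every key has at most one match, which would explain why downstream code can continue as if `mult` were `"all"`.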
@@ -526,13 +527,22 @@ replace_dot_alias = function(e) {
if (!byjoin || nqbyjoin) {
# Really, `anyDuplicated` in base is AWESOME!
# allow.cartesian shouldn't error if a) not-join, b) 'i' has no duplicates
-if (verbose) {last.started.at=proc.time();catf("Constructing irows for '!byjoin || nqbyjoin' ... ");flush.console()}
-irows = if (allLen1) f__ else vecseq(f__,len__,
-if (allow.cartesian ||
-notjoin || # #698. When notjoin=TRUE, ignore allow.cartesian. Rows in answer will never be > nrow(x).
-!anyDuplicated(f__, incomparables = c(0L, NA_integer_))) {
-NULL # #742. If 'i' has no duplicates, ignore
-} else as.double(nrow(x)+nrow(i))) # rows in i might not match to x so old max(nrow(x),nrow(i)) wasn't enough. But this limit now only applies when there are duplicates present so the reason now for nrow(x)+nrow(i) is just to nail it down and be bigger than max(nrow(x),nrow(i)).
+if (verbose) {last.started.at=proc.time();cat("Constructing irows for '!byjoin || nqbyjoin' ... ");flush.console()}
+irows = if (allLen1) f__ else {
+join.many = getOption("datatable.join.many") # #914, default TRUE for backward compatibility
+anyDups = if (!join.many && length(f__)==1L && len__==nrow(x)) {
+NULL # special case of scalar i match to const duplicated x, not handled by anyDuplicate: data.table(x=c(1L,1L))[data.table(x=1L), on="x"]
+} else if (!notjoin && ( # #698. When notjoin=TRUE, ignore allow.cartesian. Rows in answer will never be > nrow(x).
+!allow.cartesian ||
+!join.many))
+as.logical(anyDuplicated(f__, incomparables = c(0L, NA_integer_)))
+limit = if (!is.null(anyDups) && anyDups) { # #742. If 'i' has no duplicates, ignore
+if (!join.many) stop("Joining resulted in many-to-many join. Perform quality check on your data, use mult!='all', or set 'datatable.join.many' option to TRUE to allow rows explosion.")
+else if (!allow.cartesian && !notjoin) as.double(nrow(x)+nrow(i))
+else stop("internal error: checking allow.cartesian and join.many, unexpected else branch reached, please report to issue tracker") # nocov
+}
+vecseq(f__, len__, limit)
+} # rows in i might not match to x so old max(nrow(x),nrow(i)) wasn't enough. But this limit now only applies when there are duplicates present so the reason now for nrow(x)+nrow(i) is just to nail it down and be bigger than max(nrow(x),nrow(i)).
if (verbose) {cat(timetaken(last.started.at),"\n"); flush.console()}
# Fix for #1092 and #1074
# TODO: implement better version of "any"/"all"/"which" to avoid
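
The hunk above decides whether a row limit is passed to `vecseq`, based on `allow.cartesian`, `notjoin` and the new `datatable.join.many` option. To make the branches easier to follow, here is a stand-alone base R sketch of the same decision. `row_limit` is a hypothetical function, its arguments `f` and `len` are assumed to mirror `f__` and `len__`, and the error text is shortened.

```r
# Hypothetical stand-alone rendering of the limit decision from the diff.
# f and len are assumed to mirror f__ and len__ (per-row match start and count).
row_limit <- function(f, len, nrow_x, nrow_i,
                      allow.cartesian = FALSE, notjoin = FALSE, join.many = TRUE) {
  any_dups <- if (!join.many && length(f) == 1L && len == nrow_x) {
    NULL  # scalar i matching a constant duplicated x: skipped, as in the diff
  } else if (!notjoin && (!allow.cartesian || !join.many)) {
    # 0 and NA are excluded from the duplicate check via incomparables
    as.logical(anyDuplicated(f, incomparables = c(0L, NA_integer_)))
  }
  if (!is.null(any_dups) && any_dups) {
    if (!join.many) stop("many-to-many join detected")
    as.double(nrow_x + nrow_i)  # value handed on as the limit
  } else {
    NULL                        # no limit applies
  }
}

row_limit(c(1L, 2L, 3L), c(1L, 1L, 1L), 3L, 3L)                      # no duplicates: NULL
row_limit(c(1L, 1L), c(2L, 2L), 2L, 2L)                              # duplicates: limit of 4
row_limit(c(1L, 1L), c(2L, 2L), 2L, 2L, allow.cartesian = TRUE)      # check skipped: NULL
```

With `join.many = FALSE` the duplicate case stops with an error instead of returning a limit, which matches the new `stop()` branch in the diff.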