advanced questions for `join` tests #18

jangorecki · 2018-08-07T06:10:33Z

st-pasha · 2018-10-24T17:28:44Z

We will also need a join on multiple columns (similar to multi-column group and sort).

jangorecki · 2019-02-19T13:57:00Z

I pushed a draft of the join questions.
The data has 3 factor id columns (2 unique, 1 with duplicates), 3 integer id columns (2 unique, 1 with duplicates), and 1 double column.
The list initially discussed at H2O World:

inner, singlecol, integer, big-big
inner, singlecol, integer, big-medium
inner, singlecol, integer, big-small
outer, singlecol, integer, big-medium
inner, singlecol, factor, big-medium
inner, multicol, integer, big-medium
inner, singlecol, integer, big-medium, update on join

The list did not cover cardinality/duplicates. At the moment, none of the fields used in the joins have duplicates. We should consider adding questions that join on fields containing duplicates. The data is ready for that.

db-benchmark/join-datagen.R

Lines 26 to 34 in 00c8ae2

    
    DT = data.table(
      id1 = sample(all_levels[1:N], N),                  # factor unique continuous range
      id2 = sample(all_levels, N),                       # factor unique
      id3 = all_levels[some_dups(N, 0.1)],               # factor 0.1 dups
      id4 = sample(N*1.5, N),                            # int unique continuous range
      id5 = sample(N*2, N),                              # int unique
      id6 = some_dups(N, 0.1),                           # int 0.1 dups
      v1 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric
    )
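To illustrate why duplicates matter for the join questions, here is a minimal sketch of what a join on a key with duplicates on both sides does to the row count. The toy tables `x` and `y` and their columns are made up for this example and are not the benchmark's data; `allow.cartesian = TRUE` is set explicitly because data.table can refuse joins whose result grows beyond a size threshold.

```r
library(data.table)

# Toy tables (hypothetical): id 1 is duplicated on both sides.
x = data.table(id = c(1L, 1L, 2L), v1 = c(10, 11, 20))
y = data.table(id = c(1L, 1L, 2L), v2 = c(1, 2, 3))

# Detect duplicated join keys before joining.
x_has_dups = anyDuplicated(x$id) > 0L
y_has_dups = anyDuplicated(y$id) > 0L

# Each of the 2 rows with id 1 in y matches 2 rows in x (4 rows),
# plus 1 row for id 2, so the result has more rows than either input.
joined = x[y, on = "id", allow.cartesian = TRUE]
n_joined = nrow(joined)
```

With unique keys the result can never exceed the size of the larger input, so a duplicates question would stress a different code path than the current ones.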

jangorecki · 2019-05-03T10:44:00Z

Of the 7 questions proposed above, 5 will be categorised as basic, testing mostly scalability; the remaining 2 plus 3 extra will be categorised as advanced, testing features. This split is for consistency with the groupby task and for plotting results with benchplot.

# basic
join to small inner on int
join to medium inner on int
join to medium outer on int
join to medium inner on factor
join to big inner on int
# advanced
join to medium inner on int int
join to medium update on int
join to medium aggregate on int
join to medium rolling on int
something well stressing (row explosion join? non-equi join?)
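The advanced question types listed above can be sketched in data.table roughly as follows. This is a minimal illustration on toy tables, not the benchmark's actual queries: the tables and columns (`x`, `lookup`, `events`, `id`, `v1`, `v2`) are made up here, and the exact semantics of each benchmark question are not fixed by this thread.

```r
library(data.table)

# Toy tables (hypothetical names and values).
x      = data.table(id = c(1L, 2L, 3L, 5L), v1 = c(10, 20, 30, 50))
lookup = data.table(id = c(1L, 2L, 4L),     v2 = c(1, 2, 4))
events = data.table(id = c(1L, 2L, 2L, 4L), v2 = c(1, 2, 3, 4))

# Update on join: add v2 to a copy of x by reference, only where ids match.
# Rows of x without a match in lookup keep NA.
xu = copy(x)
xu[lookup, on = "id", v2 := i.v2]

# Aggregate on join: for each row of x, sum v2 over the matching rows of
# events, without materialising the full join result first.
agg = events[x, on = "id", .(total = sum(v2)), by = .EACHI, nomatch = NULL]

# Rolling join: id 4 is absent from x, so the previous id (3) is rolled forward.
rolled = x[data.table(id = 4L), on = "id", roll = TRUE]
```

Other solutions may have no direct equivalent of some of these (rolling joins in particular), which is part of what the advanced category would expose.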

jangorecki · 2019-08-21T10:22:44Z

Note to fix the chk produced by spark, juliadf and maybe others. As of now they produce a chk of 0, so the solution_chk check in the answers-validation.R script fails.
A workaround has been introduced in

db-benchmark/report.R

Line 51 in 39fee2f

    
           ][task=="join" & batch<=1566379460, "chk":=NA_character_ # solution_chk fails in answers-validation.R script, update batch id when join scripts amended to produce chk of same length (number of ';')

It should be removed once chk is amended.
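The "same length (number of ';')" condition mentioned in the workaround comment could be checked along these lines. This is only a sketch under assumptions: the helper `count_chk_fields` and the example chk strings are hypothetical, and the actual logic in answers-validation.R is not shown in this thread.

```r
# Hypothetical helper: number of ';'-separated fields in each chk string.
count_chk_fields = function(chk) lengths(strsplit(chk, ";", fixed = TRUE))

# Made-up chk values: one solution reports two fields, another reports a bare 0.
chk = c(solution_a = "123.45;678.9", solution_b = "0")

n_fields = count_chk_fields(chk)

# chk values are comparable only if every solution reports the same number of fields.
consistent = length(unique(n_fields)) == 1L
```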

jangorecki · 2019-10-31T06:53:21Z

The join task for the 5 basic questions has been implemented.
The design of the datasets for join is explained in #106.
As of now, clickhouse is the only solution for which the join task has not yet been added.
Remaining items in scope of this issue:

add clickhouse: ClickHouse join task #137
where possible (spark, pydatatable, dask?) and necessary (1e9) solution could use on-disk data
add 5 advanced questions

join,medium inner on int int,advanced
join,medium update on int,advanced
join,medium aggregate on int,advanced
join,medium rolling on int,advanced
join,big non-equi aggregate on int int int,advanced
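For the last item, a non-equi aggregate join could look roughly like the sketch below in data.table. The tables and columns (`x`, `ranges`, `lo`, `hi`) are hypothetical, and the sketch uses one range condition on a single column rather than the three integer columns named in the question, whose exact definition is not given here.

```r
library(data.table)

# Toy tables (hypothetical): values in x, inclusive id ranges in ranges.
x      = data.table(id = c(1L, 2L, 3L, 5L), v1 = c(10, 20, 30, 50))
ranges = data.table(lo = c(1L, 3L), hi = c(2L, 5L))

# Non-equi join: match each range to all x rows with lo <= id <= hi,
# and aggregate v1 per range while joining.
agg = x[ranges, on = .(id >= lo, id <= hi), .(total = sum(v1)), by = .EACHI]
```

Because each range can match many rows, this kind of join can also serve as the row-explosion stress case when the aggregation is dropped.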

Add Datafusion solution

jangorecki changed the title ~~add big-to-small join scenario~~ extend join tests Sep 8, 2018

jangorecki self-assigned this Oct 29, 2018

This was referenced Dec 15, 2018

add update on join task #24

Closed

Extend 2 billion row benchmarks e.g. memory usage, sorting, joining, by-reference Rdatatable/data.table#2

Closed

add join scenario to join on factor column(s) #21

Closed

jangorecki mentioned this issue May 3, 2019

[WIP] Don't use gc in pandas join benchmark. #84

Closed

jangorecki added the new task label Sep 16, 2019

jangorecki added this to the 2.1.0 milestone May 15, 2020

jangorecki removed this from the 2.1.0 milestone Nov 17, 2020

jangorecki changed the title ~~extend join tests~~ advanced questions for join tests Nov 17, 2020

jangorecki modified the milestones: 2.2.0, 2.3.0 May 20, 2021

Tmonster pushed a commit to Tmonster/db-benchmark that referenced this issue Jul 4, 2023

Add DataFusion solution (h2oai#18)

6577d9d

