Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

advanced questions for join tests #18

Open
9 of 15 tasks
jangorecki opened this issue Aug 7, 2018 · 5 comments
Open
9 of 15 tasks

advanced questions for join tests #18

jangorecki opened this issue Aug 7, 2018 · 5 comments
Assignees
Labels
Milestone

Comments

@jangorecki
Copy link
Contributor

jangorecki commented Aug 7, 2018

Presently join tests are made on 2 integer columns tables, equal size, inner join on single column. It is because it was difficult to achieve good random numbers for 1e10 datasets used before. Now we won't go beyond 1e9 so we can easily use another set of data.
Based on the questions we want to answer in this tests we will pick/generate expected datasets.
Initial list of queries we might want to test listed below. We need to chose those which we want to have included in first iteration, rest will be left for future extensions. My picks are as follows.

Types of queries:

  • left/right outer
  • inner
  • full outer
  • join on multiple columns
  • non-equi join
  • lookup (update on join add update on join task #24)
  • lookup from multiple tables
  • row explosion join (some multiple matches, partial cross join)
  • cross join (full cartesian product)
  • temporal join

Types of fields:

Sizes of datasets:

  • big to big (1e9-1e9)
  • big to medium (1e9-1e6)
  • big to small (1e9-1e3)

Using different datasets will heavily complicate presenting benchmarks results (as this is another dimension to present on report). We can think how to overcome that.
Also we need to wisely choose subset of queries/fields/sizes as my current selection 4*2*3 gives 24 different questions, this multiply by 3 (1e7, 1e8, 1e9) and we have 72 tests. While current groupby tests has only 5*3 = 15 tests.
@mattdowle

@jangorecki jangorecki changed the title add big-to-small join scenario extend join tests Sep 8, 2018
@st-pasha
Copy link

We will also need a join on multiple columns (similar to multi-column group and sort).

@jangorecki
Copy link
Contributor Author

I pushed draft of join questions.
Data is 3 id factor (2 unique, 1 dups), 3 id int (2 unique, 1 dups), 1 double.
The list of initially discussed on H2O World:

  • inner, singlecol, integer, big-big
  • inner, singlecol, integer, big-medium
  • inner, singlecol, integer, big-small
  • outer, singlecol, integer, big-medium
  • inner, singlecol, factor, big-medium
  • inner, multicol, integer, big-medium
  • inner, singlecol, integer, big-medium, update on join

The list did not covered the cardinality/duplicates. At the current moment all fields used in join have no duplicates. We should consider adding questions for joining on fields that contains duplicates. Data is ready for that.

DT = data.table(
id1 = sample(all_levels[1:N], N), # factor unique continuous range
id2 = sample(all_levels, N), # factor unique
id3 = all_levels[some_dups(N, 0.1)], # factor 0.1 dups
id4 = sample(N*1.5, N), # int unique continuous range
id5 = sample(N*2, N), # int unique
id6 = some_dups(N, 0.1), # int 0.1 dups
v1 = sample(round(runif(100,max=100),4), N, TRUE) # numeric
)

@jangorecki
Copy link
Contributor Author

jangorecki commented May 3, 2019

From the 7 questions proposed above, 5 are going to be categorised as basic, testing mostly scalability, the rest plus 3 extra will be categorised advanced, testing features. For consistency with groupby task, and plotting results with benchplot.

# basic
join to small inner on int
join to medium inner on int
join to medium outer on int
join to medium inner on factor
join to big inner on int
# advanced
join to medium inner on int int
join to medium update on int
join to medium aggregate on int
join to medium rolling on int
something well stressing (row explosion join? non-equi join?)

@jangorecki
Copy link
Contributor Author

jangorecki commented Aug 21, 2019

note to fix chk produced by spark, juliadf and maybe others. as of now they produce chk having 0 so answers-validation.R script solution_chk check is failing.
Workaround has been introduced in

][task=="join" & batch<=1566379460, "chk":=NA_character_ # solution_chk fails in answers-validation.R script, update batch id when join scripts amended to produce chk of same length (number of ';')

should be removed when chk amended.

@jangorecki
Copy link
Contributor Author

jangorecki commented Oct 31, 2019

join task for 5 basic questions has been implemented.
design of datasets for join is explained in #106
as of now join task was not yet added only for clickhouse.
remaining items in scope of this issue:

  • add clickhouse ClickHouse join task #137
  • where possible (spark, pydatatable, dask?) and necessary (1e9) solution could use on-disk data
  • add 5 advanced questions
join,medium inner on int int,advanced
join,medium update on int,advanced
join,medium aggregate on int,advanced
join,medium rolling on int,advanced
join,big non-equi aggregate on int int int,advanced

@jangorecki jangorecki added this to the 2.1.0 milestone May 15, 2020
@jangorecki jangorecki removed this from the 2.1.0 milestone Nov 17, 2020
@jangorecki jangorecki changed the title extend join tests advanced questions for join tests Nov 17, 2020
@jangorecki jangorecki modified the milestones: 2.2.0, 2.3.0 May 20, 2021
Tmonster pushed a commit to Tmonster/db-benchmark that referenced this issue Jul 4, 2023
Add Datafusion solution
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants