Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Left Join becomes Inner Join for inequality conditions #190

Open
flcong opened this issue Jun 9, 2021 · 9 comments · May be fixed by #223
Open

Left Join becomes Inner Join for inequality conditions #190

flcong opened this issue Jun 9, 2021 · 9 comments · May be fixed by #223

Comments

@flcong
Copy link

flcong commented Jun 9, 2021

Code:

import pandas as pd
import numpy as np
from dask.distributed import Client
from dask_sql import Context
client = Client()
cont = Context()

df1 = pd.DataFrame({
    'dated': pd.date_range(pd.Timestamp('2021-01-01'), pd.Timestamp('2021-01-10')),
    'var1': np.ones(10)})
df2 = pd.DataFrame({
    'startdate': [pd.Timestamp('2020-12-30'), pd.Timestamp('2021-01-09')],
    'enddate': [pd.Timestamp('2021-01-03'), pd.Timestamp('2021-01-20')],
    'var2': np.array([2.0, 3.0])})

cont.create_table('df1', df1)
cont.create_table('df2', df2)

df3 = cont.sql(
    """select a.*, b.var2
    from df1 a left join df2 b
    on b.startdate<=a.dated and a.dated<=b.enddate""").compute()

Results:

  • df1:
       dated  var1
0 2021-01-01   1.0
1 2021-01-02   1.0
2 2021-01-03   1.0
3 2021-01-04   1.0
4 2021-01-05   1.0
5 2021-01-06   1.0
6 2021-01-07   1.0
7 2021-01-08   1.0
8 2021-01-09   1.0
9 2021-01-10   1.0
  • df2:
   startdate    enddate  var2
0 2020-12-30 2021-01-03   2.0
1 2021-01-09 2021-01-20   3.0
  • df3:
        dated  var1  var2
0  2021-01-01   1.0   2.0
2  2021-01-02   1.0   2.0
4  2021-01-03   1.0   2.0
17 2021-01-09   1.0   3.0
19 2021-01-10   1.0   3.0

This is an Inner Join, not Left Join.

The correct output should be as follows, using sqlite3:

import sqlite3
# Connect database
conn = sqlite3.connect(':memory:')
df1.to_sql('df1', conn, index=False)
df2.to_sql('df2', conn, index=False)

df3 = pd.read_sql_query(
    """select a.*, b.var2
    from df1 a left join df2 b
    on b.startdate<=a.dated and a.dated<=b.enddate""", 
    conn, 
    parse_dates=['dated'])

where df3 is

       dated  var1  var2
0 2021-01-01   1.0   2.0
1 2021-01-02   1.0   2.0
2 2021-01-03   1.0   2.0
3 2021-01-04   1.0   NaN
4 2021-01-05   1.0   NaN
5 2021-01-06   1.0   NaN
6 2021-01-07   1.0   NaN
7 2021-01-08   1.0   NaN
8 2021-01-09   1.0   3.0
9 2021-01-10   1.0   3.0
@nils-braun
Copy link
Collaborator

Hi @flcong
again thanks for the report and the nice reproducible example. This is clearly a bug. The way Dask-SQL currently handles non-equi joins is to do a cross-join (every row with every row) and then apply the filter. During this process, we turn every join to an inner join - which is wrong.
Would you like to help solving the problem? I would be very happy for your help. I can give you some guidance on how to start development. If not, I will definitely have a look.

@flcong
Copy link
Author

flcong commented Jun 9, 2021

Yeah. I can take a look tomorrow. It would be helpful to have some guidance for developers.

@nils-braun
Copy link
Collaborator

Super cool! Here are some first pointers (but of course, feel free to ask for more information):

  1. I would recommend to install dask-sql in development mode, build the JAR and run the tests once
  2. The code you are interested is in join.py, especially the second else case here. I think that the cross-join itself should work as expected (but that would need to be checked) - but the filter here might break stuff.
  3. The way I typically debug such things, is to write a small test case and run this repeatedly - but please note that due to the lazy evaluation in dask you need to include a print(X.compute()) if you want to understand more about X (or do this via debugging).

I think what currently happens is the following: I will use a very simple example with the first dataframe [1, 2, 3] and the second one [4, 5, 6]. After the cross join, we end up with all combinations, like [1, 4], [1, 5], ... [3, 6]. Then we apply the join condition as a filter, but there is just no entries [1, NULL] (which would be needed for a LEFT join). Honestly, I am not 100% sure how we would solve this best...

@flcong
Copy link
Author

flcong commented Jun 9, 2021

Thank you for the information. I'll look into it.

I'm very glad about this project. Previously, I wrote a small function to add pandas DataFrames to sqlite3 server and then run queries using sqlite3, but it turns out to be very slow for large data sets. With the development of this project, maybe some day I will finally give up SAS completely.

@nils-braun
Copy link
Collaborator

I am glad it helps you :-)

@flcong
Copy link
Author

flcong commented Jun 10, 2021

Hi. I followed the steps to install dask-sql in development mode, including installation of JDK and maven and compiled Java classes. I'm on Win10 64-bit.

I encountered a series of Windows fatal exception: access violation while running pytest on test_join.py. The test is passed, but a lot of errors are thrown. It seems irrelevant to fixing the bug for joins, right? Thank you.

================================================= test session starts =================================================
platform win32 -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: C:\Users\flcon\Documents\GitHubRepo\dask-sql, configfile: pytest.ini
plugins: cov-2.12.1
collected 8 items

test_join.py Windows fatal exception: access violation

Thread 0x00011830 (most recent call first):
  File "C:\Programs\miniconda3\envs\dask-sql\lib\concurrent\futures\thread.py", line 78 in _worker
  File "C:\Programs\miniconda3\envs\dask-sql\lib\threading.py", line 870 in run
  File "C:\Programs\miniconda3\envs\dask-sql\lib\threading.py", line 932 in _bootstrap_inner
  File "C:\Programs\miniconda3\envs\dask-sql\lib\threading.py", line 890 in _bootstrap

Current thread 0x00014938 (most recent call first):
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\jpype\_core.py", line 221 in startJVM
  File "C:\Users\flcon\Documents\GitHubRepo\dask-sql\dask_sql\java.py", line 73 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "C:\Users\flcon\Documents\GitHubRepo\dask-sql\dask_sql\utils.py", line 18 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "C:\Users\flcon\Documents\GitHubRepo\dask-sql\dask_sql\input_utils\convert.py", line 9 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "C:\Users\flcon\Documents\GitHubRepo\dask-sql\dask_sql\input_utils\__init__.py", line 1 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
  File "C:\Users\flcon\Documents\GitHubRepo\dask-sql\dask_sql\context.py", line 13 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "C:\Users\flcon\Documents\GitHubRepo\dask-sql\dask_sql\cmd.py", line 18 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "C:\Users\flcon\Documents\GitHubRepo\dask-sql\dask_sql\__init__.py", line 2 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 848 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 961 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "C:\Users\flcon\Documents\GitHubRepo\dask-sql\tests\integration\fixtures.py", line 48 in user_table_nan
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 932 in call_fixture_func
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 1126 in pytest_fixture_setup
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 84 in <lambda>
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 93 in _hookexec
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\hooks.py", line 286 in __call__
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 1072 in execute
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 687 in _compute_fixture_value
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 601 in _get_active_fixturedef
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 1048 in execute
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 687 in _compute_fixture_value
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 601 in _get_active_fixturedef
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 581 in getfixturevalue
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\fixtures.py", line 568 in _fillfixtures
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\python.py", line 1647 in setup
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\runner.py", line 449 in prepare
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\runner.py", line 150 in pytest_runtest_setup
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 84 in <lambda>
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 93 in _hookexec
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\hooks.py", line 286 in __call__
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\runner.py", line 255 in <lambda>
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\runner.py", line 311 in from_call
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\runner.py", line 254 in call_runtest_hook
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\runner.py", line 215 in call_and_report
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\runner.py", line 120 in runtestprotocol
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\runner.py", line 109 in pytest_runtest_protocol
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 84 in <lambda>
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 93 in _hookexec
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\hooks.py", line 286 in __call__
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\main.py", line 348 in pytest_runtestloop
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 84 in <lambda>
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 93 in _hookexec
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\hooks.py", line 286 in __call__
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\main.py", line 323 in _main
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\main.py", line 269 in wrap_session
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\main.py", line 316 in pytest_cmdline_main
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\callers.py", line 187 in _multicall
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 84 in <lambda>
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\manager.py", line 93 in _hookexec
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\pluggy\hooks.py", line 286 in __call__
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\config\__init__.py", line 162 in main
  File "C:\Programs\miniconda3\envs\dask-sql\lib\site-packages\_pytest\config\__init__.py", line 185 in console_main
  File "C:\Programs\miniconda3\envs\dask-sql\Scripts\pytest-script.py", line 10 in <module>
Windows fatal exception: access violation

@nils-braun
Copy link
Collaborator

Thanks for getting back to me! I see those also on the GitHub windows build, but as I am working with Linux I am not an expert in Windows (and debugging on a GitHub worker is quite tedious).
As far as I have seen, they do no harm.

@nils-braun
Copy link
Collaborator

Hi @flcong! Did you have time to look into the issue with the joins further? Is there anything I can help you with?

@flcong
Copy link
Author

flcong commented Aug 19, 2021

Hi, @nils-braun . Yeah, I've finished editing the join.py file to make it work, but I have not fully tested it yet. I plan to add more unit tests in the next week.

@flcong flcong changed the title Full Join becomes Inner Join for inequality conditions Left Join becomes Inner Join for inequality conditions Aug 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants