Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

didn't flush changes to s3 when update row degree #2794

Closed
yuhao-su opened this issue May 24, 2022 · 3 comments
Closed

didn't flush changes to s3 when update row degree #2794

yuhao-su opened this issue May 24, 2022 · 3 comments
Assignees
Labels
type/bug Something isn't working

Comments

@yuhao-su
Copy link
Contributor

I found a relevant bug. It is that when we update row degree in JoinEntry, we do not flush the changes to S3.

This does not affect inner join, but it affects outer joins and semi/anti join. This explains why these queries are affected.

As for the case of inner join, since it does not make use of row counts, perhaps we can simply not update it as the write is unnecessary...

Originally posted by @jon-chuang in #2495 (comment)

@yuhao-su yuhao-su added the type/bug Something isn't working label May 24, 2022
@yuhao-su yuhao-su self-assigned this May 24, 2022
@yuhao-su
Copy link
Contributor Author

Better do it after #2795 to avoid adding things to those that will soon be removed.

@fuyufjh
Copy link
Member

fuyufjh commented Jun 9, 2022

Is it better to store the degree in memory only?

The motivation is, since every incoming row would populate all the data with the specified join key of the hash table on the other side, we can calculate degree within a reasonable time, because there is no IO operation during this.

The advantage of this approach is that we don't need to introduce an additional column degree, which makes the hash table more consistent with the index of LookupJoin.

Alternatively, we can apply this for LookupJoin, but in this way there will be 2 different Join implementations.

@jon-chuang
Copy link
Contributor

jon-chuang commented Jun 10, 2022

To do so, we would need to populate both sides of the cache.

This is actually an additional I/O operation since we don't usually need to fetch the update side of the cache.

We actually need the degree data on the match side (which represents how many existing rows are in the update side), to decide whether the update is the first match/only remaining match for that row.

Further, in case of non-eq join conditions layered on top of an eq join, this requires re-evaluating the predicate for each row on the update side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants