The motivation: since every incoming row already loads all rows with the matching join key from the other side's hash table, we can compute the degree within a reasonable time, because no I/O operation is involved.
The advantage of this approach is that we don't need to introduce an additional `degree` column, which keeps the hash table more consistent with the index used by LookupJoin.
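Since the matched side is already resident in memory during the probe, the degree can be computed by simply counting the cached rows under the join key. A minimal sketch, assuming a hypothetical in-memory cache keyed by join key (the names `OneSideCache`, `JoinKey`, and `Row` are illustrative, not the actual types):

```rust
use std::collections::HashMap;

// Illustrative types; not the project's real definitions.
type JoinKey = i64;
type Row = Vec<i64>;

struct OneSideCache {
    // join key -> all cached rows on this side sharing that key
    entries: HashMap<JoinKey, Vec<Row>>,
}

impl OneSideCache {
    // The degree is just the number of cached rows under the key.
    // Because the probe already fetched these rows, counting them
    // requires no additional I/O.
    fn degree(&self, key: JoinKey) -> usize {
        self.entries.get(&key).map_or(0, |rows| rows.len())
    }
}
```

This is why no separate `degree` column is needed on this side: the count is derivable from data the probe touches anyway.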
Alternatively, we could apply this to LookupJoin, but that would leave us with two different Join implementations.
To do so, we would need to populate both sides of the cache. This incurs an additional I/O operation, since we don't usually need to fetch the update side of the cache.
We actually need the degree data on the match side (which records how many matching rows exist on the update side) to decide whether an update is the first match, or removes the only remaining match, for that row.
Furthermore, with non-equi join conditions layered on top of an equi join, this requires re-evaluating the predicate for each row on the update side.
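The first-match/last-match decision above is what drives an outer join's output: when the degree transitions between zero and nonzero, the null-padded form of the match-side row must be retracted or re-emitted. A sketch of that decision, assuming a simple insert/delete change stream (`Op` and `needs_padding_flip` are hypothetical names, not the project's API):

```rust
// Hypothetical change-stream operation on the update side.
enum Op {
    Insert,
    Delete,
}

/// Returns true when an outer join must flip between the "matched" and
/// "null-padded" form of the match-side row. `degree_before` is the number
/// of update-side rows matching before this operation is applied.
fn needs_padding_flip(op: Op, degree_before: usize) -> bool {
    match op {
        // First match arrives: retract the previously emitted null-padded row.
        Op::Insert => degree_before == 0,
        // Only remaining match is removed: re-emit the null-padded row.
        Op::Delete => degree_before == 1,
    }
}
```

Semi/anti joins consume the same transition: a semi join emits the row on the 0→1 transition and retracts it on 1→0, while an anti join does the opposite.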
I found a relevant bug: when we update a row's degree in `JoinEntry`, we do not flush the change to S3. This does not affect inner joins, but it does affect outer joins and semi/anti joins, which explains why those queries are affected.
As for inner joins, since they do not make use of row degrees, perhaps we can simply skip the update, as the write is unnecessary.
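That optimization could be expressed as a flush policy keyed on the join type: only join types that read degrees downstream pay for persisting them. A sketch under that assumption (`JoinType` variants and `should_flush_degree` are illustrative, not the actual code):

```rust
// Illustrative subset of join types; not the project's real enum.
enum JoinType {
    Inner,
    LeftOuter,
    Semi,
    Anti,
}

/// Only persist degree updates for join types that consume them.
/// Inner joins never read the degree, so the flush can be skipped.
fn should_flush_degree(join_type: &JoinType) -> bool {
    !matches!(join_type, JoinType::Inner)
}
```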
Originally posted by @jon-chuang in #2495 (comment)