There is an optimization opportunity in GpuShuffleCoalesceIterator: we currently acquire the GPU semaphore before we concat on the host, so the semaphore is held during work that does not need the GPU.
The cuDF call we currently make is JCudfSerialization.concatToContiguousTable, but it is trivial to split up, since what it calls inside of cuDF is already public. The proposal is to:
1. Without holding the semaphore, call JCudfSerialization.concatToHostBuffer to get a HostConcatResult.
2. Acquire the semaphore.
3. Get the contiguous table on the GPU by calling HostConcatResult.toContiguousTable.
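The ordering of the three steps above can be sketched as follows. This is only an illustration under stated assumptions: it does not call cuDF or the plugin. CoalesceSketch, HostResult, concatOnHost, toDevice, and the events log are hypothetical stand-ins for the iterator, HostConcatResult, concatToHostBuffer, and toContiguousTable, and a plain java.util.concurrent.Semaphore stands in for the GPU semaphore.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: none of these names are cuDF or plugin APIs.
class CoalesceSketch {
    // Stand-in for the GPU semaphore (one permit = one task on the GPU).
    static final Semaphore gpuSemaphore = new Semaphore(1);
    // Records whether the semaphore was held during each phase.
    static final List<String> events = new ArrayList<>();

    static boolean held() {
        return gpuSemaphore.availablePermits() == 0;
    }

    // Stand-in for HostConcatResult: concatenated data living in host memory.
    static final class HostResult {
        final byte[] data;
        HostResult(byte[] data) { this.data = data; }

        // Stand-in for HostConcatResult.toContiguousTable: the only GPU step.
        byte[] toDevice() {
            events.add("upload held=" + held());
            return data.clone();
        }
    }

    // Stand-in for JCudfSerialization.concatToHostBuffer: host-only work.
    static HostResult concatOnHost(List<byte[]> parts) {
        events.add("host-concat held=" + held());
        int total = 0;
        for (byte[] p : parts) total += p.length;
        byte[] out = new byte[total];
        int off = 0;
        for (byte[] p : parts) {
            System.arraycopy(p, 0, out, off, p.length);
            off += p.length;
        }
        return new HostResult(out);
    }

    static byte[] coalesce(List<byte[]> parts) {
        HostResult host = concatOnHost(parts);   // 1. no semaphore held
        gpuSemaphore.acquireUninterruptibly();   // 2. acquire the semaphore
        try {
            return host.toDevice();              // 3. move the data to the GPU
        } finally {
            // Released here only to keep the sketch re-runnable; how long a
            // real task keeps the semaphore is outside the scope of this sketch.
            gpuSemaphore.release();
        }
    }

    public static void main(String[] args) {
        List<byte[]> parts = new ArrayList<>();
        parts.add(new byte[] {1, 2});
        parts.add(new byte[] {3});
        coalesce(parts);
        System.out.println(events);
    }
}
```

The point of the ordering is that other tasks can use the GPU while this task is still doing the host-side concat.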
Results with this change are promising: summed across all queries, it saves about 1 minute of runtime. Most queries are above the 1x line. The queries at or below 0.9 were q52, q46, q68, q45, and q42. When executed multiple times they all went above 1x (these are single-digit-second queries).
An even better change may be for us to delay putting these batches on the GPU until we really need them. This would be specific to joins, where the build side could rest on the GPU for a while as we wait for the stream side. In these cases what you almost want is to concatToHostBuffer, but keep that HostConcatResult around and not need to acquire the GPU at all until the stream side is also ready to go.
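A minimal sketch of that deferred idea, assuming a hypothetical wrapper that keeps the concatenated data on the host and only takes the semaphore on first use. DeferredUploadSketch, DeferredBatch, isOnDevice, and get are invented for illustration and are not cuDF or plugin APIs; a byte array stands in for the HostConcatResult and a java.util.concurrent.Semaphore stands in for the GPU semaphore.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: none of these names are cuDF or plugin APIs.
class DeferredUploadSketch {
    // Stand-in for the GPU semaphore.
    static final Semaphore gpuSemaphore = new Semaphore(1);

    // Holds host-side concatenated data and uploads it lazily.
    static final class DeferredBatch {
        private final byte[] hostData;  // stand-in for a kept HostConcatResult
        private byte[] deviceData;      // null until the batch is first needed

        DeferredBatch(byte[] hostData) {
            this.hostData = hostData;
        }

        boolean isOnDevice() {
            return deviceData != null;
        }

        // First call acquires the semaphore and "uploads"; later calls reuse it.
        byte[] get() {
            if (deviceData == null) {
                gpuSemaphore.acquireUninterruptibly();
                try {
                    deviceData = hostData.clone();
                } finally {
                    // Released only to keep the sketch self-contained.
                    gpuSemaphore.release();
                }
            }
            return deviceData;
        }
    }

    public static void main(String[] args) {
        DeferredBatch buildSide = new DeferredBatch(new byte[] {1, 2, 3});
        // The build side stays on the host while the stream side is prepared.
        System.out.println("on device before use: " + buildSide.isOnDevice());
        buildSide.get();
        System.out.println("on device after use: " + buildSide.isOnDevice());
    }
}
```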
> In these cases what you almost want is to concatToHostBuffer, but keep that HostConcatResult around and not need to acquire the GPU at all until the stream side is also ready to go.
This is a join-specific optimization, and I don't think we should get too far ahead of ourselves there. The optimization described above applies to every case where we do a host-side concat, so I think it is a good optimization as-is. We can always add a more sophisticated optimization for the shuffle-into-a-join case later, but let's get this one done first.