```python
import numpy as np
import janitor
import pandas as pd


def east_west():
    num_rows_left, num_rows_right = 5_000_000, 5_000_000
    rng = np.random.default_rng(42)

    # Generate two separate datasets where revenue/cost are linearly related to
    # duration/time, but add some noise to the west table so that there are some
    # rows where the cost for the same or greater time will be less than the east table.
    east_dur = rng.integers(1_000, 10_000_000, num_rows_left)
    east_rev = (east_dur * 0.123).astype(np.int32)
    west_time = rng.integers(1_000, 500_000, num_rows_right)
    west_cost = west_time * 0.123
    west_cost += rng.normal(0.0, 1.0, num_rows_right)
    west_cost = west_cost.astype(np.int32)

    east = pd.DataFrame(
        {
            "id": np.arange(0, num_rows_left),
            "dur": east_dur,
            "rev": east_rev,
            "cores": rng.integers(1, 10, num_rows_left),
        }
    )
    west = pd.DataFrame(
        {
            "t_id": np.arange(0, num_rows_right),
            "time": west_time,
            "cost": west_cost,
            "cores": rng.integers(1, 10, num_rows_right),
        }
    )
    return east, west


east, west = east_west()
```
```python
%timeit east.conditional_join(west, ('dur', 'time', '<='), ('rev', 'cost', '>='), ('id', 't_id', '>'), use_numba=False)
2.61 s ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit east.conditional_join(west, ('dur', 'time', '<='), ('rev', 'cost', '>='), ('id', 't_id', '>'), use_numba=True)
11.3 s ± 55.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
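For reference, the three condition tuples above select the same pairs as a cross join filtered by `dur <= time`, `rev >= cost`, and `id > t_id`. A minimal pure-pandas sketch on tiny hypothetical frames (illustration only — `conditional_join` avoids materializing the cross product, which would be infeasible at 5M × 5M rows):

```python
import pandas as pd

# Tiny hypothetical frames, just to show the join semantics.
east = pd.DataFrame({"id": [0, 1, 2], "dur": [10, 10, 40], "rev": [9, 9, 5]})
west = pd.DataFrame({"t_id": [0, 1], "time": [15, 50], "cost": [8, 6]})

# Brute-force equivalent: cross join, then filter on all three predicates.
out = (
    east.merge(west, how="cross")
    .query("dur <= time and rev >= cost and id > t_id")
    .reset_index(drop=True)
)
# Only the pair (id=1, t_id=0) satisfies all three conditions here.
```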
The speed difference is large; worth looking into.
Can we get some gains via the join-optimizer idea from the DuckDB blog? At the very least, could we get some gains via early filtering based on min/max statistics?
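A minimal sketch of what that min/max pre-filter could look like, assuming a hypothetical `prefilter` helper applied to the two range conditions (`dur <= time`, `rev >= cost`) before the join proper; the strict `id > t_id` condition is left to the join itself:

```python
import numpy as np
import pandas as pd


def prefilter(east: pd.DataFrame, west: pd.DataFrame) -> pd.DataFrame:
    # An east row can only match some west row if:
    #   dur <= time  for some west row  =>  dur <= max(west.time)
    #   rev >= cost  for some west row  =>  rev >= min(west.cost)
    # Rows failing either bound can never appear in the join output,
    # so they can be dropped before the expensive pairwise evaluation.
    mask = (east["dur"] <= west["time"].max()) & (east["rev"] >= west["cost"].min())
    return east.loc[mask]


# Tiny hypothetical example: rows 1 and 2 are pruned by the stats alone.
east = pd.DataFrame({"id": [0, 1, 2], "dur": [10, 500, 900], "rev": [100, 5, 80]})
west = pd.DataFrame({"t_id": [0, 1], "time": [600, 700], "cost": [50, 60]})
filtered = prefilter(east, west)
```

The same bounds work symmetrically on the west table; whether the pruning pays off depends on how much each side's value ranges overlap.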
adapted from here