👥 Compare data loading between kvikIO and Zarr engine #7
Benchmark results between the Zarr and kvikIO engine were too close for one epoch, so looping over 10 epochs and reporting the average instead. Not printing the MSE Loss anymore to declutter the console output.
Jupyter Interactive Widgets! Repo at https://github.com/jupyter-widgets/ipywidgets
Will be reusing some of this code in a Jupyter Notebook, so refactoring to use tqdm.auto instead of standard tqdm.
Save the time taken to complete each epoch, and compute the median, mean and standard deviation across all epochs. Needed because the time to process one epoch can vary by a few seconds across the ten epochs depending on various factors (e.g. caching), so computing the average time as total_time / num_epochs can lead to misleading results. Also updated main README.md to be more specific about the reported total/median/mean/std benchmark times and the size of the ERA5 subset dataset.
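A minimal sketch of the per-epoch timing idea described above, using only the standard library. The `run_epoch` callable and the example timings are hypothetical placeholders, not the benchmark script's actual code or measured values.

```python
import statistics
import time


def time_epochs(run_epoch, num_epochs=10):
    """Return a list with the wall-clock seconds taken by each epoch."""
    epoch_times = []
    for _ in range(num_epochs):
        start = time.perf_counter()
        run_epoch()  # hypothetical callable that loads one epoch of data
        epoch_times.append(time.perf_counter() - start)
    return epoch_times


def summarize(epoch_times):
    """Total, median, mean and sample standard deviation of epoch times."""
    return {
        "total": sum(epoch_times),
        "median": statistics.median(epoch_times),
        "mean": statistics.mean(epoch_times),
        "std": statistics.stdev(epoch_times),  # sample std (n - 1 divisor)
    }


# Made-up timings (not measured): one slow first epoch, then steady ones.
example_times = [30.0, 12.0, 12.0, 12.0, 12.0]
stats = summarize(example_times)
print(stats)
```

With a single slow epoch in the list, the mean is pulled well above the typical epoch time while the median is not, which is the kind of distortion that total_time / num_epochs would hide.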
Reporting the actual numbers on which is faster - kvikIO or Zarr! Reusing some code from 1_benchmark_kvikIOzarr.py, but now the total/median/mean/std times can be displayed. Final cell calculates the speedup of kvikIO to be ~20% over Zarr, but note that this speedup can actually fluctuate depending on lots of factors (have seen values from 10%-30% over multiple runs).
Statistical data visualization in Python!
A bar plot (with error bars) to visually compare kvikio (with GPUDirect Storage) against the zarr (no GPUDirect Storage) xarray backend engines in terms of data loading speed. Speedup results still fluctuate between runs, but are mostly around the 20% mark. Also did some slight refactoring to use pandas instead of numpy for the mean/median/std calculations. Using ddof=1 for the standard deviation.
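For context on the ddof=1 remark: pandas' `std` defaults to ddof=1 (sample standard deviation) while numpy's `std` defaults to ddof=0 (population standard deviation), so the two give different numbers on the same data. A small standard-library sketch of the difference, with hypothetical epoch times:

```python
import math
import statistics


def std_with_ddof(values, ddof):
    """Standard deviation with divisor (n - ddof)."""
    mean = sum(values) / len(values)
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (len(values) - ddof))


# Hypothetical epoch times in seconds, for illustration only.
times = [11.0, 12.0, 13.0, 14.0]

sample_std = std_with_ddof(times, ddof=1)      # divisor n - 1
population_std = std_with_ddof(times, ddof=0)  # divisor n

print(f"ddof=1: {sample_std:.4f}, ddof=0: {population_std:.4f}")
```

With only ten epochs per run, the n - 1 divisor gives a noticeably larger spread than the n divisor, so it matters which convention the error bars use.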
2_compare_results.ipynb
"sns.set_theme(context=\"talk\", palette=[\"#7400ff\", \"#e01073\"])\n",
"ax = sns.barplot(data=df)\n",
"for container in ax.containers:\n",
" ax.bar_label(container=container, fontsize=11, fmt=\"%.1fs\", label_type=\"center\")\n",
"ax.set_ylabel(ylabel=\"Data load time per epoch\\n ◀ seconds, lower is better\")\n",
"ax.set_xlabel(\n",
" xlabel=\" (with GDS) (without GDS) \\n\\n xarray backend engine\"\n",
")\n",
"ax.set_title(label=\"Reading ERA5 data with/without GPUDirect Storage\")\n",
"fig = ax.get_figure()\n",
"fig.savefig(fname=\"figures/compare_kvikio_zarr.svg\", bbox_inches=\"tight\")"
For reference, these are the results when running from my freshly booted system (cold start):
The kvikIO engine shows:
- Median: 12.1870 s
- Mean ± Standard deviation: 14.2480 ± 6.6434 seconds/epoch
The zarr engine shows:
- Median: 15.9633 seconds/epoch
- Mean ± Standard deviation: 16.0398 ± 0.4271 seconds/epoch
which is why I switched to reporting the Median time instead of the Mean time at 25d1642.
Run 2, after the GPU has warmed up a bit, and presumably there's some caching going on:
The kvikIO engine shows:
- Median: 11.9764 seconds/epoch
- Mean: 12.2756 ± 1.1680 seconds/epoch
The zarr engine shows:
- Median: 15.7782 seconds/epoch
- Mean: 15.8177 ± 0.2417 seconds/epoch
And I just realized, the bar plot is showing the mean value, not the median 😅
Run 3, after fixing the bar plot to show the median value instead of the mean value.
The kvikIO engine shows:
- Median: 11.8606 seconds/epoch
- Mean: 12.1578 ± 0.8311 seconds/epoch
The zarr engine shows:
- Median: 16.0240 seconds/epoch
- Mean: 15.9295 ± 0.3761 seconds/epoch
Gonna go with this one (commit at 6235109) since 16.0s and 11.9s are almost round numbers 🙂
Seaborn plots the mean value by default, but changing to median instead. The kvikIO engine is now reported as 35% faster than the Zarr engine.
Speed is equal to Distance (or epochs) over time. It makes more sense to report 'less time' (absolute measure) instead of 'faster speed' (inverse measure), so fixing the formulation. Previous calculation of speedup may actually have been incorrect?
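To make the two formulations concrete, here is a small sketch that applies both to the Run 3 medians reported earlier in this thread. The function names are illustrative, and the formulas are an assumption about what is meant by "less time" versus "faster speed"; the formula actually used in the notebook is not preserved in this page.

```python
def time_reduction(baseline_seconds, new_seconds):
    """Fraction of time saved relative to the baseline (absolute measure)."""
    return (baseline_seconds - new_seconds) / baseline_seconds


def speedup(baseline_seconds, new_seconds):
    """Fractional increase in speed (epochs per second) over the baseline."""
    return baseline_seconds / new_seconds - 1.0


# Median seconds/epoch from Run 3 in the review comments.
zarr_median = 16.0240
kvikio_median = 11.8606

print(f"kvikIO takes {time_reduction(zarr_median, kvikio_median):.1%} less time")
print(f"kvikIO is {speedup(zarr_median, kvikio_median):.1%} faster")
```

The same pair of medians gives roughly 26% less time under the first definition and roughly 35% faster under the second, which is why the reported percentage changes when the formulation does.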
Have you tried this one with COGs?
No,
Getting the hard numbers on how fast the GPU-based kvikIO engine is over the CPU-based Zarr engine.
Formula for calculation:
I.e. the kvikio engine takes ~25% less time than the zarr engine to load the ERA5 subset dataset.
Note that:
Preview results in the 2_compare_results.ipynb notebook.
TODO: