v0.16.0

@laggui released this 14 Jan 21:16

Summary

This release significantly enhances GPU utilization through a new tensor transaction mechanism for batched sync operations and simultaneous reads of multiple bindings for CubeCL runtimes. It also includes multiple performance optimizations like mixed precision support for matrix multiplication and convolution operations, as well as notable GEMM improvements.

Backend capabilities have been expanded with a new remote backend for distributed computing, improved SPIR-V support, custom operations fusion and an experimental fused matrix multiplication.

Training components now support semantic segmentation and object detection datasets and add new training metrics. Training performance also improves thanks to an async metric processor.

As with previous releases, this version includes various bug fixes, further performance optimizations, new tensor operations and enhanced documentation.

Module & Tensor

Bug Fixes

  • Fix unsqueeze dims with multiple trailing negative indices (#2496) @laggui
  • Fix one_hot implementation for Int Tensors (#2501) @maun
  • Fix tensor prod and prod dim containing nan values (#2515) @quinton11
  • Expose ItemLazy to be able to implement for custom types (#2525) @laggui
  • Check nonzero stride, dilation and groups (#2540) @laggui
  • Module derive types should inherit visibility (#2610) @laggui
  • Add dropout prob check (#2695) @laggui
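Two of the fixes above (#2540 and #2695) add argument validation. As a rough illustration of what such checks can look like, here is a minimal sketch in plain Rust using only the standard library. The function names `check_dropout_prob` and `check_conv_options`, the error messages, and the choice to return a `Result` are all assumptions made for this example and are not taken from Burn's code.

```rust
/// Hypothetical helper (not Burn's API): accepts a dropout probability
/// only if it lies in the closed interval [0, 1].
fn check_dropout_prob(prob: f64) -> Result<f64, String> {
    // NaN is rejected too, because every comparison with NaN is false.
    if (0.0..=1.0).contains(&prob) {
        Ok(prob)
    } else {
        Err(format!("dropout probability must be in [0, 1], got {prob}"))
    }
}

/// Hypothetical helper (not Burn's API): rejects zero-valued stride,
/// dilation, or groups before a convolution is configured.
fn check_conv_options(
    stride: &[usize],
    dilation: &[usize],
    groups: usize,
) -> Result<(), String> {
    if stride.iter().any(|&s| s == 0) {
        return Err("stride must be non-zero".to_string());
    }
    if dilation.iter().any(|&d| d == 0) {
        return Err("dilation must be non-zero".to_string());
    }
    if groups == 0 {
        return Err("groups must be non-zero".to_string());
    }
    Ok(())
}

fn main() {
    // Valid inputs pass, invalid inputs are reported as errors.
    assert!(check_dropout_prob(0.5).is_ok());
    assert!(check_dropout_prob(1.5).is_err());
    assert!(check_conv_options(&[1, 1], &[1, 1], 1).is_ok());
    assert!(check_conv_options(&[0, 1], &[1, 1], 1).is_err());
}
```

Validating these values up front turns a confusing failure deep inside a kernel (for example, a division by a zero stride) into an immediate, readable error at configuration time.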

Backends

Bug Fixes

Documentation & Examples

Fixes

ONNX Support

Enhancements

Refactoring

Miscellaneous