v0.2.0
- Removed async traits and methods.
- Core functionality reimplemented in krnl:
    - Only targets Vulkan, more portable than Metal / DX12.
        - Metal is supported via MoltenVK.
    - GPGPU kernels implemented inline in Rust:
        - Kernels can be defined in the same file, near where they are invoked.
        - Modules allow sharing code between host and device.
        - Kernel bindings are type safe, checked at compile time.
        - Simple iterator patterns can be implemented without unsafe.
        - Supports specialization constants provided at runtime.
    - DeviceInfo includes useful properties:
        - Max / default threads per group.
        - Max / min threads per subgroup.
    - With DebugPrintf, kernel panics produce errors on the host.
    - krnlc generates a device crate and invokes spirv-builder:
        - spirv-builder / spirv-tools are compiled once on install.
        - Significantly streamlines and accelerates the workflow.
    - Kernels are compressed to reduce package and binary size.
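As an illustration of the item-wise pattern these inline kernels use, here is the host-side Rust equivalent of an element-wise saxpy (plain Rust shown for clarity; this is not the krnl kernel macro API). Iterating zipped slices gives per-element access with no `unsafe`:

```rust
/// Host-side illustration of an item-wise saxpy: y += alpha * x.
/// A kernel expresses the same per-element body; this plain-Rust
/// version shows why simple iterator patterns need no `unsafe`.
fn saxpy(alpha: f32, x: &[f32], y: &mut [f32]) {
    for (x, y) in x.iter().zip(y.iter_mut()) {
        *y += alpha * x;
    }
}

fn main() {
    let x = vec![1.0f32, 2.0, 3.0];
    let mut y = vec![10.0f32, 20.0, 30.0];
    saxpy(2.0, &x, &mut y);
    assert_eq!(y, [12.0, 24.0, 36.0]);
}
```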
- Device operations readily execute:
    - Block until kernels / transfers can queue.
        - An operation can be queued while another is executing.
    - Reduced latency, better repeatability, reliability, and performance.
- Device buffers can be copied by the host if host visible.
- Large buffer copies are streamed rather than allocating a large temporary:
    - Reuses a few small buffers for transfers.
    - Overlaps host and device copies.
    - Performance significantly closer to CUDA.
    - Also streams between devices.
- Device buffers can be up to i32::MAX bytes (~2 GB, up from 256 MB).
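The streaming scheme can be sketched in plain Rust (names and buffer sizes are illustrative, not autograph's internals): instead of one temporary as large as the source, the copy is chunked through a small staging buffer that is reused for every chunk.

```rust
/// Illustrative streamed copy: move `src` into `dst` through a small,
/// reused staging buffer instead of one large temporary allocation.
/// In autograph the staging buffers are host-visible device buffers and
/// the host / device copies overlap; this sketch shows only the chunking.
fn streamed_copy(src: &[u8], dst: &mut [u8], staging_len: usize) {
    let mut staging = vec![0u8; staging_len]; // reused for every chunk
    for (s, d) in src.chunks(staging_len).zip(dst.chunks_mut(staging_len)) {
        let stage = &mut staging[..s.len()];
        stage.copy_from_slice(s); // "upload" into the staging buffer
        d.copy_from_slice(stage); // "download" into the destination
    }
}

fn main() {
    let src: Vec<u8> = (0u8..=255).cycle().take(1 << 16).collect();
    let mut dst = vec![0u8; src.len()];
    streamed_copy(&src, &mut dst, 4 * 1024);
    assert_eq!(src, dst);
}
```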
- Scalar / ScalarBufferBase replace Float / FloatBuffer:
    - Streamlined conversions between buffers.
    - Buffers can be sliced.
- Supports wasm (without the device feature).
- TensorBase and ScalarTensorBase implemented with krnl::BufferBase and krnl::ScalarBufferBase:
    - Streamlined conversions between tensor types.
- Host ops accelerated with rayon.
- Improved and streamlined device gemm kernel.
- Device sum and sum_axis use subgroup reductions for improved performance.
- Replaced Criterion trait with Accuracy / CrossEntropyLoss traits.
- ops::AddAssign implemented by Tensor and Variable.
- Implemented ndarray::linalg::Dot for Tensor and Variable.
- Direct convolution algorithm for better host performance.
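Direct convolution computes each output element straight from the definition, with no im2col-style temporary matrix. A minimal 1-D "valid" cross-correlation sketch (illustrative only, not autograph's actual kernel):

```rust
/// Direct "valid" 1-D cross-correlation: out[i] = sum_k input[i + k] * w[k].
/// Computing each output directly avoids materializing an im2col matrix.
/// Assumes w.len() <= input.len().
fn conv1d_direct(input: &[f32], w: &[f32]) -> Vec<f32> {
    let out_len = input.len() + 1 - w.len();
    (0..out_len)
        .map(|i| w.iter().zip(&input[i..]).map(|(w, x)| w * x).sum())
        .collect()
}

fn main() {
    let out = conv1d_direct(&[1.0, 2.0, 3.0, 4.0], &[1.0, 0.5]);
    assert_eq!(out, [2.0, 3.5, 5.0]);
}
```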
- Removed learn::kmeans.
- Redesigned autograd:
    - Autograd replaced with VariableBuilder:
        - Nodes and edges applied when building a Variable.
        - Backward edges are simply f(output_grad) -> input_grad.
        - Gradients are automatically accumulated.
    - Parameter and Variable are separate types (instead of VertexBase):
        - Parameters can be converted to Variables.
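The backward-edge design above can be sketched with plain closures (hypothetical names, not autograph's actual types): each edge maps an output gradient to an input gradient, and a node's gradient accumulates the contributions from all of its edges by summation.

```rust
fn main() {
    // y = 3*x + 2*x records two backward edges into x, each of the
    // form f(output_grad) -> input_grad.
    let edges: Vec<Box<dyn Fn(f32) -> f32>> =
        vec![Box::new(|dy| 3.0 * dy), Box::new(|dy| 2.0 * dy)];

    // With an output gradient of 1, x's gradient accumulates 3 + 2 = 5.
    let output_grad = 1.0;
    let mut input_grad = 0.0;
    for edge in &edges {
        input_grad += edge(output_grad); // gradients accumulate automatically
    }
    assert_eq!(input_grad, 5.0);
}
```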
- Redesigned Layer trait:
    - for_each_parameter fns instead of returning a Vec.
    - Layers can be cast to a ScalarType.
    - Removed enumeration of child layers.
- Redesigned Forward trait:
    - Generic over input and output type.
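A minimal sketch of a Forward trait that is generic over the input type, with an associated output type (hypothetical signatures, not autograph's exact trait): one layer can then accept several input kinds and produce matching outputs.

```rust
/// Sketch of a Forward trait generic over the input type, with an
/// associated Output, so a single layer can handle multiple input kinds.
trait Forward<X> {
    type Output;
    fn forward(&self, input: X) -> Self::Output;
}

/// A toy layer that scales its input, implementing Forward for both
/// a scalar and a vector input, with different output types.
struct Scale(f32);

impl Forward<f32> for Scale {
    type Output = f32;
    fn forward(&self, input: f32) -> f32 {
        self.0 * input
    }
}

impl Forward<Vec<f32>> for Scale {
    type Output = Vec<f32>;
    fn forward(&self, input: Vec<f32>) -> Vec<f32> {
        input.into_iter().map(|x| self.0 * x).collect()
    }
}

fn main() {
    let layer = Scale(2.0);
    assert_eq!(layer.forward(3.0), 6.0);
    assert_eq!(layer.forward(vec![1.0, 2.0]), vec![2.0, 4.0]);
}
```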
- Derive improvements:
    - Removed the layer attribute.
    - Supports enums.
    - Fields can be skipped.
- Redesigned Optimizer trait:
    - Added learning rate.
    - Accepts a single parameter instead of a slice.
- Parameter optimizer::State:
    - Can be serialized / deserialized with serde.
- Simplified Iris dataset.
- MNIST dataset:
    - Replaced the downloader with curl.
    - Decompresses in parallel with rayon.
MSRV: 1.70.0