v0.2.0
- Removed async traits and methods.
- Core functionality reimplemented in krnl:
    - Only targets Vulkan, more portable than Metal / DX12.
        - Metal is supported via MoltenVK.
    - GPGPU kernels implemented inline in Rust:
        - Kernels can be defined in the same file, near where they are invoked.
        - Modules allow sharing code between host and device.
        - Kernel bindings are type safe, checked at compile time.
        - Simple iterator patterns can be implemented without unsafe.
        - Supports specialization constants provided at runtime.
    - DeviceInfo includes useful properties:
        - Max / default threads per group.
        - Max / min threads per subgroup.
    - With DebugPrintf, kernel panics produce errors on the host.
    - krnlc generates a device crate and invokes spirv-builder:
        - spirv-builder / spirv-tools are compiled once on install.
        - Significantly streamlines and accelerates the workflow.
    - Kernels are compressed to reduce package and binary size.
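As an illustration of the item-wise pattern these inline kernels use, here is the host-side Rust equivalent of an element-wise saxpy (plain Rust shown for clarity; this is not the krnl kernel macro API). Iterating zipped slices gives per-element access with no `unsafe`:

```rust
/// Host-side illustration of an item-wise saxpy: y += alpha * x.
/// A kernel expresses the same per-element body; this plain-Rust
/// version shows why simple iterator patterns need no `unsafe`.
fn saxpy(alpha: f32, x: &[f32], y: &mut [f32]) {
    for (x, y) in x.iter().zip(y.iter_mut()) {
        *y += alpha * x;
    }
}

fn main() {
    let x = vec![1.0f32, 2.0, 3.0];
    let mut y = vec![10.0f32, 20.0, 30.0];
    saxpy(2.0, &x, &mut y);
    assert_eq!(y, [12.0, 24.0, 36.0]);
}
```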
- Device operations readily execute:
    - Block until kernels / transfers can queue.
        - An operation can be queued while another is executing.
    - Reduced latency, better repeatability, reliability, and performance.
- Device buffers can be copied by the host if host visible.
- Large buffer copies are streamed rather than allocating a large temporary:
    - Reuses a few small buffers for transfers.
    - Overlaps host and device copies.
    - Performance significantly closer to CUDA.
    - Also streams between devices.
- Device buffers can be up to i32::MAX bytes (~2 GB, up from 256 MB).
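The streaming scheme can be sketched in plain Rust (names and buffer sizes are illustrative, not autograph's internals): instead of one temporary as large as the source, the copy is chunked through a small staging buffer that is reused for every chunk.

```rust
/// Illustrative streamed copy: move `src` into `dst` through a small,
/// reused staging buffer instead of one large temporary allocation.
/// In autograph the staging buffers are host-visible device buffers and
/// the host / device copies overlap; this sketch shows only the chunking.
fn streamed_copy(src: &[u8], dst: &mut [u8], staging_len: usize) {
    let mut staging = vec![0u8; staging_len]; // reused for every chunk
    for (s, d) in src.chunks(staging_len).zip(dst.chunks_mut(staging_len)) {
        let stage = &mut staging[..s.len()];
        stage.copy_from_slice(s); // "upload" into the staging buffer
        d.copy_from_slice(stage); // "download" into the destination
    }
}

fn main() {
    let src: Vec<u8> = (0u8..=255).cycle().take(1 << 16).collect();
    let mut dst = vec![0u8; src.len()];
    streamed_copy(&src, &mut dst, 4 * 1024);
    assert_eq!(src, dst);
}
```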
- Scalar / ScalarBufferBase replace Float / FloatBuffer:
    - Streamlined conversions between buffers.
    - Buffers can be sliced.
- Supports wasm (without the device feature).
- TensorBase and ScalarTensorBase implemented with krnl::BufferBase and krnl::ScalarBufferBase:
    - Streamlined conversions between tensor types.
- Host ops accelerated with rayon.
- Improved and streamlined device gemm kernel.
- Device sum and sum_axis use subgroup reductions for improved performance.
- Replaced Criterion trait with Accuracy / CrossEntropyLoss traits.
- ops::AddAssign implemented by Tensor and Variable.
- Implemented ndarray::linalg::Dot for Tensor and Variable.
- Direct convolution algorithm for better host performance.
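Direct convolution computes each output element straight from the definition, with no im2col-style temporary matrix. A minimal 1-D "valid" cross-correlation sketch (illustrative only, not autograph's actual kernel):

```rust
/// Direct "valid" 1-D cross-correlation: out[i] = sum_k input[i + k] * w[k].
/// Computing each output directly avoids materializing an im2col matrix.
/// Assumes w.len() <= input.len().
fn conv1d_direct(input: &[f32], w: &[f32]) -> Vec<f32> {
    let out_len = input.len() + 1 - w.len();
    (0..out_len)
        .map(|i| w.iter().zip(&input[i..]).map(|(w, x)| w * x).sum())
        .collect()
}

fn main() {
    let out = conv1d_direct(&[1.0, 2.0, 3.0, 4.0], &[1.0, 0.5]);
    assert_eq!(out, [2.0, 3.5, 5.0]);
}
```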
- Removed learn::kmeans.
- Redesigned autograd:
    - Autograd replaced with VariableBuilder:
        - Nodes and edges applied when building a Variable.
        - Backward edges are simply f(output_grad) -> input_grad.
        - Gradients are automatically accumulated.
    - Parameter and Variable are separate types (instead of VertexBase):
        - Parameters can be converted to Variables.
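The backward-edge design above can be sketched with plain closures (hypothetical names, not autograph's actual types): each edge maps an output gradient to an input gradient, and a node's gradient accumulates the contributions from all of its edges by summation.

```rust
fn main() {
    // y = 3*x + 2*x records two backward edges into x, each of the
    // form f(output_grad) -> input_grad.
    let edges: Vec<Box<dyn Fn(f32) -> f32>> =
        vec![Box::new(|dy| 3.0 * dy), Box::new(|dy| 2.0 * dy)];

    // With an output gradient of 1, x's gradient accumulates 3 + 2 = 5.
    let output_grad = 1.0;
    let mut input_grad = 0.0;
    for edge in &edges {
        input_grad += edge(output_grad); // gradients accumulate automatically
    }
    assert_eq!(input_grad, 5.0);
}
```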
- Redesigned Layer trait:
    - for_each_parameter fns instead of returning a Vec.
    - Layers can be cast to a ScalarType.
    - Removed enumeration of child layers.
- Redesigned Forward trait:
    - Generic over input and output type.
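A minimal sketch of a Forward trait that is generic over the input type, with an associated output type (hypothetical signatures, not autograph's exact trait): one layer can then accept several input kinds and produce matching outputs.

```rust
/// Sketch of a Forward trait generic over the input type, with an
/// associated Output, so a single layer can handle multiple input kinds.
trait Forward<X> {
    type Output;
    fn forward(&self, input: X) -> Self::Output;
}

/// A toy layer that scales its input, implementing Forward for both
/// a scalar and a vector input, with different output types.
struct Scale(f32);

impl Forward<f32> for Scale {
    type Output = f32;
    fn forward(&self, input: f32) -> f32 {
        self.0 * input
    }
}

impl Forward<Vec<f32>> for Scale {
    type Output = Vec<f32>;
    fn forward(&self, input: Vec<f32>) -> Vec<f32> {
        input.into_iter().map(|x| self.0 * x).collect()
    }
}

fn main() {
    let layer = Scale(2.0);
    assert_eq!(layer.forward(3.0), 6.0);
    assert_eq!(layer.forward(vec![1.0, 2.0]), vec![2.0, 4.0]);
}
```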
- Derive improvements:
    - Removed the layer attribute.
    - Supports enums.
    - Fields can be skipped.
- Redesigned Optimizer trait:
    - Added learning rate.
    - Accepts a single parameter instead of a slice.
- Parameter optimizer::State:
    - Can be serialized / deserialized with serde.
- Simplified Iris dataset.
- MNIST dataset:
    - Replaced the downloader with curl.
    - Decompresses in parallel with rayon.
MSRV: 1.70.0