- Interface files for `Backends` and `Low_level`.
- Fixed #245: tracking of used memory. But there's room for improvement.
- Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization.
- Migrated to cudajit 0.6.1.
- Verifying that code is linked with the right contexts, by tracking `embedded_nodes` with assignments.
- Renaming: (virtual) `device` -> `stream`, `physical_device` -> `device`.
- New files: split out `backend_intf.ml`, `backend_impl.ml`, `schedulers.ml` from `backends.ml`; moved `Tnode.task` to `task.ml`; renamed `backend_utils.ml` to `c_syntax.ml`.
- Removed half-static verification of merge buffer nodes inside `device_to_device`.
- Fixed #286: cross-stream-sharing incorporated into `Tnode.memory_mode`.
- Moved the multicore backend from a `device = stream` model to a single device model.
- Got rid of `unsafe_cleanup`.
- Rename `subordinal` to `stream_id`.
- Removed dependency on `core`, broke up dependency on `ppx_jane`.
. - Huge refactoring of backend internal interfaces and API (not repeating same code).
- Built per-tensor-node stream-to-stream synchronization into copying functions.
- Re-introduced whole-device blocking synchronization, which now is just a slight optimization as it also cleans up event book-keeping.
- Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
- Fresh backends are now fresh modules to structurally prevent any potential cache leaking.
- Validating merge nodes for the CUDA backend.
- Checking `is_released` on weak array retrieval.
- Implemented the previously-mocked support for half precision (FP16).
  - We work around the missing Ctypes coverage by not using `Ctypes.bigarray_start`.
  - We check FP16 constants for overflow.
  - We output half precision specific code from the CUDA backend.
- Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via `Tnode.update_prec`.
- A placeholder `nn_blocks.ml` hinting at an intended design pattern for model components.
- A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
- Slides for the Fun OCaml meetup: docs/Fun OCaml.
- New syntax: inline tensor declarations with a literal float as initial value.
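
The FP16 constant overflow check listed above can be illustrated with a minimal sketch. This is not the project's actual code: `fp16_max`, `overflows_fp16` and `check_fp16_constant` are hypothetical names, and the check is deliberately conservative (it flags any finite magnitude above the largest finite half-precision value, 65504).

```ocaml
(* Hypothetical helpers for illustration only; not OCANNL's implementation. *)

(* Largest finite IEEE 754 binary16 value: (2 - 2^-10) * 2^15 = 65504. *)
let fp16_max = 65504.0

(* True when a finite constant's magnitude exceeds the largest finite FP16 value.
   NaN and explicit infinities are not reported as overflow. *)
let overflows_fp16 (c : float) : bool =
  Float.is_finite c && Float.abs c > fp16_max

(* Either passes the constant through or returns a diagnostic message. *)
let check_fp16_constant (c : float) : (float, string) result =
  if overflows_fp16 c then
    Error (Printf.sprintf "constant %g overflows FP16 (max %g)" c fp16_max)
  else Ok c
```

A code generator could call such a check on every literal before emitting it with a half-precision type, and fail early instead of silently producing an infinity.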
- Removed the `pipes_cc, pipes_gccjit` backends (`Pipes_multicore_backend`) -- I had fixed `Pipes_multicore_backend` by using the `poll` library instead of `Unix.select`, but it turns out to be very very slow.
- Changed the `%cd` block comment syntax `~~` to allow detailed structuring. Rewrote `Train.grad_update` to use the `%cd` syntax.
- Made `Train.sgd_one` slightly more thrifty: `p =- learning_rate *. sgd_delta` --> `p =- learning_rate * sgd_delta ~logic:"."` without the inline tensor expression.
- Log levels related de-confusion:
  - Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
  - Properly restore `log_level` and inform about its setting.
  - By default do not log from tests.
  - `debug_log_from_routines` should only happen when `log_level > 1`.
- Bugs in `Multicore_backend`: `await` was not checking queue emptiness, `worker`'s `Condition.broadcast` was non-atomically guarded (doesn't need to be), possible deadloop due to the lockfree queue -- now replaced with `saturn_lockfree`.
- Reduced busy-waiting inside `c_compile_and_load`, propagating compilation errors now instead of infinite loop on error.
- Fixed loss of significant digits for small numbers when outputting files.
- Added missing mixed-precision conversions in the `C_syntax` backend builder.
- Restored the functionality of debug logging from the cuda backend.
- Always reinitialize global state at the beginning of `let%expect_test`, to make them more deterministic.
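
The "loss of significant digits for small numbers" item describes a class of problem that is easy to reproduce. The sketch below is only an illustration under the assumption that fixed-point formatting was involved; the changelog does not say how the project fixed it.

```ocaml
(* A small value that fixed-point formatting truncates to zero. *)
let x = 1.5e-9

(* "%f" keeps six digits after the decimal point, so all significant digits are lost. *)
let lossy = Printf.sprintf "%f" x

(* "%.17g" keeps 17 significant digits, enough to round-trip any finite double. *)
let exact = Printf.sprintf "%.17g" x

let () =
  assert (lossy = "0.000000");
  assert (float_of_string exact = x)
```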
- A new backend "cc": C based on a configurable C compiler command, defaulting to `cc`.
- Merge buffers representational abstraction (one per virtual device):
  - backends just need to support device-to-device transfers,
  - merging gets implemented in "user space".
- CUDA streaming multiprocessor parallelism via streams <-> virtual devices.
- Support for `cuda-gdb` and `compute-sanitizer` (pass the right arguments to cudajit).
- Inline declarations for (non-differentiable) tensors in the `%cd` syntax.
- A minimal wrapper `Sync_backend` creating CPU backends with a single device only, where all calls are synchronous. (It's a baseline and helps debugging.)
- In progress: proper (condition variables based) scheduler. The legacy scheduler (pipes based) kept for now as baseline and to help debugging.
- Documentation for the syntax extensions.
- `%op` syntax: when under a `~config` parameter, refine the inline declared params' labels with `config.label`.
- `%op` syntax: incorporate the input tensor's (if any) label in the resulting tensor's label.
- Comments in config files using the line prefix `~~`.
- Terminology in the API: Renamed almost all uses of "jit" into uses of "compile" and / or "link".
- Split the compile-to-ptx phase from the build-module and build-kernel-launcher phase.
- Migrated the CUDA backend to ppx_minidebug-based execution tracing.
- Fixes for mixed precision computations.
- Further terminology refactoring: Renamed `Low_level.compile` to `Low_level.lower`;
  - and `Low_level.compiled` to `Low_level.optimized`, making it a record.
- Further refactoring of the `Backends` API:
  - split the `device` type into virtual `device` and `physical_device`,
  - removed the direct support for `merge`, instead relying on merge buffers.
- Updated to cudajit 0.4.
- A template for C-syntax backends, refactoring CC and CUDA backends.
- Improvements to handling of tensor node labels, and to the `Tnode.debug_name` function.
- Output files generated by backends, and files generated by logging, in separate subdirectories.
- C-syntax logging: also output the pre-assignment value when logging an assignment.
- Migrated to ppx_minidebug 2.0 with the benefits it brings: no runtime passing, `Utils.settings.log_level` unified with ppx_minidebug's log levels.
- Allow verifying that non-embedded tensor nodes of the tensor(s) associated with a linked code are already in the context passed to `link` (resp. `link_batch`), since they won't get introduced into the context. It is the responsibility of helper functions (such as those in `Train`) to ensure the check.
- Fixed both known and newly discovered shortcomings of the syntax extensions.
  - In particular, `%op` syntax: lift `~config` applications out of (tensor) functions.
- Multiple other tiny fixes.
- GitHub workflow for continuous integration and API docs.
- Randomness plug-ins via global config `randomness_lib`: currently only `stdlib` and `for_tests`.
- A bit of code rot in the Cuda backend mock `cuda_backend.missing.ml`.
- NPY: Compatibility with OCaml 5.2.0.
- Renamed the main package name from `ocannl` to `neural_nets_lib`, to prevent the opam linter from complaining about a confusing name.
- `let%cd _ =` (and `let%op _ =`?) do not affect root tracking (intended for adding shape constraints).
- More expressive shape constraints: allowing row variables to be sandwiched between leftmost axes `beg_dims` and rightmost axes `dims`.
- Einsum notation support for leftmost axes.
- Cleaned up "user-facing" API by moving `IDX` and `CDSL` to `Train`, and `Tensor.O` to more precise `Operation.At`.
- Added interface `Tensor.mli` to reduce "the user learning surface".
- Improved documentation and layout of `Shape.mli`.
- A more reasonable syntax for labels specifications and einsum notation. In particular, whitespace insensitive (except whitespace not allowed inside identifiers).
- Vendored the `npy` package while we wait for a PR.
- Moved `cudajit` to `depopts`.
- Slice shape inference is now complete, by using leftmost axes `beg_dims` in constraints.
- Tensor parameters saving and restoring, Ndarray saving and restoring.
- An operation `outer_sum`: like `einsum` but simpler, addition everywhere.
- Tweaks to make the project usable as a package (external library).
- Sanitizing code inclusion via code roots management: `Tensor.consume_forward_code` and `consume_backprop_code`, (optionally but by default) used from `Train`.
- Shape inference in presence of non-0 fixed indexing inside einsums was broken (because actually not implemented).
- Incompleteness of shape inference for slicing was leading to inferring shapes with no axes: constraint generation was intended to raise a shape error instead. Proper fix coming in 0.3.2 will make slice shape inference complete.
Major rewrite. Abandoning the design choices of 0.1 and 0.2.
- Optionally, inferring or checking tensor (batch) sizes from data (e.g. file) sizes.
- Static indexing. A "slice" operator to select individual batches.
- Established the backends API with first-class modules.
- The `Train` module as an optimization "frontend".
- Parallel optimization across devices.
- Global settings configurable via config files, environment variables, and commandline flags.
- Integration of backend logging with `ppx_minidebug` (the `debug_log_from_routines` setting).
- The Cuda backend is not supported for now. It is (optionally) buildable to reduce code rot.
- Dynamic indexing is not supported anymore (to reduce complexity). It might be reintroduced if needed.
- Factored out the `arrayjit` library / package containing compilation (former Ndarray, Node, Code).
- Renamed `Formula` -> `Tensor`
- No more "form vs. non-form" formulas / tensors.
- Formula/tensor roots are split into forward roots and backprop roots.
- No more `%nn_rs`, `%nn_dt` syntaxes and `Synthetic` fetch primitive.
- Renamed `%nn_op` to `%op` and `%nn_cd` to `%cd`.
- Migrated `gccjit` into a separate repository.
- Migrated `cudajit` into a separate repository.
- Massive rewrite of shape inference in a declarative style.
- Generalize `zero_out` to `initialize_neutral` to prepare arbitrary accumulation operation.
- Renamed `Node` -> `Lazy_array` -> `Tnode` (tensor node).
- The Cuda backend.
- The Cudajit interface based on Nvrtc and the Cuda driver API.
- A naive `Exec_as_cuda` backend where the dedicated `Task_id` axis parallelizes over blocks, and a new dedicated `Sample_num` axis parallelizes over threads in a block.
- When outputting debug files, stores the source `.cu` code and the assembly `.ptx` code.
- Supports thread-only tensors, tensors with thread-local "replicated" working copies, constant tensors, and globally updated tensors.
- The backend uses atomic adds for shared updates, and within-block synchronization to minimize update races and parameter staleness.
- Debugging: full trace (for thread 0) by logging assignments with the assigned value and indices for the LHS tensor and the RHS tensors, the expression used to compute the assigned value, and the values of subexpressions.
- Cuda FFI for retrieving GPU specs and for getting and setting limits.
- `Zero_out` low-level-code primitive using `memset`.
- `Staged_compilation` low-level-code primitive: a (stateful) callback for use by backends.
- When outputting debug files, also stores the high-level code.
- Saving and restoring tensor content to `.npz` (`.npy` archive) files (untested).
- Low-level code based optimizations:
  - unrolls `ToPowOf` with integer exponent,
  - simplifies local computations that are just expressions,
  - some arithmetic simplifications.
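
The first of the low-level optimizations above, unrolling a power with an integer exponent, can be sketched on a toy expression type. Everything here is hypothetical: `expr`, `max_unrolled_exponent` and `unroll_pow` are illustrative names, `Pow` merely stands in for `ToPowOf`, and the real low-level code representation is richer.

```ocaml
(* Hypothetical miniature expression type, for illustration only. *)
type expr =
  | Const of float
  | Var of string
  | Mul of expr * expr
  | Pow of expr * float (* stands in for ToPowOf *)

(* Assumed cut-off above which a call to a power function is kept. *)
let max_unrolled_exponent = 8

(* Rewrites [Pow (e, n)] into a chain of n - 1 multiplications when n is a
   small positive integer; other exponents are left untouched. *)
let rec unroll_pow (e : expr) : expr =
  match e with
  | Const _ | Var _ -> e
  | Mul (a, b) -> Mul (unroll_pow a, unroll_pow b)
  | Pow (base, p) ->
      let base = unroll_pow base in
      if Float.is_integer p && p >= 1.0 && p <= float_of_int max_unrolled_exponent
      then
        let rec go acc k = if k <= 1 then acc else go (Mul (acc, base)) (k - 1) in
        go base (int_of_float p)
      else Pow (base, p)
```

For example, `unroll_pow (Pow (Var "x", 3.0))` yields `Mul (Mul (Var "x", Var "x"), Var "x")`. The sketch duplicates the base expression; a real implementation would more likely bind it to a local first when it is not a plain variable.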
- Monomorphic `axis_index`, simplified the axes-related types.
- Splits `'a low_level` into monomorphic `unit_low_level` and `float_low_level`.
- Removes integer bigarray types.
- Refactors `Node` + `NodeUI` into `Ndarray` + `Node`.
- Tensor printouts include whether a tensor contains `NaN` or `infinity`.
- Simplifies the `Task_id` functionality: removes `If_task_id_is` and `Global Task_id`; removes parallelism from `interpret_code`; removes `task_id_func` vs `unit_func` duplication.
- "Non-diff" code inclusion.
- Ensures unique indices/symbols also for the `task_id` and `sample_num` bindings.
- Removes endlines from `PrintBox_utils` benchmark tables cells.
- The Gccjit backend operates using "on device" copies of tensors, where the "device memory" is the stack of the C function. This is intended to improve cache locality and reduce cache contention.
  - Three / four synchronization heuristics:
    - "parallel": a slice of the tensor is copied host-to-device at the beginning and device-to-host at the end, without interference because each task has a different slice.
    - "update on host": the tensor is copied host-to-device at the beginning; each write is an update, it reads the old value from host to update it on the host. Thus each write is a synchronization point.
    - "replicated": the tensor is copied host-to-device at the beginning; only task 0 copies device-to-host.
    - "device-only": no copying to/from host.
- On-device-only tensors that are not materialized on the OCaml side.
- A new category of axis dimensions is introduced: `Frozen`. It is analogous to the `Parallel` axis category in that a single task execution / "device call" only processes a 1D slice of the axis.
  - Currently, for tensors processed in parallel, we only support processing of a contiguous tensor slice (copied "to device" using `memcpy`).
- A new syntax `%nn_rs` ("postprocess results" variant of `%nn_dt`) for computations that should happen at the end of task execution / refresh step. It's meant to prepare the data to be copied back to the host.
- Got rid of backend-agnostic synchronization. It was not worth the complexity / implementation effort at this point.
  - Keeping the `Rebalance` constructor around, but it is not playing any role.
- Got rid of `debug_virtual_nodes`, was tricky to maintain.
- Dynamic indexing now skips over parallel axes: when there is a `Parallel` axis on the left, it is preserved in the resulting tensor (slice), and the next-right axis is indexed into instead.
  - Removed the "indexing axes from-right" functionality for now (fails as not implemented).
- Dynamic indexing now can produce virtual nodes.
- Dynamic indexing fixes.
- Thread-local parameter `task_id` for automated iteration over a dimension `Parallel`.
  - This implements multicore SGD.
  - Rebalancing of computations that don't use `Parallel`, and synchronization in the `Gccjit` backend, are left as future work.
  - Already provides significant speedups in the interpreter (6-7x for me), but that's a moot point.
  - Giving up further work on this approach for now, because the bottleneck is the memory access with `Gccjit`.
  - Keeping the new representation capability around, maybe it will be a stepping stone to other things.
- Monolithic step update with "macrobatch" (multiple steps within one backend call).
- Streamlined the source code, e.g. removed the `OCaml` backend.
- Better syntax for `%nn_dt` and `%nn_op` shape specification, allows identifiers.
- Improved virtual node and scalar constant inlining.
- Better debugging, e.g. an option to "trace" `Gccjit` execution by printing the comments.
- An inline constants optimization that compile-time computes scalar constant subexpressions and inlines the values.
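
The idea behind the inline constants optimization (computing scalar constant subexpressions at compile time) is ordinary constant folding. A minimal sketch on a toy expression type follows; `expr`, `Get` and `fold` are hypothetical names and do not reflect the project's actual data structures.

```ocaml
(* Hypothetical miniature expression type, for illustration only. *)
type expr =
  | Const of float
  | Get of string (* stands in for a read of a tensor cell *)
  | Add of expr * expr
  | Mul of expr * expr

(* Bottom-up folding: whenever both operands reduce to constants,
   replace the operation by its computed value. *)
let rec fold (e : expr) : expr =
  match e with
  | Const _ | Get _ -> e
  | Add (a, b) -> (
      match (fold a, fold b) with
      | Const x, Const y -> Const (x +. y)
      | a', b' -> Add (a', b'))
  | Mul (a, b) -> (
      match (fold a, fold b) with
      | Const x, Const y -> Const (x *. y)
      | a', b' -> Mul (a', b'))
```

With this definition, `fold (Mul (Add (Const 1.0, Const 2.0), Get "x"))` becomes `Mul (Const 3.0, Get "x")`, so the addition no longer happens at run time.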
- Improved debuggability.
- A last-minute breaking bug (would be nice to have a pre-release or a pre-publish hook to run tests!).
- The virtual nodes optimization is more robust, correct even with aggressive inlining settings (e.g. escaping variables check).
- The first changes-tracking release. Earlier development history is still somewhat documented via closed issues.
- Supports single and double precision floats, more precisions in the future.
- Generates a monolithic step update routine executed by `refresh_session ()`, but can generate arbitrary additional routines at arbitrary times to be executed at arbitrary other times within a session.
- An `Interpreter` backend that can for example log all individual tensor modifications.
- A `Gccjit` backend that can sometimes be 400x faster than the `Interpreter` backend (without any debug work/output).
- A virtual nodes (tensors) optimization that inlines computation of a cell in lieu of tensor accesses, can sometimes reduce memory consumption by 1/3.