
Save/load from disk (serializing / marshalling) #163

Open · mratsim opened this issue Nov 26, 2017 · 23 comments

Comments

@mratsim
Owner

mratsim commented Nov 26, 2017

Format to be defined:

  • Non-binary (will certainly have size issues)
  • Binary

@mratsim
Owner Author

mratsim commented Dec 22, 2017

After further research including Thrift, Avro, Cap'n Proto and FlatBuffers, I've concluded that a binary serializer with schema would be best:

  • It benefits from Nim's static typing and prevents misparsing/interpreting data as the wrong type.
  • You can (probably) prefetch, read in parallel, and read straight from disk, because the data sits at computable offsets (see the sketch below).
  • Binary is smaller by default (though a .gz pass might be useful as well).
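
To illustrate the computable-offsets point: with a fixed-size header and a known shape and dtype, the byte position of any element follows from arithmetic alone, so a reader can seek (or mmap) straight to it. A minimal sketch; the header size and row-major layout are illustrative assumptions, not a committed format:

proc elemOffset(headerBytes, i, j, ncols, itemSize: int): int =
  ## Byte offset of element (i, j) in a row-major rank-2 tensor laid out
  ## after a fixed-size header. itemSize is e.g. 4 for float32.
  headerBytes + (i * ncols + j) * itemSize

echo elemOffset(headerBytes = 128, i = 2, j = 3, ncols = 10, itemSize = 4)
# 128 + 23 * 4 = 220: a reader can seek or memory-map to byte 220 directly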

@mratsim
Owner Author

mratsim commented May 6, 2018

Suggested by @sherjilozair, .npy file support would be great.

Description, specs and C utilities here:

In the same vein, it will be useful to be able to load common files like:

NESM should also be considered; not sure it can be used for memory-mapping, though: https://xomachine.github.io/NESM/

mratsim added a commit that referenced this issue May 9, 2018
* Add Numpy .npy file reader see #163

* Add Numpy writer

* Add tests for .npy writer

* openFileStream is only on devel + enable numpy tests

* SomeFloat is only on devel as well ...
@Vindaar
Collaborator

Vindaar commented May 17, 2018

I'll try to find some time in the near future to allow loading of .hdf5 files using nimhdf5 (arraymancer is a dependency anyway). Do you know of any examples of neural networks stored in an .hdf5 file that I could use as a reference?

@sherjilozair

@Vindaar
Collaborator

Vindaar commented May 17, 2018

Sweet, thanks!

@smartmic

This issue is still open and I am wondering: what would be the canonical way to save/load a neural network defined in Arraymancer in early 2020? HDF5, msgpack, …? Will there be an interface for serializing NNs for deployment, or do I have to define my own structure? An example for noobs like me would be really helpful, but I will also try myself.

@mratsim
Owner Author

mratsim commented Nov 21, 2019

Currently there is no model-wide saving support.
For the future it will probably be HDF5 and/or ONNX/Protobuf.

For individual tensors you can use Numpy or HDF5.

See the tests for usage:

HDF5

withFile(test_write_file):
  # write 1D tensor and read back
  expected_1d.write_hdf5(test_write_file)
  let a = read_hdf5[int64](test_write_file)
  check a == expected_1d
withFile(test_write_file):
  # write a 1D integer tensor and read back as float
  expected_1d.write_hdf5(test_write_file)
  let a = read_hdf5[float64](test_write_file)
  check a == expected_1d_float
withFile(test_write_file):
  # write 2D tensor and read back
  expected_2d.write_hdf5(test_write_file)
  let a = read_hdf5[int64](test_write_file)
  check a == expected_2d
withFile(test_write_file):
  # write a 2D integer tensor and read back as float
  expected_2d.write_hdf5(test_write_file, name = tensorName)
  let a = read_hdf5[float64](test_write_file, name = tensorName)
  check a == expected_2d_float
withFile(test_write_file):
  # write 4D tensor and read back
  expected_4d.write_hdf5(test_write_file, name = name4D)
  let a = read_hdf5[int64](test_write_file, name = name4D)
  check a == expected_4d
withFile(test_write_file):
  # write tensor to subgroup and read back
  expected_2d.write_hdf5(test_write_file, group = groupName)
  let a = read_hdf5[int64](test_write_file, group = groupName)
  check a == expected_2d
withFile(test_write_file):
  # write several tensors one after another, read them back
  expected_1d.write_hdf5(test_write_file)
  expected_1d_float.write_hdf5(test_write_file)
  expected_2d.write_hdf5(test_write_file)
  expected_2d_float.write_hdf5(test_write_file)
  expected_4d.write_hdf5(test_write_file)
  let a = read_hdf5[int64](test_write_file, number = 0)
  check a == expected_1d
  let b = read_hdf5[float64](test_write_file, number = 1)
  check b == expected_1d_float
  let c = read_hdf5[int64](test_write_file, number = 2)
  check c == expected_2d
  let d = read_hdf5[float64](test_write_file, number = 3)
  check d == expected_2d_float
  let e = read_hdf5[int64](test_write_file, number = 4)
  check e == expected_4d

Numpy

test "[IO] Reading numpy files with numeric ndarrays":
# Reading from an int64 - little endian
block:
let a = read_npy[int64](folder & "int.npy")
check: a == expected_1d
block:
let a = read_npy[int64](folder & "int_2D_c.npy")
check: a == expected_2d
block:
let a = read_npy[int64](folder & "int_2D_f.npy")
check: a == expected_2d
# Reading from a float32 - little endian (converted to int)
block:
let a = read_npy[int64](folder & "f32LE.npy")
check: a == expected_1d
block:
let a = read_npy[int64](folder & "f32LE_2D_c.npy")
check: a == expected_2d
block:
let a = read_npy[int64](folder & "f32LE_2D_f.npy")
check: a == expected_2d
# Reading from an uint64 - big endian (converted to int)
block:
let a = read_npy[int64](folder & "u64BE.npy")
check: a == expected_1d
block:
let a = read_npy[int64](folder & "u64BE_2D_c.npy")
check: a == expected_2d
block:
let a = read_npy[int64](folder & "u64BE_2D_f.npy")
check: a == expected_2d
test "[IO] Arraymancer produces the same .npy files as Numpy":
when system.cpuEndian == littleEndian:
# int64 - littleEndian
block:
expected_1d.write_npy(test_write_file)
check: sameFileContent(test_write_file, folder & "int.npy")
block:
expected_2d.write_npy(test_write_file)
check: sameFileContent(test_write_file, folder & "int_2D_c.npy")
block:
expected_2d.asContiguous(colMajor, force = true).write_npy(test_write_file)
check: sameFileContent(test_write_file, folder & "int_2D_f.npy")
# float32 - littleEndian
block:
expected_1d.astype(float32).write_npy(test_write_file)
check: sameFileContent(test_write_file, folder & "f32LE.npy")
block:
expected_2d.astype(float32).write_npy(test_write_file)
check: sameFileContent(test_write_file, folder & "f32LE_2D_c.npy")
block:
expected_2d.astype(float32).asContiguous(colMajor, force = true).write_npy(test_write_file)
check: sameFileContent(test_write_file, folder & "f32LE_2D_f.npy")


The best way forward would be to implement a serializer to HDF5 that can (de)serialize any type made only of tensors, with support for nested types.
Then, when saving a model, we can pass it to that serializer.
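
A minimal sketch of that idea, walking an object recursively and dumping every tensor field (it reuses the existing .npy writer for concreteness; the proc name and the one-file-per-tensor layout are illustrative assumptions, and a real version would also need a branch for Variable fields to reach their .value):

import arraymancer

proc saveFields[T: object](obj: T, prefix: string) =
  ## Recursively walk `obj` and write every Tensor field to its own .npy file.
  for name, field in obj.fieldPairs:
    when field is Tensor:
      field.write_npy(prefix & name & ".npy")  # assumes numeric element types
    elif field is object:
      saveFields(field, prefix & name & ".")   # recurse into nested types

# e.g. saveFields(model, "model.") writes one file per nested tensor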


I didn't work on the NN part of Arraymancer during 2019 because I'm revamping the low-level routines in Laser and will adopt a compiler approach: https://github.com/numforge/laser/tree/master/laser/lux_compiler. This is to avoid maintaining separate code paths for CPU, Cuda, OpenCL, Metal, ...

I'm also working on Nim core multithreading routines to provide an efficient, lightweight and composable high-level foundation for building multithreaded programs like Arraymancer on top, as I've had (and still have) multiple OpenMP issues (see: https://github.com/mratsim/weave).

And as I have a full-time job as well, I'm really short on time to tackle issues that require careful design and usability tradeoffs.

@arkocal

arkocal commented Jun 5, 2020

For practicality, it could make sense to provide this functionality in a limited manner (explained below) rather than waiting for a perfect solution that covers all cases, especially considering that this is a very central feature and the issue has been open for almost three years. I would be interested in helping you with it.

I think it is much more critical to save model parameters than network topology and context information, for example by providing a function similar to PyTorch's .save for state_dicts (as described here). The network itself should not be contained in the saved file; it is defined programmatically and is not necessarily part of the data. Losing context information is in general also not a big deal, especially if the trained model is being saved to be run later rather than trained further (which I guess is a common use case). So the limited solution would be to save just the parameters and, when loading, create the model from scratch using the same code and load the parameters into it. This can be implemented externally without changing the already-functioning parts, or by modifying the network macro to generate save/load functions.

The major problem with this approach is the way the network macro is implemented. As the context is bound to a variable, the model cannot simply be reused. I cannot really understand the logic behind this, but please let me know if I am missing something. The network macro creates a type that could otherwise be reused, but is reduced to a singleton because of the context. Wouldn't it be better to create the type so that every object creates its own context, or so that the context is assigned recursively down to the tensors with a function (sketched below)? Such a function would be a workaround for the model re-usability problem and also a good basis for an alternative implementation, if the described change is desired.

Please let me know if you would be interested in a solution like this; if yes, I would gladly take the issue and provide a more concrete design before moving on with an implementation.
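
For illustration, a recursive context-assignment helper along the lines proposed above might look like this; a sketch only, assuming layers are plain objects whose trainable fields are Variable[Tensor[float32]], and with a hypothetical proc name:

import arraymancer

proc setContext[M: object](model: var M, ctx: Context[Tensor[float32]]) =
  ## Recursively point every Variable in `model` at `ctx`, so the same
  ## model value can be reused with a fresh context.
  for _, field in model.fieldPairs:
    when field is Variable[Tensor[float32]]:
      field.context = ctx      # Variable is a ref; only the weak link changes
    elif field is object:
      setContext(field, ctx)   # recurse into nested layer objects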

@arkocal

arkocal commented Jun 5, 2020

@mratsim Forgot to mention.

Thanks for the great library by the way.

@elcritch

Ran into this too, running the ex02_handwritten_digits_recognition.nim example. I use MsgPack a lot, so I tried msgpack4nim. It blew up, but after reading this thread I dove into it a bit more. After a little experimenting, it seems there's an easy way to save the model weights by just adding a couple of helper procs to msgpack4nim for the different layer types. Really simple actually!

Here's all that's needed for the handwritten digits example: define helpers for Conv2DLayer and LinearLayer for the msgpack4nim library and you can do the standard load/save from that library. The resulting trained file is ~22 MB. The limitation, as noted above, is that you need the code defining the model for this to work. Still, it's useful.

import arraymancer, streams, msgpack4nim

proc pack_type*[ByteStream](s: ByteStream, layer: Conv2DLayer[Tensor[float32]]) =
  let weight: Tensor[float32] = layer.weight.value
  let bias: Tensor[float32] = layer.bias.value
  s.pack(weight) # let the compiler decide
  s.pack(bias) # let the compiler decide

proc unpack_type*[ByteStream](s: ByteStream, layer: var Conv2DLayer[Tensor[float32]]) =
  s.unpack(layer.weight.value)
  s.unpack(layer.bias.value)

proc pack_type*[ByteStream](s: ByteStream, layer: LinearLayer[Tensor[float32]]) =
  let weight: Tensor[float32] = layer.weight.value
  let bias: Tensor[float32] = layer.bias.value
  s.pack(weight) # let the compiler decide
  s.pack(bias) # let the compiler decide

proc unpack_type*[ByteStream](s: ByteStream, layer: var LinearLayer[Tensor[float32]]) =
  s.unpack(layer.weight.value)
  s.unpack(layer.bias.value)

proc loadData*[T](data: var T, fl: string) =
  var ss = newFileStream(fl, fmRead)
  if not ss.isNil():
    ss.unpack(data) 
    ss.close()
  else:
    raise newException(ValueError, "no such file?")

proc saveData*[T](data: T, fl: string) =
  var ss = newFileStream(fl, fmWrite)
  if not ss.isNil():
    ss.pack(data) 
    ss.close()

Then calling saveData saves the whole model:

var model = ctx.init(DemoNet)
# ... train model ... 
model.saveData("test_model.mpack")
## restart model
model.loadData("test_model.mpack")
## continues at last training accuracy

@elcritch

A note on the above: MsgPack does pretty well in size compared to pure JSON. The exported msgpack file from above is ~22MB (or 16MB bzipped), while converting it to JSON results in an 87MB file (33MB bzipped). Not sure how HDF5 or npy would compare; probably similar, unless the Tensor type is converted from float32 or some other optimization applies.

@elcritch

elcritch commented Sep 18, 2020

I'm running into what looks to be incomplete saving of a trained model. Saving a fully trained DemoNet model (with 90+% accuracy) using the previously described msgpack4nim method, then reloading the model and running the validation/accuracy testing section, results in only about 6% accuracy.

The msgpack4nim library uses the object fields to know what to serialize. Iterating over fieldPairs(model) for the DemoNet model only prints out fields for: "hidden", "classifier", "cv1", and "cv2". It's missing "x" (Input), "mp1" (MaxPool2D), and "fl" (Flatten).

Originally I thought those must not have state and therefore not need to be stored, but now, with the serialize/deserialize not working as intended, I am not sure. Is there any other state that I would need to ensure is saved to fully serialize a model and deserialize it? Perhaps the deserializing isn't re-packing all the correct fields?
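
(For anyone checking the same thing: a quick way to see exactly which fields msgpack4nim will visit is to iterate fieldPairs yourself. This generic helper is just an illustrative debugging aid, not part of either library:)

proc dumpFields[T: object](obj: T) =
  ## Print every field name and type that fieldPairs exposes;
  ## msgpack4nim serializes exactly these fields.
  for name, value in obj.fieldPairs:
    echo name, ": ", $typeof(value)

# usage: dumpFields(model)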

Here are the "custom" type overrides for the serialized layers for reference:

import arraymancer, streams, msgpack4nim

proc pack_type*[ByteStream](s: ByteStream, layer: Conv2DLayer[Tensor[float32]]) =
  let weight: Tensor[float32] = layer.weight.value
  let bias: Tensor[float32] = layer.bias.value
  s.pack(weight) # let the compiler decide
  s.pack(bias) # let the compiler decide

proc unpack_type*[ByteStream](s: ByteStream, layer: var Conv2DLayer[Tensor[float32]]) =
  s.unpack(layer.weight.value)
  s.unpack(layer.bias.value)

proc pack_type*[ByteStream](s: ByteStream, layer: LinearLayer[Tensor[float32]]) =
  let weight: Tensor[float32] = layer.weight.value
  let bias: Tensor[float32] = layer.bias.value
  s.pack(weight) # let the compiler decide
  s.pack(bias) # let the compiler decide

proc unpack_type*[ByteStream](s: ByteStream, layer: var LinearLayer[Tensor[float32]]) =
  s.unpack(layer.weight.value)
  s.unpack(layer.bias.value)

@mratsim
Owner Author

mratsim commented Sep 19, 2020

  • Input has no state: it's not a weight, it's just there to describe the input shape to the network and properly derive everything else.

AFAIK you're doing the correct thing for weights/bias:

TrainableLayer*[TT] = concept layer
  block:
    var trainable = false
    for field in fields(layer):
      trainable = trainable or (field is Variable[TT])
    trainable

Conv2DLayer*[TT] = object
  weight*: Variable[TT]
  bias*: Variable[TT]

LinearLayer*[TT] = object
  weight*: Variable[TT]
  bias*: Variable[TT]

GRULayer*[TT] = object
  W3s0*, W3sN*: Variable[TT]
  U3s*: Variable[TT]
  bW3s*, bU3s*: Variable[TT]

EmbeddingLayer*[TT] = object
  weight*: Variable[TT]

For the others, I don't store the shape metadata in the layers (it is compile-time transformed away):

LayerTopology* = object
  ## Describe a layer topology
  in_shape*, out_shape*: NimNode # Input and output shape
  case kind*: LayerKind
  of lkConv2D:
    c2d_kernel_shape*: NimNode
    c2d_padding*, c2d_strides*: NimNode
  of lkMaxPool2D:
    m2d_kernel*, m2d_padding*, m2d_strides*: NimNode
  of lkGRU:
    gru_seq_len*: NimNode
    gru_hidden_size*: NimNode
    gru_nb_layers*: NimNode
  else:
    discard

but I probably should, to ease serialization.

@elcritch

OK, thanks, that's good to know that the weights/biases seem correct. There's a good chance I am missing a part of the serialization or messing up the prediction. All of the re-serialized tensor values appear to be correct.

One last question: is there anything special about the Variable[T] wrappers? Currently I'm instantiating a new instance of the model from the model definition:

var model = ctx.init(DemoNet)
...
model.loadData(model_file_path)

loadData will unpack all of the fields (by copying or setting fields, I presume); could that be messing up the variable contexts somehow? I wouldn't think so, but I'm unsure about some of the nuances of Nim regarding references and copies.

@elcritch

For the others, I don't store the shape metadata in the layers (it is compile-time transformed away)

Eventually that would be nice. Currently I'm just redefining the model code, which works for my use case.

@mratsim
Owner Author

mratsim commented Sep 20, 2020

Variable stores the following:

Variable*[TT] = ref object # {.acyclic.}
  ## A variable is a wrapper for Tensors that tracks operations applied to it.
  ## It consists of:
  ## - A weak reference to a record of operations ``context``
  ## - The tensor being tracked ``value``
  ## - The gradient of the tensor ``grad``
  ## - a flag that indicates if gradient is needed
  context*: Context[TT]
    # Variables shouldn't own their Context
  value*: TT
  grad*: TT
  requires_grad*: bool

The Context starts empty:

type
  Context*[TT] = ref object # {.acyclic.}
    ## An autograd context is a record of operations or layers.
    ## It holds the following fields:
    ## - ``nodes``: This records the list of operations (``Node``) applied in the context
    ## - ``no_grad``: This disables tracing the list of operations altogether.
    ##   This is useful to save memory when you don't need the gradient
    ##   (for validation or prediction for example)
    ##
    ## A context is also called a tape or a Wengert list.
    ##
    ## Note: backpropagation empties the list of operations.
    nodes: seq[Node[TT]]
    no_grad: bool

and then, as we pass through layers, a record of the operations applied is appended to Context.nodes. no_grad is a runtime flag to activate/deactivate recording depending on training or inference, so no need to save it.

The value field is the actual weight and must be saved.

The grad field is not important: it is used to accumulate the gradient of the layer during backpropagation when requires_grad is set to true. The optimizer (SGD, Adam) then multiplies it by the learning rate (for SGD) or something more advanced (for Adam) and subtracts the result from the value field.
It is always zeroed during training, so there is no need to serialize it: https://github.com/mratsim/Arraymancer/blob/1a2422a1/src/arraymancer/autograd/gates_blas.nim#L18-L58
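
In other words, restoring a model should come down to rebuilding each Variable from its saved value in a fresh context. A minimal sketch (the file name is illustrative; newContext, variable and read_npy are existing Arraymancer APIs):

import arraymancer

let ctx = newContext Tensor[float32]
# grad starts zeroed and requires_grad is set explicitly, so only the
# tensor in `value` needs to come from disk.
let w = ctx.variable(read_npy[float32]("weight.npy"), requires_grad = true)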

@elcritch

Thanks, I tried reading dsl.nim but got lost on where things were defined.

Based on those code snippets, the only place I'm not sure is set up correctly is context; perhaps it's getting set incorrectly. The grad and requires_grad fields shouldn't be needed: they're probably being overwritten, but it sounds like that shouldn't matter in the ctx.no_grad_mode where I'm doing the prediction.

If I understand it correctly, the nodes field isn't used for computation when doing a model.forward, right?

@elcritch

elcritch commented Sep 21, 2020

I'm only saving the model after training. Are any of the above stored in the ctx?

edit: Looking through this more, it doesn't appear so. I think the model is being saved/restored correctly. It may be a bug in how I'm ordering my tests when doing predictions. The squeeze/unsqueeze operations didn't work well on my 1-item labels.

@disruptek

Probably not terribly useful long-term, but for rough purposes you might try https://github.com/disruptek/frosty. It's kinda designed for "I know what I'm doing" hacks and it could help your differential diagnosis.

I used msgpack4nim but wanted more of a fire-and-forget solution that I could trust.

@forest1102

Any update?
How can I save and retrieve a model's weights?

1 similar comment

@Niminem
Contributor

Niminem commented Aug 11, 2021

I'm at a loss for saving and loading models... Respectfully, how are we supposed to use Arraymancer for deep learning without being able to do this?

@Niminem
Contributor

Niminem commented Aug 12, 2021

Things I learned from trying to solve this problem all day; hope it helps someone:

In order to save/load the weights and biases of your model, you'll first need to define these manually:

  1. Layer types
  2. Network type
  3. Weight and bias initializations
  4. Network init proc
  5. Forward proc

Working test example:

type
  LinearLayer = object
    weight: Variable[Tensor[float32]]
    bias: Variable[Tensor[float32]]
  ExampleNetwork = object
    hidden: LinearLayer
    output: LinearLayer

template weightInit(shape: varargs[int], init_kind: untyped): Variable =
  ctx.variable(
    init_kind(shape, float32),
    requires_grad = true)

proc newExampleNetwork(ctx: Context[Tensor[float32]]): ExampleNetwork =
  result.hidden.weight = weightInit(HIDDEN_D, INPUT_D, kaiming_normal)
  result.hidden.bias = ctx.variable(zeros[float32](1, HIDDEN_D), requires_grad = true)
  result.output.weight = weightInit(OUTPUT_D, HIDDEN_D, yann_normal)
  result.output.bias = ctx.variable(zeros[float32](1, OUTPUT_D), requires_grad = true)

proc forward(network: ExampleNetwork, x: Variable): Variable =
  result = x.linear(
    network.hidden.weight, network.hidden.bias).relu.linear(
    network.output.weight, network.output.bias)

Then you'll need to create your save/load procs. I'll save you the headache here as well: use numpy files. Long story short, forget about HDF5, and the others aren't as efficient.

Working test example:

proc save(network: ExampleNetwork) =
  network.hidden.weight.value.write_npy("hiddenweight.npy")
  network.hidden.bias.value.write_npy("hiddenbias.npy")
  network.output.weight.value.write_npy("outputweight.npy")
  network.output.bias.value.write_npy("outputbias.npy")

proc load(ctx: Context[Tensor[float32]]): ExampleNetwork =
  result.hidden.weight = ctx.variable(read_npy[float32]("hiddenweight.npy"), requires_grad = true)
  result.hidden.bias = ctx.variable(read_npy[float32]("hiddenbias.npy"), requires_grad = true)
  result.output.weight = ctx.variable(read_npy[float32]("outputweight.npy"), requires_grad = true)
  result.output.bias = ctx.variable(read_npy[float32]("outputbias.npy"), requires_grad = true)

At some point in the future I'll work on getting the network macro to integrate loading and saving models, but for now this POC/example should help push you in the right direction.
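
For completeness, hypothetical usage of the two procs above (assuming HIDDEN_D, the training loop, etc. are defined as in the earlier example):

let ctx = newContext Tensor[float32]
var net = newExampleNetwork(ctx)
# ... train net ...
net.save()               # writes the four .npy weight/bias files
let restored = load(ctx) # rebuilds the network from the saved files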
