Parallel loading of the model tensors

Upstream users have reported that models load faster when the tensors are loaded in parallel: https://github.com/ggerganov/llama.cpp/issues/85

This should be pretty easy to do in Rust: if we restructure tensor loading as an iterator (`iter`), we can then swap in rayon's `par_iter`. It *seems* like loading should be I/O bound, in which case extra threads might not buy much, but perhaps the actual loading process has computational overhead?
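To make the idea concrete, here is a rough sketch of what loading tensors concurrently could look like. It uses only the standard library (`std::thread::scope`) so that it stands alone; it is not the project's actual loader. `TensorEntry`, `read_tensor`, and `load_tensors_parallel` are hypothetical names, and the sketch assumes each tensor's byte offset and length within the file are already known before the tensor data is read.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;
use std::thread;

/// Hypothetical record of where one tensor's raw bytes live in the model file.
#[derive(Debug, Clone)]
pub struct TensorEntry {
    pub name: String,
    pub offset: u64,
    pub len: usize,
}

/// Read one tensor's bytes. Each call opens its own file handle so that
/// threads never share a seek position.
fn read_tensor(path: &Path, entry: &TensorEntry) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(entry.offset))?;
    let mut buf = vec![0u8; entry.len];
    file.read_exact(&mut buf)?;
    Ok(buf)
}

/// Load every tensor, splitting `entries` across up to `workers` threads.
/// The returned buffers are in the same order as `entries`.
pub fn load_tensors_parallel(
    path: &Path,
    entries: &[TensorEntry],
    workers: usize,
) -> io::Result<Vec<Vec<u8>>> {
    if entries.is_empty() {
        return Ok(Vec::new());
    }
    let workers = workers.max(1);
    // Ceiling division so that every entry lands in some chunk.
    let chunk_size = (entries.len() + workers - 1) / workers;

    thread::scope(|scope| -> io::Result<Vec<Vec<u8>>> {
        // Spawn one scoped thread per chunk of entries.
        let handles: Vec<_> = entries
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|entry| read_tensor(path, entry))
                        .collect::<io::Result<Vec<Vec<u8>>>>()
                })
            })
            .collect();

        // Join in spawn order, which preserves the original tensor order.
        let mut loaded = Vec::with_capacity(entries.len());
        for handle in handles {
            loaded.extend(handle.join().expect("loader thread panicked")?);
        }
        Ok(loaded)
    })
}
```

Calling `load_tensors_parallel(path, &entries, 4)` would read the listed byte ranges on up to four threads. With rayon as a dependency, the manual chunking could presumably collapse to something like `entries.par_iter().map(|e| read_tensor(path, e)).collect()`. Timing this against a single-threaded loop would also help answer whether loading is really I/O bound.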