Arrays with ndim != 1 #10
Comments
A note: The public API I've been prototyping for n-dimensional arrays is:
Reading
The existing API is good enough, we just need a few more getters:
pub enum Order { C, Fortran }
impl<'a, T: Deserialize> NpyData<'a, T> {
/// Get the shape as written in the file.
pub fn shape(&self) -> &[usize] {
&self.shape
}
/// Get strides for each of the dimensions.
///
/// This is the amount by which the item index changes as you move along each dimension.
/// It is a function of both [`NpyData::order`] and [`NpyData::shape`],
/// provided for your convenience.
pub fn strides(&self) -> &[usize] {
&self.strides
}
/// Get whether the data is in C order or fortran order.
pub fn order(&self) -> Order {
self.order
}
}
Documentation of I did not bother adding anything like e.g.
Writing
With #16, #17, and this, I found myself wrestling with 2x2x2=8 constructors for NpyWriter. That's far too much! So I introduced a Builder API:
/// Configuration for an output `.NPY` file.
pub struct Builder<Row>;
// returns Builder::new()
impl<Row> Default for Builder<Row> { ... }
impl<Row> Builder<Row> {
/// Construct a builder with default configuration.
///
/// Data order will be initially set to C order.
///
/// No dtype will be configured; the [`Builder::dtype`] method **must** be called.
pub fn new() -> Self;
/// Set the data order for arrays with more than one dimension.
pub fn order(mut self, order: Order) -> Self;
/// Use the specified dtype.
pub fn dtype(mut self, dtype: DType) -> Self;
/// Calls [`Builder::dtype`] with the default dtype for the type to be serialized.
pub fn default_dtype(self) -> Self
where Row: AutoSerialize;
}
impl<Row: Serialize> Builder<Row> {
/// Begin writing an array of known shape.
///
/// Panics if [`Builder::dtype`] was not called.
pub fn begin_with_shape<W: Write>(&self, w: W, shape: &[usize]) -> io::Result<NpyWriter<Row, W>>;
/// Begin writing a 1d array, of length to be inferred.
///
/// Panics if [`Builder::dtype`] was not called.
pub fn begin_1d<W: Write + Seek>(&self, w: W) -> io::Result<NpyWriter<Row, W>>;
}
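To make the call pattern of such a builder concrete, here is a minimal, self-contained sketch. It is not the crate's code: `DType` is reduced to a hypothetical two-variant enum, the `Row` parameter is dropped, and the hypothetical method `header_for_shape` only renders the NPY header dictionary instead of returning an `NpyWriter`. It keeps the two behaviours described above: C order by default, and a panic if no dtype was configured.

```rust
/// Memory layout of a multi-dimensional array.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Order { C, Fortran }

/// Hypothetical stand-in for the crate's `DType`: two plain types only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DType { F64, I32 }

impl DType {
    /// NumPy type string for a little-endian value of this type.
    fn descr(self) -> &'static str {
        match self {
            DType::F64 => "<f8",
            DType::I32 => "<i4",
        }
    }
}

/// Simplified builder: no `Row` parameter, no writer.
#[derive(Debug, Clone)]
pub struct Builder {
    order: Order,
    dtype: Option<DType>,
}

impl Default for Builder {
    fn default() -> Self { Builder::new() }
}

impl Builder {
    /// Defaults to C order with no dtype configured.
    pub fn new() -> Self { Builder { order: Order::C, dtype: None } }

    pub fn order(mut self, order: Order) -> Self { self.order = order; self }

    pub fn dtype(mut self, dtype: DType) -> Self { self.dtype = Some(dtype); self }

    /// Render the header dictionary for an array of known shape.
    /// Panics if `dtype` was never called (mirrors the documented contract).
    pub fn header_for_shape(&self, shape: &[usize]) -> String {
        let dtype = self.dtype.expect("Builder::dtype was not called");
        let dims: Vec<String> = shape.iter().map(|d| d.to_string()).collect();
        // A one-element tuple needs a trailing comma in Python syntax.
        let shape_str = match dims.len() {
            1 => format!("({},)", dims[0]),
            _ => format!("({})", dims.join(", ")),
        };
        let fortran = if self.order == Order::Fortran { "True" } else { "False" };
        format!(
            "{{'descr': '{}', 'fortran_order': {}, 'shape': {}, }}",
            dtype.descr(),
            fortran,
            shape_str
        )
    }
}
```

With this shape, one chain such as `Builder::new().order(Order::Fortran).dtype(DType::F64)` covers what would otherwise need a separate constructor per combination of options.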
Hmmm, eliminating the
impl<Row: Serialize> Builder<Row> {
pub fn begin_with_shape<W: Write>(&self, w: W, shape: &[usize])
-> io::Result<NpyWriter<Row, PanicSeek<W>>>; // <--------- ewww, gross
pub fn begin_1d<W: Write + Seek>(&self, w: W)
-> io::Result<NpyWriter<Row, W>>;
}
I had a bit of a clever idea to try hiding it with dynamic polymorphism; however, this introduces a lifetime:
pub(crate) enum MaybeSeek<'w, W> {
Is(Box<dyn WriteSeek + 'w>),
Isnt(W),
}
impl<Row: Serialize> Builder<Row> {
pub fn begin_with_shape<'w, W: Write + 'w>(&self, w: W, shape: &[usize])
-> io::Result<NpyWriter<'w, Row, W>>;
pub fn begin_1d<'w, W: Write + Seek + 'w>(&self, w: W)
-> io::Result<NpyWriter<'w, Row, W>>;
}
trait WriteSeek: Write + Seek {}
impl<W: Write + Seek> WriteSeek for W {}
I feel like it should be possible to remove this lifetime by turning the trait into
By the way, I benchmarked this, and the worst impact on performance I was able to register was a 5% slowdown, which occurs on plain dtypes (record dtypes are basically unaffected).
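For readers who want to see how the `MaybeSeek` idea behaves, here is a small standalone sketch. The enum and the `WriteSeek` trait follow the snippet above; the constructors `new_seek` and `new_plain` and the method `try_seek` are hypothetical names added for illustration, and none of this is claimed to match the actual implementation in the PR.

```rust
use std::io::{self, Seek, SeekFrom, Write};

/// Object-safe combination of `Write` and `Seek`.
trait WriteSeek: Write + Seek {}
impl<W: Write + Seek> WriteSeek for W {}

/// Either a type-erased seekable writer, or a plain writer of type `W`.
enum MaybeSeek<'w, W> {
    Is(Box<dyn WriteSeek + 'w>),
    Isnt(W),
}

impl<'w, W: Write> MaybeSeek<'w, W> {
    /// Hypothetical constructor: erase a seekable writer behind a trait object.
    fn new_seek(w: W) -> Self
    where
        W: Seek + 'w,
    {
        MaybeSeek::Is(Box::new(w))
    }

    /// Hypothetical constructor: keep a non-seekable writer as-is.
    fn new_plain(w: W) -> Self {
        MaybeSeek::Isnt(w)
    }

    /// Seek if the underlying writer supports it; otherwise return an error.
    fn try_seek(&mut self, pos: SeekFrom) -> io::Result<u64> {
        match self {
            MaybeSeek::Is(w) => w.seek(pos),
            MaybeSeek::Isnt(_) => Err(io::Error::new(
                io::ErrorKind::Other,
                "writer is not seekable",
            )),
        }
    }
}

impl<'w, W: Write> Write for MaybeSeek<'w, W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        match self {
            MaybeSeek::Is(w) => w.write(buf),
            MaybeSeek::Isnt(w) => w.write(buf),
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        match self {
            MaybeSeek::Is(w) => w.flush(),
            MaybeSeek::Isnt(w) => w.flush(),
        }
    }
}
```

The `Is` arm goes through a vtable on every call while the `Isnt` arm stays statically dispatched. The `'w` parameter exists only because the boxed trait object has to be told how long the erased writer may live.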
My use case was to read multidimensional tensors, so this limitation is also quite a showstopper. Passing the shape information around separately is ugly and error-prone, considering that it is available from the file. Could we not start simple: Only add a
Edit: Looking over the documentation/code a bit more, I'm getting confused. The
N.B. When I was working on this, @potocpav seemed to have been very busy with other things. At some point in the past, I began to get a bit impatient waiting for my PRs to move, and had determined that if I didn't hear back within another week that I would start fixing up examples and documentation and preparing to release my own fork under a new name... and then suddenly I heard back from the author on my PR. (but then he disappeared again!)
My memory is foggy but I definitely do recall seeing some confusing things regarding what is and what isn't public API. You may want to check out my fork and see if that API makes more sense. (try
Edit 2021/07/01: I have now released https://crates.io/crates/npyz which has this feature and a lot more.
Is there any particular reason why the current implementation is restricted to 1-dimensional arrays?
In particular, right now I want to use this crate to read a bsr_matrix saved by scipy.sparse.save_npz. The output contains a data.npy which has a header like this:
I already have a type of my own that can represent this; all I really need from the npy crate are its shape and a raw Vec<f64> of all the data.
More generally, I think that the reasonable behavior for this crate when ndim != 1 is to continue doing exactly what it already does: give a flat, 1-dimensional iterator of the data's scalars in reading order, and have to_vec() produce a flat 1D Vec<_>. If anybody feels the need to make the dimensionality of the data more "substantial," they're free to inspect the shape attribute and go nuts.
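As an illustration of what a caller can do with nothing more than a flat `Vec` and a shape, the sketch below derives item strides from a shape and an order, then maps a multi-dimensional index to a position in the flat data. The free functions `strides` and `flat_index` are hypothetical helpers written for this example, not part of the crate.

```rust
/// Memory layout of a multi-dimensional array.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Order { C, Fortran }

/// Item strides (in elements, not bytes) for `shape` under `order`.
pub fn strides(order: Order, shape: &[usize]) -> Vec<usize> {
    let mut out = vec![1; shape.len()];
    match order {
        // C order: the last axis varies fastest.
        Order::C => {
            for i in (0..shape.len().saturating_sub(1)).rev() {
                out[i] = out[i + 1] * shape[i + 1];
            }
        }
        // Fortran order: the first axis varies fastest.
        Order::Fortran => {
            for i in 1..shape.len() {
                out[i] = out[i - 1] * shape[i - 1];
            }
        }
    }
    out
}

/// Position in the flat data of the element at `index`.
pub fn flat_index(strides: &[usize], index: &[usize]) -> usize {
    strides.iter().zip(index).map(|(s, i)| s * i).sum()
}
```

For a shape of `[2, 3, 4]`, this gives strides `[12, 4, 1]` in C order and `[1, 2, 6]` in Fortran order, so the element at index `[1, 0, 2]` sits at flat position 14 or 13 respectively.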