
Support array of &str / unicode_ / string #141

Closed
jorgecarleitao opened this issue Jul 13, 2020 · 7 comments · Fixed by #378

Comments

@jorgecarleitao

As described in the title, this would allow storing strings natively and using NumPy's native operations on them.

@adamreichold
Member

Arrays containing PyString should already be supported as PyArray<PyObject>. Descriptors like <Un will probably need const generics though.
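
For illustration, a minimal sketch of that workaround using the GIL-ref PyO3 API of the time (strings_as_object_array is a hypothetical helper, not rust-numpy API):

use numpy::PyArray1;
use pyo3::prelude::*;
use pyo3::types::PyString;

// Store strings as Python objects in an object-dtype array.
fn strings_as_object_array(py: Python<'_>) -> &PyArray1<PyObject> {
    let items: Vec<PyObject> = ["foo", "bar"]
        .iter()
        .map(|s| PyString::new(py, s).to_object(py))
        .collect();
    PyArray1::from_vec(py, items)
}

This gives dtype=object rather than a native <Un dtype, so NumPy stores pointers into the Python heap instead of inline fixed-width string data.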

@adamreichold
Member

The main problem with the <Un types that I see is that it is difficult to decide at compile time which T should be used in PyArray<T, D>. [Py_UCS4; N] would make sense from a memory layout point of view, but I don't think this is too helpful as the maximum size of the elements is usually not known at compile time. But we do assume T: Sized in PyArray<T, D>, so I do not really see how PyArray could be made to handle this case directly. 🤔
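
To make the layout point concrete, a small sketch (my own check, not project code): a <Un element is exactly n NUL-padded UCS4 code points, so [Py_UCS4; N] has the matching size:

use pyo3::ffi::Py_UCS4; // alias for u32

fn main() {
    // A `<U12` element stores 12 NUL-padded UCS4 code points => 48 bytes,
    // which is exactly the size of [Py_UCS4; 12].
    const N: usize = 12;
    assert_eq!(std::mem::size_of::<[Py_UCS4; N]>(), N * 4);
}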

@rachtsingh

This would be really helpful to us if anyone is attempting to do it. I can give it a shot as well.

but I don't think this is too helpful as the maximum size of the elements is usually not known at compile time

In our case (writing a parser) we know the string width (e.g. it's 2 or 12) at compile time. And we now have const generics.

@adamreichold
Member

This would be really helpful to us if anyone is attempting to do it.

From my point of view this is basically blocked on #186, which is blocked on more versatile buffer protocol support in PyO3, cf. #321.

In our case (writing a parser) we know the string width (e.g. it's 2 or 12) at compile time.

Are you sure that NumPy will not dynamically use a size between 2 and 12 if the strings in the array are all shorter than 12 code points? Do you set the dtype explicitly when constructing the arrays?

And we now have const generics.

I would still advise waiting a bit until we bump our MSRV from 1.48 to most likely 1.56 when Debian Bookworm is released, as you would otherwise need to provide some build.rs/#[cfg(..)] boilerplate to disable the support on Rust 1.48.

Of course, creating a draft PR which works except for the MSRV build might still be a good thing to start discussing the work. I think

impl<const N: usize> Element for [Py_UCS4; N] { .. }

and a test of your use case would be the minimum required?
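
For the record, a rough sketch of how that impl might look inside rust-numpy. This is a guess against the Element trait of that time, not the code that eventually landed in #378, and it assumes PyArrayDescr::new accepts dtype strings such as "<U12":

use numpy::{Element, PyArrayDescr};
use pyo3::ffi::Py_UCS4;
use pyo3::prelude::*;

// Sketch only; it would have to live inside rust-numpy due to the orphan rule.
unsafe impl<const N: usize> Element for [Py_UCS4; N] {
    // Plain fixed-size data without Python object pointers can be memcpy'd.
    const IS_COPY: bool = true;

    fn get_dtype(py: Python) -> &PyArrayDescr {
        // Build the `<Un` descriptor from its dtype string, e.g. "<U12".
        PyArrayDescr::new(py, &format!("<U{}", N))
            .expect("NumPy accepts `<Un` dtype strings")
    }
}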

@rachtsingh

rachtsingh commented May 11, 2023

Hmm, I can't see the relation between supporting e.g. U2 arrays and record types (which are, in my opinion, not very standard practice in the numpy world), but I could easily be missing the context.

Are you sure that NumPy will not dynamically use a size between 2 and 12 if the strings in the array are all shorter than 12 code points? Do you set the dtype explicitly when constructing the arrays?

NumPy doesn't appear to automatically compact the dtype of arrays, i.e. it supports the case where the dtype is <UM and all strings are of max size N < M:

In [9]: x = np.array(['a', 'b'], dtype='<U12')

In [10]: x
Out[10]: array(['a', 'b'], dtype='<U12')

As to explicitly setting the dtype: I'm not entirely sure what you mean, but we can. Our specific use case would be writing (in Rust, using PyO3/maturin) a function that takes a file and returns a numpy array of dtype <U12; raising an error on encountering a string with more than 12 code points would be fine here.
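
A sketch of that use case, assuming an Element impl for [Py_UCS4; N] along the lines proposed above (parse and its signature are hypothetical):

use numpy::PyArray1;
use pyo3::exceptions::PyValueError;
use pyo3::ffi::Py_UCS4;
use pyo3::prelude::*;

const WIDTH: usize = 12;

// Parse newline-separated tokens into a `<U12` array, erroring on long strings.
#[pyfunction]
fn parse<'py>(py: Python<'py>, text: &str) -> PyResult<&'py PyArray1<[Py_UCS4; WIDTH]>> {
    let mut rows = Vec::new();
    for line in text.lines() {
        let mut cell = [0 as Py_UCS4; WIDTH]; // NUL-padded, like NumPy's `<U12`
        let mut n = 0;
        for c in line.chars() {
            if n == WIDTH {
                return Err(PyValueError::new_err(format!(
                    "string exceeds {} code points: {:?}",
                    WIDTH, line
                )));
            }
            cell[n] = c as Py_UCS4;
            n += 1;
        }
        rows.push(cell);
    }
    Ok(PyArray1::from_vec(py, rows))
}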

I would still advise waiting a bit until we bump our MSRV from 1.48 to most likely 1.56

Makes sense, thanks for the context. Do you have a vague guess as to when that'll be?

I think ... and a test of your use case would be the minimum required?

Sounds good. This isn't a priority for us right now so I can't promise this will be done with any haste, but thanks for pointing out where to start.

@adamreichold
Member

Hmm, I can't see the relation between supporting e.g. U2 arrays and record types (which are, in my opinion, not very standard practice in the numpy world), but I could easily be missing the context.

The main thing is that we need a more general integration with Python's buffer protocol where we do not assume that the element type T of PyArray<T, D> has a fixed size known at compile time. As written above, [Py_UCS4; N] can be supported right now, but it is a pretty limited form of supporting unicode arrays.

As to explicitly setting the dtype: I'm not entirely sure what you mean

I meant exactly that: passing dtype='<U12' to the np.array constructor. Without it, one gets e.g.

>>> np.array(['a', 'b'])
array(['a', 'b'], dtype='<U1')

where NumPy automatically chooses the smallest possible <Un dtype.

Do you have a vague guess as to when that'll be?

Debian Bookworm is expected to be released on 2023-06-10, which is what we are waiting for to bump our MSRV.

@adamreichold
Member

@rachtsingh Maybe #378 is already useful for you? Merging will have to wait for our MSRV bump though as discussed above.
