
Support array of &str / unicode_ / string #141

Closed
jorgecarleitao opened this issue Jul 13, 2020 · 7 comments · Fixed by #378

Comments

@jorgecarleitao

As described in the title, this would allow storing strings natively and using NumPy's native operations on them.

@adamreichold
Member

Arrays containing PyString should already be supported as PyArray<PyObject>. Descriptors like <Un will probably need const generics though.
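
For illustration, a minimal sketch of that workaround using the GIL-ref PyO3 API of the time (strings_as_object_array is a hypothetical helper, not rust-numpy API):

use numpy::PyArray1;
use pyo3::prelude::*;
use pyo3::types::PyString;

// Store strings as Python objects in an object-dtype array.
fn strings_as_object_array(py: Python<'_>) -> &PyArray1<PyObject> {
    let items: Vec<PyObject> = ["foo", "bar"]
        .iter()
        .map(|s| PyString::new(py, s).to_object(py))
        .collect();
    PyArray1::from_vec(py, items)
}

This gives dtype=object rather than a native <Un dtype, so NumPy stores pointers into the Python heap instead of inline fixed-width string data.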

@adamreichold
Member

The main problem with the <Un types that I see is that it is difficult to decide at compile time which T should be used in PyArray<T, D>. [Py_UCS4; N] would make sense from a memory layout point of view, but I don't think this is too helpful as the maximum size of the elements is usually not known at compile time. But we do assume T: Sized in PyArray<T, D>, so I do not really see how PyArray could be made to handle this case directly. 🤔
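
To make the layout point concrete, a small sketch (my own check, not project code): a <Un element is exactly n NUL-padded UCS4 code points, so [Py_UCS4; N] has the matching size:

use pyo3::ffi::Py_UCS4; // alias for u32

fn main() {
    // A `<U12` element stores 12 NUL-padded UCS4 code points => 48 bytes,
    // which is exactly the size of [Py_UCS4; 12].
    const N: usize = 12;
    assert_eq!(std::mem::size_of::<[Py_UCS4; N]>(), N * 4);
}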

@rachtsingh

This would be really helpful to us if anyone is attempting to do it. I can give it a shot as well.

but I don't think this is too helpful as the maximum size of the elements is usually not known at compile time

In our case (writing a parser) we know the string width (e.g. it's 2 or 12) at compile time. And we now have const generics.

@adamreichold
Member

This would be really helpful to us if anyone is attempting to do it.

From my point of view this is basically blocked on #186, which is blocked on more versatile buffer protocol support in PyO3, cf. #321.

In our case (writing a parser) we know the string width (e.g. it's 2 or 12) at compile time.

Are you sure that NumPy will not dynamically use a size between 2 and 12 if the strings in the array are all shorter than 12 code points? Do you set the dtype explicitly when constructing the arrays?

And we now have const generics.

I would still advise waiting a bit until we bump our MSRV from 1.48 to most likely 1.56 when Debian Bookworm is released, as you would otherwise need to provide some build.rs/#[cfg(..)] boilerplate to disable the support on Rust 1.48.

Of course, creating a draft PR which works except for the MSRV build might still be a good thing to start discussing the work. I think

impl<const N: usize> Element for [Py_UCS4; N] { .. }

and a test of your use case would be the minimum required?
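
For the record, a rough sketch of how that impl might look inside rust-numpy. This is a guess against the Element trait of that time, not the code that eventually landed in #378, and it assumes PyArrayDescr::new accepts dtype strings such as "<U12":

use numpy::{Element, PyArrayDescr};
use pyo3::ffi::Py_UCS4;
use pyo3::prelude::*;

// Sketch only; it would have to live inside rust-numpy due to the orphan rule.
unsafe impl<const N: usize> Element for [Py_UCS4; N] {
    // Plain fixed-size data without Python object pointers can be memcpy'd.
    const IS_COPY: bool = true;

    fn get_dtype(py: Python) -> &PyArrayDescr {
        // Build the `<Un` descriptor from its dtype string, e.g. "<U12".
        PyArrayDescr::new(py, &format!("<U{}", N))
            .expect("NumPy accepts `<Un` dtype strings")
    }
}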

@rachtsingh

rachtsingh commented May 11, 2023

Hmm, I can't see the relation between supporting e.g. U2 arrays and record types (which are, in my opinion, not very standard practice in the numpy world), but I could easily be missing the context.

Are you sure that NumPy will not dynamically use a size between 2 and 12 if the strings in the array are all shorter than 12 code points? Do you set the dtype explicitly when constructing the arrays?

NumPy doesn't appear to automatically compact the dtype of arrays, i.e. it supports the case where the dtype is <UM and all strings are of max size N < M:

In [9]: x = np.array(['a', 'b'], dtype='<U12')

In [10]: x
Out[10]: array(['a', 'b'], dtype='<U12')

As to explicitly setting the dtype: I'm not entirely sure what you mean, but we can. Our specific use case would be writing (in Rust, using PyO3/maturin) a function that takes a file and returns a numpy array of dtype <U12; raising an error on encountering a string with more than 12 code points would be fine here.
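
A sketch of that use case, assuming an Element impl for [Py_UCS4; N] along the lines proposed above (parse and its signature are hypothetical):

use numpy::PyArray1;
use pyo3::exceptions::PyValueError;
use pyo3::ffi::Py_UCS4;
use pyo3::prelude::*;

const WIDTH: usize = 12;

// Parse newline-separated tokens into a `<U12` array, erroring on long strings.
#[pyfunction]
fn parse<'py>(py: Python<'py>, text: &str) -> PyResult<&'py PyArray1<[Py_UCS4; WIDTH]>> {
    let mut rows = Vec::new();
    for line in text.lines() {
        let mut cell = [0 as Py_UCS4; WIDTH]; // NUL-padded, like NumPy's `<U12`
        let mut n = 0;
        for c in line.chars() {
            if n == WIDTH {
                return Err(PyValueError::new_err(format!(
                    "string exceeds {} code points: {:?}",
                    WIDTH, line
                )));
            }
            cell[n] = c as Py_UCS4;
            n += 1;
        }
        rows.push(cell);
    }
    Ok(PyArray1::from_vec(py, rows))
}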

I would still advise waiting a bit until we bump our MSRV from 1.48 to most likely 1.56

Makes sense, thanks for the context. Do you have a vague guess as to when that'll be?

I think ... and a test of your use case would be the minimum required?

Sounds good. This isn't a priority for us right now so I can't promise this will be done with any haste, but thanks for pointing out where to start.

@adamreichold
Member

Hmm, I can't see the relation between supporting e.g. U2 arrays and record types (which are, in my opinion, not very standard practice in the numpy world), but I could easily be missing the context.

The main thing is that we need a more general integration with Python's buffer protocol where we do not assume that the element type T of PyArray<T, D> has a fixed size known at compile time. As written above, [Py_UCS4; N] can be supported right now, but it is a pretty limited form of supporting unicode arrays.

As to explicitly setting the dtype: I'm not entirely sure what you mean

I meant exactly that: passing dtype='<U12' to the np.array constructor. Without it, one gets e.g.

>>> np.array(['a', 'b'])
array(['a', 'b'], dtype='<U1')

where NumPy automatically chooses the smallest possible <Un dtype.

Do you have a vague guess as to when that'll be?

Debian Bookworm is expected to be released on 2023-06-10, which is what we are waiting for to bump our MSRV.

@adamreichold
Member

@rachtsingh Maybe #378 is already useful for you? Merging will have to wait for our MSRV bump though as discussed above.
