Getting a buffer from a Unicode array uses invalid format #57281
Comments
In Python 3.2, when you get a buffer from array.array('u'), "u" is used as the buffer format. The format is supposed to be a format from the struct module, and "u" is not a valid struct format. On wide builds, "w" is used instead. I just upgraded the array module to use the new Unicode API (PEP-393). The array now uses a Py_UCS4 buffer. I used the "I" or "L" format, depending on the sizes of the C int and long types. It would be better to use a format for a Py_UCS4 string, but struct doesn't support such a type. For Python 2.7 and 3.2, I don't know whether it is really a bug or not. |
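One way to check the claim above is to ask the struct module directly whether it accepts a given format code. This is only a minimal sketch: the helper name `is_valid_struct_format` is made up for illustration, and it relies solely on `struct.calcsize` raising `struct.error` for codes it does not know.

```python
import struct


def is_valid_struct_format(code):
    """Return True if the struct module accepts `code` as a format string.

    Hypothetical helper for illustration; not part of the array or struct API.
    """
    try:
        struct.calcsize(code)
    except struct.error:
        return False
    return True


# 'u' and 'w' are the codes exported by array('u') in this report;
# 'I' and 'L' are ordinary unsigned integer codes that struct does define.
for code in ("u", "w", "I", "L"):
    print(code, is_valid_struct_format(code))
```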
The automatic conversion of 'u' to 'I' or 'L' causes test_buffer
# Not implemented formats. Ugly, but inevitable. This is the same as
# issue python/cpython#46783: equality is also used for membership testing and must
# return a result.
a = array.array('u', 'xyz')
v = memoryview(a)
self.assertNotEqual(v, a)
self.assertNotEqual(a, v)
I don't have a better idea though what to do about 'u' except |
PEP-3118 suggests for the extended struct syntax: 'c' -> ucs-1 (latin-1) encoding |
I don't understand: a buffer format is a format for the struct module, |
STINNER Victor <report@bugs.python.org> wrote:
It's like this: memoryview follows the current struct syntax, which This isn't so important, since I discovered (see my later post) So I think we should focus on whether the proposed 'c', 'u' and 'w' |
@Stefan: What is the status of this issue? |
I'm not sure what to do. Martin's opinion was that the change should http://mail.python.org/pipermail/python-dev/2012-March/117390.html |
Should we do something before Python 3.3 final? |
Is it possible without too much effort to keep the old behavior The problem with the current behavior is that it's neither backwards If it is too much work to restore the status quo, we could leave this |
Here is a patch reverting the PEP-393 changes, as suggested by Martin von Loewis. With the patch, array uses the Py_UNICODE* type for the 'u' format. So array.array('u', '\U0010ffff')[0] should return '\uDBFF' on Windows. |
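The '\uDBFF' value mentioned above follows from how a non-BMP code point is split into two 16-bit code units when Py_UNICODE is 16 bits wide. The sketch below reproduces that split with the UTF-16 codec rather than with array('u') itself, so it behaves the same on every platform; the helper name `utf16_code_units` is an assumption made for this illustration.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units that encode `text` in UTF-16.

    Illustrative helper: this mirrors what a 16-bit Py_UNICODE buffer
    would hold, but it does not touch the array module.
    """
    data = text.encode("utf-16-le", "surrogatepass")
    count = len(data) // 2
    return list(struct.unpack("<%dH" % count, data))


units = utf16_code_units("\U0010ffff")
# The first unit is the high surrogate, the second the low surrogate.
print([hex(unit) for unit in units])
```

A single character above U+FFFF therefore occupies two array items on such a build, which is why indexing item 0 yields only the high surrogate.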
The diff between b9558df8cc58 and default with array_revert_pep393.patch I'm not sure why typecode was originally Py_UNICODE though. |
I just copied code from Python 3.2, I forgot to update typecode type |
array_revert_pep393-2.patch looks good (checked against 7042a83f37e |
@georg: are you ok with this change? It restores the behaviour of Python 3.2 and avoids having to maintain an API that nobody wants to use ('u' format using Py_UCS4, 32 bits unsigned). |
New changeset 95da47ddebe0 by Victor Stinner in branch 'default': |
Oops, the initial issue is not solved. Attached fixes the array == memoryview issue by using a valid format for the buffer. |
Hum, this issue is a regression from Python 3.2. I would like to see it fixed in Python 3.3. Example:
Python 3.2.3+ (3.2:243ad1a6f638+, Aug 4 2012, 01:36:41)
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
>>> import array
>>> a=array.array('u', 'xyz')
>>> b=memoryview(a)
>>> a == b
True
>>> b == a
True |
Victor: the revert commit brought back "Python's Unicode character type" into the docs. This needs to be fixed to say "legacy" somewhere, as the characters in a normal Unicode string are not of that type anymore. |
STINNER Victor <report@bugs.python.org> wrote:
> Hum, this issue is a regression from Python 3.2.
>
> Python 3.2.3+ (3.2:243ad1a6f638+, Aug 4 2012, 01:36:41)
> [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
> >>> import array
> >>> a=array.array('u', 'xyz')
> >>> b=memoryview(a)
> >>> a == b
> True
> >>> b == a
> True
[3.3 returns False]
That's actually deliberate. The new memoryview does not consider arrays equal
Python 3.2a0 (py3k:76143M, Nov 7 2009, 17:05:38)
[GCC 4.2.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> a = array.array('f', [0])
>>> b = array.array('i', [0])
>>> x = memoryview(a)
>>> y = memoryview(b)
>>>
>>> a == b
True
>>> x == y
True
>>>
I think that (for buffers at least) an array of float should not compare
See also the comment in the documentation for memoryview.format: http://docs.python.org/dev/library/stdtypes.html#memoryview-type
memoryview is not aware of the 'u' format code, since it's not part of
Now in your example I see that array's getbufferproc actually already uses |
Also, it was suggested that 'u' should be deprecated: http://mail.python.org/pipermail/python-dev/2012-March/117392.html Personally, I don't have an opinion on that; I don't use the 'u' Nick, could you have a look at msg167545 and see if any action |
Of course, if two formats *are* the same, it is possible to use |
Perhaps if memoryview doesn't understand the format code, it can fall back on memcmp() if strcmp() indicates the format codes are the same? Otherwise we're at risk of breaking backwards compatibility with more than just array('u'). Also, if it isn't already, the change to take format codes into account in memoryview comparisons should be mentioned in the What's New porting section. |
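The strcmp()/memcmp() fallback suggested above can be approximated at the Python level to show the idea. This is only a sketch under assumptions: `raw_equal` is a hypothetical name, it is not the patch discussed in this issue, and it compares format strings and shapes textually before comparing raw bytes, without trying to interpret the format code.

```python
from array import array


def raw_equal(x, y):
    """Compare two buffer exporters without interpreting their format codes.

    Hypothetical Python analogue of "strcmp() on the formats, then memcmp()
    on the data"; the real comparison lives in C inside memoryview.
    """
    mx, my = memoryview(x), memoryview(y)
    # Identical format strings and shapes are required before looking at data.
    if mx.format != my.format or mx.shape != my.shape:
        return False
    # Byte-for-byte comparison of the underlying memory.
    return mx.tobytes() == my.tobytes()


print(raw_equal(array("i", [1, 2, 3]), array("i", [1, 2, 3])))
print(raw_equal(array("f", [0]), array("i", [0])))
```

With this approach two buffers whose formats differ never compare equal, even when their bytes happen to match, which is the trade-off the comment above is weighing against backwards compatibility.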
Did you see attached patch array_unicode_format.patch? It uses struct |
I totally overlooked that. Given that memoryview can be fixed to The only advantage for memoryview would be that tolist() etc. If we're deprecating 'u' and 'w' anyway, the getbufferproc should |
I think Victor's patch is a good solution to killing the 'u' and 'w' exports in 3.4, but we need to restore some tolerance for unknown format codes to memoryview in 3.3 regardless. |
I have a patch already for the unknown format codes in memoryview. |
Someone broke the Windows buildbots. |
New changeset e0f3406c43e4 by Victor Stinner in branch 'default': |
New changeset 67a994d5657d by Victor Stinner in branch 'default': |
And the test fails on machines without ctypes :) |
New changeset 4ee4cceda047 by Victor Stinner in branch 'default': |
Deferring. |
Is there anything that still needs to be done on this issue? ISTM that the code is correct as it stands (i.e. getting a buffer now only uses valid format codes). |
There's still work to be done. The current status in 3.3 trunk is that:
Wide build:
>>> memoryview(array("u")).format
'w'
Narrow build:
>>> memoryview(array("u")).format
'u'
Neither of these is a valid struct format, so they don't play nicely with the assumptions of memoryview (or any other PEP-3118 consumer). Stefan's memoryview changes are needed because there are *valid* struct formats that memoryview doesn't understand (yet), but it's only coincidental that they will reduce the severity of this problem. Victor's latest patch switches the 'w' and 'u' for the appropriate integer sizes 'I' and 'H', which I think is an excellent approach. There are also the post-reversion documentation changes Georg requested to bring the docs back into line with PEP-393.
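The idea of exporting an unsigned integer code whose size matches the character width can be sketched as follows. The function name `unsigned_format_for` is hypothetical and the lookup order is an assumption; the actual patch makes this choice in C at build time rather than by probing struct at run time.

```python
import struct


def unsigned_format_for(itemsize):
    """Return a native unsigned struct code whose size equals `itemsize`.

    Illustrative only: tries the standard unsigned codes in order of
    increasing size and returns the first one that matches.
    """
    for code in ("B", "H", "I", "L", "Q"):
        if struct.calcsize(code) == itemsize:
            return code
    raise ValueError("no unsigned struct code with size %d" % itemsize)


# 2 bytes corresponds to a narrow build's code units,
# 4 bytes to a wide build's (or Py_UCS4) code points.
print(unsigned_format_for(2), unsigned_format_for(4))
```

A consumer that receives such a code sees plain integers rather than characters, so it can at least unpack and compare the items with the struct module.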
Why do you say that? They have been added by PEP-3118 (and are If you think that their mentioning in PEP-3118 is a mistake, If these codes are indeed meant to be in the struct module, I agree that it is then desirable that the memoryview object |
Adding a link to bpo-15625, which is discussing the other end of this issue (whether or not memoryview should support 'u' as a typecode). |
Based on the discussion in bpo-15625, it seems that the consensus is to take no action on the format codes in this issue for 3.3, and reconsider in 3.4, to determine in what way the struct module should support Unicode. Instead, the 'u' array code will be deprecated, in the same way in which the rest of the Py_UNICODE API is deprecated. |
If everyone agrees on deprecating 'u', here's a patch. I think |
I think a proper deprecation warning is preferable. |
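For readers unfamiliar with what a programmatic deprecation looks like, here is a minimal sketch using the standard warnings module. The helper `check_typecode`, the `DEPRECATED_TYPECODES` set, and the message text are all assumptions made for illustration; they are not taken from array_deprecate_u.diff.

```python
import warnings

# Assumption for this sketch: only 'u' is treated as deprecated.
DEPRECATED_TYPECODES = {"u"}


def check_typecode(typecode):
    """Warn if `typecode` is deprecated, then return it unchanged.

    Hypothetical helper; the array module performs no such check through
    this function.
    """
    if typecode in DEPRECATED_TYPECODES:
        warnings.warn(
            "typecode %r is deprecated" % typecode,
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
    return typecode


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_typecode("u")
print([str(w.message) for w in caught])
```

DeprecationWarning is hidden by default in many contexts, which is part of why the thread weighs a documentation-only deprecation against a runtime warning.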
I guess the analogy with bytes objects is that UCS-2 code points can be If we're going to do a programmatic deprecation now, that's the only |
I don't understand. If you want to handle 16-bit integers, you already |
Since actual removal is only scheduled for 4.0, I think user warnings By then, we should have sorted out the struct format codes. In this |
Stefan, your patch array_deprecate_u.diff is fine. If you get to it, please also rephrase the clause "Python's unicode type"; not sure what the convention is to refer to Py_UNICODE now (perhaps "historical unicode type"). |
New changeset 9c7515e29219 by Stefan Krah in branch 'default': |
Good, I think this can be closed then. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.