Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify version parsing example #571

Closed
wants to merge 2 commits into from
Closed

Conversation

techtonik
Copy link
Contributor

This removes regexp and adds import os.

File is read in cp437 because this encoding has no missing code points while converting to utf-8 string. Only ASCII set is used anyways.

This removes regexp and adds `import os`.

File is read in `cp437` because this encoding has no missing code points while converting to `utf-8` string. Only ASCII set is used anyways.
@theacodes
Copy link
Member

Thanks for doing this, @techtonik, but I find the new one much more difficult to read. Would it be possible to reduce the amount of nesting?

@techtonik
Copy link
Contributor Author

techtonik commented Nov 13, 2018

@theacodes nothing I can think of. It is either flatness or simplicity. For people who haven't learn what a regular expression is, this is already an advantage.

@techtonik
Copy link
Contributor Author

techtonik commented Nov 13, 2018

On the second thought my similar improvement for pip looks lighter - pypa/pip#6004 - but I still don't understand the logic behind codecs.open - documentation doesn't say anything about which encoding is used, so I guess iteration will fail if I have Russian comments in cp1251 in version file.

@theacodes
Copy link
Member

Yeah, let's go with pip's version.

@pradyunsg
Copy link
Member

There's a few changes we should make there IMO like using os.path.join and I think that currently (untested) that'll gives us the the string with quotes, so those have to be trimmed.

I'm super short on time, so can't write proper review comments on the PR, sorry!

@techtonik
Copy link
Contributor Author

@pradyunsg why use os.path.join in this specific case? It doesn't fail on Windows.

@techtonik
Copy link
Contributor Author

@theacodes can you explain what encoding is used by codecs.open to make sure it won't fail on files that contain Russian comments?

@theacodes
Copy link
Member

I honestly have no idea. I don't think it's possible for us to specify a general solution that'll work in every single case, so we should probably just use utf-8 but note that you might need to change it.

@pfmoore
Copy link
Member

pfmoore commented Nov 21, 2018

As it's only intended as example code, and it's reading a file that you create yourself, I think it's reasonable to not worry too much about encodings. If you're only supporting Python 3.2+, tokenize.open() respects an encoding declaration in the source file, which this code doesn't. But is it that important?

If you're not comfortable writing your own code based on the supplied example, I'd argue that you should probably be sticking to the simpler technique of just maintaining the version number in setup.py, as suggested in the tutorial anyway.

Basically, I agree, this change isn't much of an improvement, so I'd prefer to stick with what we have.

@techtonik
Copy link
Contributor Author

Ok. Let me tell you what happens with this code if it is run on Greek Windows when my version file contains Russian comment. The code from this guide will fail.

https://bitbucket.org/techtonik/hexdump/pull-requests/1/since-the-input-file-is-a-specific/diff

Needless to say that having a guide with magic that writers themselves don't understand says something about the language. Explicit cp437 is the best choice. I can explain.

@pfmoore
Copy link
Member

pfmoore commented Nov 21, 2018

So if you want to use a Russian comment, modify the code to handle it. Just add an explicit encoding to codecs.open. What's the problem?

Regarding "writers themselves don't understand", frankly few of us still recall all the bizarre contortions needed for Python 2/3 compatibility. I'd happily say replace codecs.open() with open() and leave people wanting Python 2 compatibility to work out how to do it. But you'd then have to change it back to use codecs.open, because Python 2's open doesn't support an encoding. So how's that an improvement?

@techtonik
Copy link
Contributor Author

So if you want to use a Russian comment, modify the code to handle it. Just add an explicit encoding to codecs.open. What's the problem?

The problem is that it is very hard to debug, non-intuitive and requires a sacred knowledge that doesn't hold in ones head as this thread shows. Python doesn't give any hints to make the inference process easier.

Regarding "writers themselves don't understand", frankly few of us still recall all the bizarre contortions needed for Python 2/3 compatibility.

It is only about Python 3. Python 2 is irrelevant, because it does not autoconvert files with occassional system encoding on reading. Replacing open() with codecs.open() doesn't fix the bug I linked above. Knowing encoding beforehand doesn't make the snippet universal. Either reading in binary or in universal encoding without missing code points to convert ASCII version to string is a universal solution.

@ncoghlan
Copy link
Member

The simplest and most appropriate clarification here is to just pass an explicit encoding='utf-8' to codecs.open in the existing example (since the Python docs are indeed unclear on what that defaults to in Python 3.x, in 2.7 it defaults to ASCII which isn't helpful for files that aren't encoded that way, and explicitly including the argument makes it clear where to specify the encoding if you use a non-utf-8 encoding for the file containing the version metadata in your particular project).

A comment could also be included saying that builtin open() can be used directly in projects that don't support Python 2.x at all (since 2.x support is the only reason the example uses codecs.open instead).

To avoid problems with unknown encoding, file with __version__
is processed as binary and the final number is converted to string
just before the function returns.
@techtonik
Copy link
Contributor Author

Until codecs.open documentation is clear I decided to parse everything as binary. This works Python 2.7 and Python 3.x the same way. As long as version itself is an ASCII string, everything else in a file doesn't matter.

@theacodes
Copy link
Member

I'm so sorry for letting this PR languish for so long.

Until codecs.open documentation is clear I decided to parse everything as binary. This works Python 2.7 and Python 3.x the same way. As long as version itself is an ASCII string, everything else in a file doesn't matter.

I'm not completely comfortable with this. I do like @ncoghlan's suggestion of using codecs.open and explicitly passing utf-8 with a comment noting that differing encodings may cause trouble.

@techtonik
Copy link
Contributor Author

@theacodes I am not comfortable with undocumented codecs.open behavior either.

@theacodes
Copy link
Member

Help me understand what undocumented behavior codecs.open presents? @ncoghlan and I are suggesting the following:

# Open the version file using utf-8 encoding. If you are using different encoding for your source
# file, change the encoding to match. See https://www.python.org/dev/peps/pep-0263/ for more
# details.
codecs.open(version_file, "r", encoding="utf-8"):

@techtonik
Copy link
Contributor Author

techtonik commented Feb 10, 2019

@theacodes the undocumented behavior is what happens if a person left encoding parameter for codecs.open unspecified. 45 minutes is a price that is too high for a missing param.

Behind the scenes codecs.open is opening a file in binary mode, so I don't see why not to just use binary matching without additional wrapper. Do you have any arguments?

techtonik added a commit to techtonik/setuptools that referenced this pull request Feb 10, 2019
This allows people who use setuptools to maintain project versions
in a single place, for example as attributes in their Python source.

Closes pypa#1316
Closes pypa/pip#6004
Closes pypa/packaging.python.org#571
@techtonik
Copy link
Contributor Author

I submitted the function to setuptools pypa/setuptools#1679

Spent 2 more hours polishing the code that I thought is final. If merged, it would be the most simple way to fetch version string, because I believe pip ships with setuptools by defaut.

@theacodes theacodes closed this Jul 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants