Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for unicode Git tags/branches #2997

Closed
wants to merge 2 commits into from

Conversation

agjohnson
Copy link
Contributor

This replaces CSV parsing of the data, as CSV doesn't support unicode and is
mostly hacky.

This replaces CSV parsing of the data, as CSV doesn't support unicode and is
mostly hacky.
@agjohnson agjohnson added the PR: work in progress Pull request is not ready for full review label Jul 11, 2017
@agjohnson agjohnson requested a review from ericholscher July 11, 2017 03:41
@agjohnson agjohnson added PR: ready for review and removed PR: work in progress Pull request is not ready for full review labels Jul 11, 2017
Copy link
Member

@ericholscher ericholscher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good change. I worry that it's going to change the behavior of how we handle branches, so we might end up with weird cases where active tags/branches are getting parsed differently, and end up with odd outcomes.

Definitely needs some docs that explain the approach the regex is taking, as I can't really parse it, and I imagine most people won't be able to.

Also looks like it might be breaking on Python 3, from the test failure, it seems to be ending up with an empty branch name, I think?

(?:\n|$)
''',
(re.VERBOSE | re.MULTILINE)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These could use some comments that explain them.

clean_branches.append(VCSVersion(self, branch, slug))
return clean_branches
branches = []
for match in self.BRANCH_REGEX.finditer(data):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we're depending on some regex magic here that isn't easy to understand. It would be good to document exactly how this is working. Are we still handling empty branches, and * names? Are they getting filtered out with regex magic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is meant by empty branches or * names. I only replicated what the csv parser was doing, no regex magic here, just basic regex to split on whitespace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous code was doing branch = [f for f in branch if f != '' and f != '*'], which seemed to be filtering out things?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add some test cases for this. I missed the '*', but empty is likely not being caught anyways.

@agjohnson
Copy link
Contributor Author

Errors aren't from regex, they are more byte conversion bugs.

Copy link
Contributor

@Code0x58 Code0x58 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of using using a library for interfacing with Git. The TODO mentioned GitPython which I recall being handy, but the limitations section of the documentation says it may introduce memory leaks, so would take some looking into/testing before anyone would want it running under long running gunicorn/celery processes.

branches = []
for match in self.BRANCH_REGEX.finditer(data):
branch = match.group('branch')
if branch.startswith('origin/HEAD'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

origin/HEAD-this-is-not would match, although unlikely, this should be an equality test.

slug = branch.replace('/', '-')
clean_branches.append(VCSVersion(self, branch, slug))
return clean_branches
# TODO consider replacing this with GitPython
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I much prefer the idea of this as it has spared me making/thinking about code to interface with the CLI in the past.

BRANCH_REGEX = re.compile(
r'''
^\s* # any amount of whitespace
(?P<branch>\w.+) # an alpha-numeric character followed by anything
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this will match all of origin/HEAD -> origin/master (from the docstring of the parse_branches), rather than splitting it. I assume that was why the CSV reader was used previously (I can only imagine why .split(' ') wasn't used).

clean_branches.append(VCSVersion(self, branch, slug))
return clean_branches
# TODO consider replacing this with GitPython
data = data.decode('utf-8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kind of thing should be as close to the source as possible, so in the this case the decode would be applied to stdout.

data = str(data)
raw_branches = csv.reader(StringIO(data), delimiter=' ')
for branch in raw_branches:
branch = [f for f in branch if f != '' and f != '*']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'*' and '' are excluded because of the regex (\w doesn't match '*', so it that split things properly then this would be okay.

@ericholscher
Copy link
Member

Believe this can be closed, as we are now using Git python for this (#4052) -- however it still doesn't support unicode, so maybe we should steal the tests from here, and do something with them?

/cc @stsewd

@stsewd
Copy link
Member

stsewd commented May 30, 2018

If the code is deployed we can try and see if the unicode support really works in production, the tests here are a little unrelated since those are for check the regex, but I borrow the unicode string from here to test #4052

@stsewd
Copy link
Member

stsewd commented May 30, 2018

Just to notice that #4052 didn't work in production, I think is related to some environment variables, python3 is the final fix p:

@agjohnson
Copy link
Contributor Author

Yup, i noted in the gitpython work that tests could be stolen from here, but gitpython is a better way to address the tag parsing. Feel free to close whenever.

@agjohnson
Copy link
Contributor Author

@stsewd also, what env variables? seems we could probably fix this in prod if env vars are the culprit. Any way to reproduce the fix?

@stsewd
Copy link
Member

stsewd commented Jun 7, 2018

@agjohnson my PR wasn't passing on the CI with py3, after reading about tox, I found that it removes a lot of environment variables and keep others, so I added this magic line f4c53c8

Here are the docs from tox https://tox.readthedocs.io/en/latest/config.html#confval-passenv=SPACE-SEPARATED-GLOBNAMES

@humitos
Copy link
Member

humitos commented Aug 16, 2018

I think we can close this PR since Unicode tags are already working on production. On the other hand, there is an specific PR for supporting unicode branches #4433 and also after the migration to Azure and Python3 next weekend we will support unicode branches without touching our code. (Python3 will solve this issue with the current code that it's deployed in production).

@stsewd stsewd closed this Aug 16, 2018
@stsewd stsewd deleted the version-unicode-fixes branch August 16, 2018 04:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants