-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for unicode Git tags/branches #2997
Conversation
This replaces CSV parsing of the data, as CSV doesn't support unicode and is mostly hacky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a good change. I worry that it's going to change the behavior of how we handle branches, so we might end up with weird cases where active tags/branches are getting parsed differently, and end up with odd outcomes.
Definitely needs some docs that explain the approach the regex is taking, as I can't really parse it, and I imagine most people won't be able to.
Also looks like it might be breaking on Python 3, from the test failure, it seems to be ending up with an empty branch name, I think?
(?:\n|$) | ||
''', | ||
(re.VERBOSE | re.MULTILINE) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These could use some comments that explain them.
clean_branches.append(VCSVersion(self, branch, slug)) | ||
return clean_branches | ||
branches = [] | ||
for match in self.BRANCH_REGEX.finditer(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we're depending on some regex magic here that isn't easy to understand. It would be good to document exactly how this is working. Are we still handling empty branches, and *
names? Are they getting filtered out with regex magic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what is meant by empty branches or *
names. I only replicated what the csv parser was doing, no regex magic here, just basic regex to split on whitespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous code was doing branch = [f for f in branch if f != '' and f != '*']
, which seemed to be filtering out things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add some test cases for this. I missed the '*', but empty is likely not being caught anyways.
Errors aren't from regex, they are more byte conversion bugs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of using using a library for interfacing with Git. The TODO mentioned GitPython which I recall being handy, but the limitations section of the documentation says it may introduce memory leaks, so would take some looking into/testing before anyone would want it running under long running gunicorn/celery processes.
branches = [] | ||
for match in self.BRANCH_REGEX.finditer(data): | ||
branch = match.group('branch') | ||
if branch.startswith('origin/HEAD'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
origin/HEAD-this-is-not
would match, although unlikely, this should be an equality test.
slug = branch.replace('/', '-') | ||
clean_branches.append(VCSVersion(self, branch, slug)) | ||
return clean_branches | ||
# TODO consider replacing this with GitPython |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I much prefer the idea of this as it has spared me making/thinking about code to interface with the CLI in the past.
BRANCH_REGEX = re.compile( | ||
r''' | ||
^\s* # any amount of whitespace | ||
(?P<branch>\w.+) # an alpha-numeric character followed by anything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this will match all of origin/HEAD -> origin/master
(from the docstring of the parse_branches), rather than splitting it. I assume that was why the CSV reader was used previously (I can only imagine why .split(' ')
wasn't used).
clean_branches.append(VCSVersion(self, branch, slug)) | ||
return clean_branches | ||
# TODO consider replacing this with GitPython | ||
data = data.decode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this kind of thing should be as close to the source as possible, so in the this case the decode would be applied to stdout.
data = str(data) | ||
raw_branches = csv.reader(StringIO(data), delimiter=' ') | ||
for branch in raw_branches: | ||
branch = [f for f in branch if f != '' and f != '*'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'*'
and ''
are excluded because of the regex (\w
doesn't match '*'
, so it that split things properly then this would be okay.
If the code is deployed we can try and see if the unicode support really works in production, the tests here are a little unrelated since those are for check the regex, but I borrow the unicode string from here to test #4052 |
Just to notice that #4052 didn't work in production, I think is related to some environment variables, python3 is the final fix p: |
Yup, i noted in the gitpython work that tests could be stolen from here, but gitpython is a better way to address the tag parsing. Feel free to close whenever. |
@stsewd also, what env variables? seems we could probably fix this in prod if env vars are the culprit. Any way to reproduce the fix? |
@agjohnson my PR wasn't passing on the CI with py3, after reading about tox, I found that it removes a lot of environment variables and keep others, so I added this magic line f4c53c8 Here are the docs from tox https://tox.readthedocs.io/en/latest/config.html#confval-passenv=SPACE-SEPARATED-GLOBNAMES |
I think we can close this PR since Unicode tags are already working on production. On the other hand, there is an specific PR for supporting unicode branches #4433 and also after the migration to Azure and Python3 next weekend we will support unicode branches without touching our code. (Python3 will solve this issue with the current code that it's deployed in production). |
This replaces CSV parsing of the data, as CSV doesn't support unicode and is
mostly hacky.