Add support for unicode Git tags/branches #2997

agjohnson · 2017-07-11T03:41:21Z

This replaces CSV parsing of the data, as CSV doesn't support unicode and is
mostly hacky.

ericholscher

Looks like a good change. I worry that it's going to change how we handle branches, so we might end up with weird cases where active tags/branches get parsed differently and produce odd outcomes.

Definitely needs some docs that explain the approach the regex is taking, as I can't really parse it, and I imagine most people won't be able to.

It also looks like it might be breaking on Python 3: from the test failure, it seems to be ending up with an empty branch name, I think?

ericholscher · 2017-07-11T15:20:41Z

readthedocs/vcs_support/backends/git.py

+        (?:\n|$)
+        ''',
+        (re.VERBOSE | re.MULTILINE)
+    )


These could use some comments that explain them.

ericholscher · 2017-07-11T15:22:49Z

readthedocs/vcs_support/backends/git.py

-                    clean_branches.append(VCSVersion(self, branch, slug))
-        return clean_branches
+        branches = []
+        for match in self.BRANCH_REGEX.finditer(data):


I feel like we're depending on some regex magic here that isn't easy to understand. It would be good to document exactly how this works. Are we still handling empty branches and * names? Are they getting filtered out with regex magic?

agjohnson · 2017-07-12

Not sure what is meant by empty branches or * names. I only replicated what the CSV parser was doing; no regex magic here, just a basic regex to split on whitespace.

ericholscher · 2017-07-12

The previous code was doing branch = [f for f in branch if f != '' and f != '*'], which seemed to be filtering things out?

agjohnson · 2017-07-13

I'll add some test cases for this. I missed the '*', but empty is likely not being caught anyway.
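To make the questions in this thread concrete, here is a minimal sketch of a commented verbose regex that handles the `*` marker, blank lines, and a unicode name. It is not the pattern from this PR (only fragments of that are quoted here); `BRANCH_LINE`, `parse_branches`, and the sample output are hypothetical names and data chosen for illustration.

```python
import re

# Hypothetical pattern for illustration; NOT the BRANCH_REGEX from this PR.
BRANCH_LINE = re.compile(
    r'''
    ^[ \t]*            # leading indentation (spaces/tabs only, never a newline)
    (?:\*[ \t]+)?      # optional "* " marker in front of the current branch
    (?P<branch>\S+)    # the branch name: first run of non-whitespace characters
    ''',
    re.VERBOSE | re.MULTILINE,
)


def parse_branches(data):
    """Return branch names from branch-listing text, skipping origin/HEAD."""
    names = []
    for match in BRANCH_LINE.finditer(data):
        name = match.group('branch')
        if name == 'origin/HEAD':
            # Symbolic ref line such as "origin/HEAD -> origin/master".
            continue
        names.append(name)
    return names


# Assumed sample input, including a blank line and a non-ASCII branch name.
sample = (
    "  origin/2.0.X\n"
    "  origin/HEAD -> origin/master\n"
    "* origin/master\n"
    "\n"
    "  origin/release/üñîçødé\n"
)
branches = parse_branches(sample)
```

Because the indentation class is `[ \t]` rather than `\s`, a blank line cannot be swallowed into the next line's match, so it simply produces no branch.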

agjohnson · 2017-07-12T22:33:29Z

The errors aren't from the regex; they are byte conversion bugs.

Code0x58

I like the idea of using a library for interfacing with Git. The TODO mentioned GitPython, which I recall being handy, but the limitations section of its documentation says it may introduce memory leaks, so it would take some looking into/testing before anyone would want it running under long-running gunicorn/celery processes.

Code0x58 · 2017-12-29T19:04:04Z

readthedocs/vcs_support/backends/git.py

+        branches = []
+        for match in self.BRANCH_REGEX.finditer(data):
+            branch = match.group('branch')
+            if branch.startswith('origin/HEAD'):


origin/HEAD-this-is-not would match; although that's unlikely, this should be an equality test.
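A small sketch of the difference, assuming the branch name has already been isolated from the rest of the line. `is_remote_head` is a hypothetical helper, not a function from the codebase.

```python
def is_remote_head(branch):
    """Return True only for the symbolic origin/HEAD ref itself."""
    return branch == 'origin/HEAD'


names = ['origin/HEAD', 'origin/HEAD-this-is-not', 'origin/master']

# The prefix test also drops a real branch that merely starts with the same text.
by_prefix = [n for n in names if n.startswith('origin/HEAD')]

# The equality test only matches the symbolic ref.
by_equality = [n for n in names if is_remote_head(n)]
```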

Code0x58 · 2017-12-29T20:09:59Z

readthedocs/vcs_support/backends/git.py

-                    slug = branch.replace('/', '-')
-                    clean_branches.append(VCSVersion(self, branch, slug))
-        return clean_branches
+        # TODO consider replacing this with GitPython


I much prefer the idea of this as it has spared me making/thinking about code to interface with the CLI in the past.

Code0x58 · 2017-12-29T20:19:35Z

readthedocs/vcs_support/backends/git.py

+    BRANCH_REGEX = re.compile(
+        r'''
+        ^\s*                        # any amount of whitespace
+        (?P<branch>\w.+)            # an alpha-numeric character followed by anything


So this will match all of origin/HEAD -> origin/master (from the docstring of the parse_branches), rather than splitting it. I assume that was why the CSV reader was used previously (I can only imagine why .split(' ') wasn't used).
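A quick check of this point, using only the two pattern lines quoted above (the rest of the PR's regex is not shown here, so the full pattern may behave differently). `split_symref` is a hypothetical helper showing one way to separate the name from its target.

```python
import re

# Only the quoted fragment of the pattern; the remainder is not reproduced.
QUOTED = re.compile(r'^\s*(?P<branch>\w.+)', re.MULTILINE)

line = '  origin/HEAD -> origin/master'
captured = QUOTED.search(line).group('branch')
# `.+` is greedy, so the capture runs to the end of the line.


def split_symref(text):
    """Split 'name -> target' into (name, target); target is None if absent."""
    name, sep, target = text.partition(' -> ')
    return name, (target if sep else None)


head = split_symref(captured)
plain = split_symref('origin/master')
```

With the name split off first, an equality comparison against `origin/HEAD` becomes possible.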

Code0x58 · 2017-12-29T20:26:15Z

readthedocs/vcs_support/backends/git.py

-                    clean_branches.append(VCSVersion(self, branch, slug))
-        return clean_branches
+        # TODO consider replacing this with GitPython
+        data = data.decode('utf-8')


I think this kind of thing should be as close to the source as possible, so in this case the decode would be applied to stdout.
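As a rough sketch of decoding at the source: the project's own command runner is not shown in this thread, so `run_decoded` below is a hypothetical stand-in built on the standard library, and the child process just emits fixed UTF-8 bytes in place of real Git output.

```python
import subprocess
import sys


def run_decoded(cmd):
    """Run cmd and return (returncode, stdout, stderr) already decoded to text."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Decode once, right where the bytes enter the program, so every
    # parser downstream only ever sees text.
    return (
        proc.returncode,
        proc.stdout.decode('utf-8'),
        proc.stderr.decode('utf-8'),
    )


# Stand-in for a Git command: writes the UTF-8 bytes for "origin/ü\n".
CHILD = r"import sys; sys.stdout.buffer.write(b'origin/\xc3\xbc\n')"
code, out, err = run_decoded([sys.executable, '-c', CHILD])
```

Parsing functions can then drop their own `data.decode('utf-8')` calls, which avoids decoding twice or forgetting to decode on one path.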

Code0x58 · 2017-12-29T20:28:55Z

readthedocs/vcs_support/backends/git.py

-            data = str(data)
-        raw_branches = csv.reader(StringIO(data), delimiter=' ')
-        for branch in raw_branches:
-            branch = [f for f in branch if f != '' and f != '*']


'*' and '' are excluded because of the regex (\w doesn't match '*'), so if that split things properly then this would be okay.
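For reference, this is roughly what the removed CSV-based filtering does to each line, run under Python 3. The sample text is assumed, and `old_style_fields` is a hypothetical wrapper around the quoted lines rather than the original method.

```python
import csv
from io import StringIO


def old_style_fields(data):
    """Split each line on single spaces, then drop '' and '*' fields."""
    rows = []
    for row in csv.reader(StringIO(data), delimiter=' '):
        fields = [f for f in row if f != '' and f != '*']
        rows.append(fields)
    return rows


# Assumed sample of branch-listing output.
data = (
    "  origin/2.0.X\n"
    "* origin/master\n"
    "  origin/HEAD -> origin/master\n"
)
rows = old_style_fields(data)
```

The indentation shows up as empty fields and the current-branch marker as a `'*'` field, which is what the list comprehension removes; the symbolic ref line is left as three separate fields.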

ericholscher · 2018-05-30T21:16:04Z

I believe this can be closed, as we are now using GitPython for this (#4052). However, it still doesn't support unicode, so maybe we should steal the tests from here and do something with them?

/cc @stsewd

stsewd · 2018-05-30T21:20:31Z

Once the code is deployed, we can check whether the unicode support really works in production. The tests here are a little unrelated, since they check the regex, but I borrowed the unicode string from here to test #4052.

stsewd · 2018-05-30T22:59:43Z

Just to note that #4052 didn't work in production. I think it's related to some environment variables; python3 is the final fix p:

agjohnson · 2018-06-07T16:01:53Z

Yup, I noted in the GitPython work that tests could be stolen from here, but GitPython is a better way to address the tag parsing. Feel free to close whenever.

agjohnson · 2018-06-07T16:04:03Z

@stsewd also, what env variables? seems we could probably fix this in prod if env vars are the culprit. Any way to reproduce the fix?

stsewd · 2018-06-07T16:09:13Z

@agjohnson my PR wasn't passing on CI with py3. After reading about tox, I found that it removes a lot of environment variables and keeps others, so I added this magic line: f4c53c8

Here are the docs from tox https://tox.readthedocs.io/en/latest/config.html#confval-passenv=SPACE-SEPARATED-GLOBNAMES

humitos · 2018-08-16T01:54:42Z

I think we can close this PR, since Unicode tags are already working in production. There is also a specific PR for supporting unicode branches (#4433), and after the migration to Azure and Python 3 next weekend we will support unicode branches without touching our code. (Python 3 will solve this issue with the code that is currently deployed in production.)

Add support for unicode Git tags/branches

29ef38b

This replaces CSV parsing of the data, as CSV doesn't support unicode and is mostly hacky.

agjohnson added the PR: work in progress label Jul 11, 2017

agjohnson requested a review from ericholscher July 11, 2017 03:41

agjohnson added PR: ready for review and removed PR: work in progress labels Jul 11, 2017

ericholscher requested changes Jul 11, 2017

Fix byte conversion and add docs

06c4e4b

Code0x58 reviewed Dec 29, 2017

agjohnson mentioned this pull request Mar 23, 2018

Simplify vcs_support backend git by using GitPython #3839

Closed

stsewd mentioned this pull request May 7, 2018

Use gitpython for tags #4052

Merged

agjohnson added Status: invalid and removed PR: ready for review labels Jun 7, 2018

agjohnson removed the Status: invalid label Jun 8, 2018

humitos mentioned this pull request Jul 26, 2018

Support git unicode branches #4433

Merged

stsewd closed this Aug 16, 2018

stsewd deleted the version-unicode-fixes branch August 16, 2018 04:22
