Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consolidate our git branch/merge strategy #2011

Closed
abrokenjester opened this issue Mar 18, 2020 · 22 comments · Fixed by #2053, #2060 or #2583
Closed

Consolidate our git branch/merge strategy #2011

abrokenjester opened this issue Mar 18, 2020 · 22 comments · Fixed by #2053, #2060 or #2583
Assignees
Labels
🔧 internal task a project workflow / setup task not directly impacting end users

Comments

@abrokenjester
Copy link
Contributor

abrokenjester commented Mar 18, 2020

Three different merge strategies for branches / pull requests are currently available on the RDF4J repository:

Screenshot from 2020-03-18 13-43-05

Unfortunately, we are currently not consistent among ourselves what merge strategy should be used in what situation, and this makes our commit history a little messy.

In particular, I've observed several patterns in our commit history. For PRs merged by myself , using rebase merge results in a clean linear history on the master branch, for example:

Screenshot from 2020-03-18 13-55-30

While cleanly linear, a downside is that the link with the original PR is lost (though that can be retrieved by visiting the issue on Github, if necessary).

However, for PRs by several others, 'create a merge commit' is used, which results in pair of commits (one of them a merge commit) on the master branch history, instead of a single commit. For example:

Screenshot from 2020-03-18 13-57-46

In this approach the link with the original PR is retained in the commit history itself, but it does make the history 'noisy' and less straightforward to read.

Possible approaches to more consistency

While either approach to merging is workable in and of itself, having both used interchangeably breeds confusion, especially in trying to figure out where things came from, and what is included on which branch. I see two options to resolve this:

Option 1: use 'create a merge commit' exclusively

We could accept that a completely clean/linear history is not worth the hassle and switch back to using merge commit exclusively to merge PRs.

Consequences of this approach would be that we disable both the 'squash and merge' as well as the 'rebase' option. It means a slightly messier history, but provided we still squash PRs before merging, it's manageable.

This raises a potential problem: even if we among ourselves agree to "be good" and properly squash, third party contributors could be unaware of this and use merge commits on their branch before putting up a PR. I'd hate to get in a situation where someone has gone through the effort of providing a patch and we can't accept it because their PR is hard to reconcile.

Option 2: use 'rebase and merge' exclusively for feature PRs

We could accept that making a bit of effort to get a clean linear history in which each commit message immediately tells why a particular thing was changed is worth it. In this scenario, I'd suggest we use rebase merge exclusively to merge feature branches into master/develop, though we'd continue to use merge commits to keep master and develop in sync (purely because I've found that using rebase for this is a recipe for trouble, especially when conflicts occur).

In this scenario, we could optionally keep 'squash and merge' available, but only to be used as a backup option, when normal squashing and rebasing somehow are problematic due to conflicts. This would also make accepting third party PRs a little easier, even if they cannot easily be "manually" squashed.

On squashing

Regardless of which strategy we choose, we still need to squash branches before merging! Ideally when a feature branch is done it should not contain more than one or two commits, each with a meaningful commit message that includes the issue number and a specific description. Since, in either approach, you will be required to squash manually (as the squash and merge option will be either disabled or allowed only as a backup option), you need to make sure that when you sync your feature branch with the base branch to bring it up to date, you use rebase, not a merge commit! The reason for this is that if you use merge commits on your feature branch, squashing becomes a lot more difficult.

Discussion

I am looking for input from @hmottestad, @barthanssens, @aschwarte10, and anyone else who regularly contributes to this repository. I am fine with either approach, personally, as long as we are consistent about what we do. Please vote on the two options by adding an emoji reaction below (if you don't care / have no preference, please vote on both so I know you've seen it). Further feedback and comments, including alternative approaches, also welcome of course.

@abrokenjester abrokenjester added needs discussion 🔧 internal task a project workflow / setup task not directly impacting end users labels Mar 18, 2020
@abrokenjester abrokenjester self-assigned this Mar 18, 2020
@abrokenjester
Copy link
Contributor Author

React to this comment with 👍 if you prefer option 1: 'merge commits'.

@abrokenjester
Copy link
Contributor Author

abrokenjester commented Mar 18, 2020

React to this comment with 👍 if you prefer option 2: 'rebase'.

@hmottestad
Copy link
Contributor

hmottestad commented Mar 18, 2020

I’m still not sure what is wrong with “squash and merge”?

@abrokenjester
Copy link
Contributor Author

abrokenjester commented Mar 18, 2020

I’m still not sure what is wrong with “squash and merge”?

To be honest I keep going back and forth on this myself. The main problem I see with it is that it puts the responsibility of creating a meaningful commit message on the person merging the PR, rather than the author of the PR. Now, you can argue that since 9 out of 10 times you merge your own PRs, it's fine for you to use it, and up to a point that's true. I just think that it should be the responsibility of the PR author to make sure the PR is ready to be merged by any maintainer, so that we can approve the final version of the PR (including all commit messages), and only then merge.

But perhaps I'm being too precious about this. I can't deny the convenience of squash and merge. If people prefer, I'm happy to modify option 2 slightly to "use either rebase or squash and merge as appropriate".

@hmottestad
Copy link
Contributor

How does the github "rebase and merge" handle it if you have previously merged master into your branch and fixed some conflicts in that merge? If I do that locally and try to rebase onto master, it's like those conflicts were never fixed in the first place.

@abrokenjester
Copy link
Contributor Author

How does the github "rebase and merge" handle it if you have previously merged master into your branch and fixed some conflicts in that merge? If I do that locally and try to rebase onto master, it's like those conflicts were never fixed in the first place.

That's likely because you are mixing merge and rebase. Let me try and illustrate with some bad ASCII art examples. Disclaimer: I am not claiming full technical accuracy here on how rebase and merge actually work, I'm just illustrating the effects.

Let's say we have commit A, which is the head of the master branch, and commit B, which is the head of your feature branch:

(master) - A  
           |     
           +----> B - (feature)

Now, on master a new commit is done, and on your feature branch another commit is done as well:

           A  
           |      
           +----> B 
           |      |
(master) - C      D - (feature)

Now suppose you want to sync your feature branch up with master, but C and D conflict. You have two ways to sync: merge, or rebase.

Let's look what happens when you merge, first:

           A  
           |      
           +----> B 
           |      |
(master) - C      D 
           |      |
           +----> E - (feature) 

E is a new commit (a merge-commit) on your feature branch, as part of this commit you have resolved the conflict.

Now imagine it's time to merge your feature branch back to master. We use rebase to do this. How rebase works is that it looks at all the commits on your branch and tries to replay them, one by one, on the target branch. So focusing just on the master branch, rebase starts off by adding commit B:

           A  
           |      
           C
           |
           B

So far so good. Next is commit D. Problem. D conflicts with C. We have previously resolved that conflict in commit E, but we're not there yet - we're rebasing so we're first trying to apply D, and run into the same conflict you resolved earlier, again. This, or likely some variant of it, is where you run into trouble.

Now let's step back to the point where you want to sync your feature branch with master, but this time, use rebase. We have this situation:

           A  
           |      
           +----> B 
           |      |
(master) - C      D - (feature)

If we use rebase instead of merging, we are not creating a new merge commit. Instead, we are just shifting pointers. Effectively we are "rewriting history" to say we never branched off after commit A, but only after commit C:

           A  
           |
(master) - C
           |
           +----> B 
                  |
                  D - (feature)

But wait! Didn't C and D conflict? Why yes they did. So, in the course of this rebase, this is detected, and you resolve the conflict as part of the rebase operation. Typically what happens is that the "original" commit D is replaced with a "conflict-resolved" commit, let's call it D2:

           A  
           |
(master) - C
           |
           +----> B 
                  |
                  D2 - (feature)

Now, when you are are ready to rebase your feature branch back into master, you just get:

           A  
           |      
           C
           |
           B
           |
(master) - D2

No need to resolve the conflict again, because D2 already resolves the conflict.

Sorry for the long-winded explanation, but I hope this effectively illustrates why, if you choose to use rebase, you should use it consistently, and not mix it up with the occasional merge commit (even if it is just to sync up your feature branch with master).

@hmottestad
Copy link
Contributor

So in short it doesn’t.

If we go with rebase we also commit to using rebase + force push instead of a merge to “sync up” to the underlying branch.

Here we also give up collaboration on a branch. Where two people work on a branch together.

I’ve used rebase at work for around 3 years. Pretty much everyone messes it up the first time they use it. I have accidentally force pushed master, and others have accidentally rebased master onto their own branch. It’s a great way to clean up a branch though and a great way to have a nice clean history.

Personally I don’t care that much about having a nice clean history, especially since we aren’t cherry picking much anymore. The most important thing git brings for me is the ability to see the history for a line of code in a file to see why it came to be. Having good commit messages, small commits and references to the GitHub issue are the most important here. Squashing all commits and squash-and-merge both ruin this for any PRs larger than a minor bug fix.

@abrokenjester
Copy link
Contributor Author

abrokenjester commented Mar 19, 2020

If we go with rebase we also commit to using rebase + force push instead of a merge to “sync up” to the underlying branch.

Correct.

Here we also give up collaboration on a branch. Where two people work on a branch together.

True. But then again I feel we don't do this very often in the first place - most of the time in RDF4J, a feature branch is maintained by a single developer. And in situations where there is a need to cooperate on a feature, you could coordinate that by having each collaborator have their own branch, and just regularly rebase between those two branches. Here's a good article that explains this workflow: https://www.oddbird.net/2017/1/5/git-rebase/

Having said all that, I do agree with you that this is a model that we can only adopt if everyone understands it and is on board with it. For that reason alone (and especially since, as an open source project, we should make the threshold for third party contributions as low as possible) it might be wiser to stick to merge commits.

Personally I don’t care that much about having a nice clean history, especially since we aren’t cherry picking much anymore. The most important thing git brings for me is the ability to see the history for a line of code in a file to see why it came to be. Having good commit messages, small commits and references to the GitHub issue are the most important here. Squashing all commits and squash-and-merge both ruin this for any PRs larger than a minor bug fix.

Like I said, regardless of which merge strategy we use, I'd still advocate using squashing a feature branch before final merge. Feature branches typically contain loads of little commits with messages like 'typo' or 'fixed testcase blah' or whatever - which is perfectly fine to have during feature development, but once the feature is done, such commits provide no value, and in fact hamper figuring out what commits belong to which feature when, for example, git-blaming something. The thing that is important is the link to the issue and a short description of the overall feature/fix.

Note that squashing doesn't necessarily mean "squash everything into one big commit". It can be perfectly fine to squash down to two, three, heck even ten commits if it makes sense to keep distinguishing them. As long as they are all linked to the issue and descriptive I have no problem with it.

Glancing at the votes, it appears we are tied, I make it 3-3. Bart and myself are on the fence, Andreas is in favor of rebase, Havard is in favor of going back to merge commits.

Looks like I may have to come down off that fence and force a decision. Given current insights I am slightly leaning in favor of going back to merge-commits with branch squashing as appropriate before final PR merge. Any further insights? Anyone wants to change their opinion based on this discussion?

@abrokenjester
Copy link
Contributor Author

Alright. Thanks everyone for your input.

We'll go back to merge commits only. I'll disable the 'rebase merge' and 'squash and merge' options so we have no further confusion.

I will start encouraging everybody to be a little more diligent about writing meaningful commit messages and including the issue link, and to make sure trivial "fixed typo" type commits are squashed. Please don't take it personally if I ask you to squash and clean up before we merge a PR.

@abrokenjester
Copy link
Contributor Author

abrokenjester commented Mar 31, 2020

I'm sorry to keep going back and forth on this, but: I'm realizing that I really do miss the convenience of having a clean history to use in blaming/bisecting, or indeed cherry-picking (not something we do much now, but it might become relevant again if we get closer to a major release, where we have to keep an older release maintained in parallel). This is perhaps not something that's very relevant to individual committers, but I find it really useful as a project maintainer, to keep track of things.

I realize, however, that asking contributors to squash their branches manually is just too much of a hindrance. Nor do I want to mandate that people always use git rebase to keep their feature branches up to date.

Given that most of you are OK with any strategy we choose, as long as we choose a strategy, how's this, as a compromise:

  1. For pull requests that are ready to be merged (approved, no conflicts), we use the following strategy:

    • by default we use "Squash and merge" to merge feature PRs. When doing this, we use the Github UI to edit the commit message if necessary, making sure it starts with the issue number, and has a meaningful title and description (ideally with a bullet list detailing the changes - something which the squash and merge button automatically creates in most cases).
    • if you want to use "rebase and merge" to merge the PR rather than squash-and-merge, that's fine, but only do so if the number of commits in the PR is not above, say, three or four, and all of them are meaningfully described (issue number, explanatory text) and self-contained (e.g. it's not the case that commit 1 introduces a compile error that commit 2 then fixes).
    • we do not use merge commits to merge feature PRs.
  2. for keeping the master and develop branches in sync with each other, we continue using merge commits.

  3. If you want to use manual squashing/rebasing while working on your feature branch, then you can do so (keeping in mind that you really shouldn't if you are working on a collaborative branch). But it's also fine to just use merge commits to keep your feature branch up to date. Your choice.

We still use meaningful commit messages with the github issue number in them as much as possible. That doesn't change. If we do this right, it will make the chore of editing the commit message at PR merge time a lot simpler.

I was previously against using "Squash and merge", mostly because I thought it would encourage sloppy habits on a feature branch. But the fact that it's a form of merge that allows us to have a clean history without requiring that people manually squash and rebase all the time (which @hmottestad rightly said most people get wrong a lot), I think it makes sense as a compromise.

Like I said, I apologize for continuing to change direction on you. However, unless I'm much mistaken, this strategy actually makes your life a little easier as well, because there is no longer a need to squash/rebase your PRs manually if you don't want to. All I'm asking is that you're a little careful to pick the right strategy when merging your PR, and make sure the resulting commit has a descriptive message.

@hmottestad, @barthanssens , @aschwarte10 I'm looking at you in particular since you are the most prolific PR authors: is what I'm proposing clear, and workable? And not to pressure you but I'm really hoping that you say yes and I can finally shut up about this :)

@hmottestad
Copy link
Contributor

I don think "we do not use merge commits to merge feature PRs." should be a hard requirement. I actually think merge commits are a gazillion times better than rebase-and-merge. Why github doesn't distinguish between "commits that are on other branches where the branch was merged" and "commits on this branch" in their view is beyond me. I use Sourcetree for most of my git work, and it's a lot nicer to see a branch being merged into master rather than seeing 4 commits on master.

I like the "squash and merge" button for smaller things since you can't mess it up. Press the button, it squashes and merges, no extra conflicts or problems or ways to mess it up.

Other than that I prefer merge commits.

@abrokenjester
Copy link
Contributor Author

I understand your personal preferences. Nevertheless: is the compromise I proposed workable for you, at least? I'm juggling different requirements and preferences here and we need a solution that is perhaps not anyone's ideal way, but at least doable and consistent. I'm trying to avoid a git history where half of us use rebase and half use merge commits.

@hmottestad
Copy link
Contributor

I don't think we should allow rebase and merge. It makes the history harder to read since you now have multiple commits on master for a single feature.

I would like to recommend

  • use "Squash and merge" for most PRs
  • use "merge" only if you are willing to rebase your branch manually to make sure that history is perfectly linear
  • use "merge" when merging master into develop

I also believe that we need these rules to be more or less automated. So that if you haven't rebased your branch, the merge button isn't active. And that the "squash and merge" button shouldn't be allowed when merging master into develop.

Would also be nice to know what other projects are doing?

@hmottestad
Copy link
Contributor

Btw. Some people discussing "rebase and merge" feature: isaacs/github#1143

@abrokenjester
Copy link
Contributor Author

abrokenjester commented Mar 31, 2020

I don't think we should allow rebase and merge. It makes the history harder to read since you now have multiple commits on master for a single feature.

That's actually a very good point, and it's part of what I'm trying to avoid by recommending squash and merge as the default strategy: ideally, a single feature should correspond to a single commit on the master/develop branch.

The only reason I left the option to rebase and merge in is that I wanted to allow for the corner case where a PR has multiple commits and you don't want them to end up as a single commit in master/develop.

I would like to recommend

  • use "Squash and merge" for most PRs

Agreed.

  • use "merge" only if you are willing to rebase your branch manually to make sure that history is perfectly linear

So if I understand correctly you're proposing this as the strategy for PRs with multiple commits where you want to avoid everything being squashed, instead of using rebase and merge (which is what I proposed)?

The only difference I see is that if we have this situation:

           A  
           |      
           +----> B 
           |      |
(master) - C      D 
                  |
                  E - (feature)

using rebase-and-merge we'd end up with this:

           A  
           |      
           C 
           |      
           B  
           |      
           D
           |      
(master) - E  

and with your proposed strategy (rebase branch followed by merge) we'd get this:

           A  
           | 
           C  
           |      
           +----> B 
           |      |
           |      D 
           |      |
           |      E 
           |      |
(master) - F <----+  

Do I understand that correctly? Why do you find this preferable over the rebase-and-merge approach (which is completely linear)? My point being that the rebase-and-merge is also more consistent with the notion of using squash-and-merge as the default, which would look like this for this situation:

           A  
           |      
           C 
           |      
(master) - BDE  
  • use "merge" when merging master into develop

Agreed.

I also believe that we need these rules to be more or less automated. So that if you haven't rebased your branch, the merge button isn't active.

I believe that's doable by enabling 'require branches to be up to date before merging' status check. I'll investigate that.

And that the "squash and merge" button shouldn't be allowed when merging master into develop.

I'd love it if that were possible, but I don't believe Github allows such fine-grained control. We can disable merge commits on protected branches (basically enforcing that we can only use squash and merge or rebase), but I don't think we can enforce "no squash and merge" on develop.

Would also be nice to know what other projects are doing?

I believe the git world is firmly split in two roughly equal camps. The Bigendians prefer an accurate history, the Littleendians prefer a clean one.

@hmottestad
Copy link
Contributor

Rebase and merge removes the “this was initially a feature branch” information. The regular merge commit adds value. It’s lets me know that some commits actually belong together.

@abrokenjester
Copy link
Contributor Author

Rebase and merge removes the “this was initially a feature branch” information. The regular merge commit adds value. It’s lets me know that some commits actually belong together.

I see. My thinking here was that if commits "belong together" they should be squashed into one. A meaningful commit is self contained.

I'm also worried that we can't enforce rebase of a branch before merge commit. If you merge the base into your branch instead of using rebase, the history still won't be linear.

But I can live with the approach as a compromise if we can ensure that it definitely is the exception and not the rule, and we strive to use squash and merge as often as possible.

@hmottestad
Copy link
Contributor

hmottestad commented Mar 31, 2020

Thank you @jeenbroekstra . I really hope we can figure out a way to make this "automated" in some manner. Maybe start by adding a note to the pull-request template?

@abrokenjester
Copy link
Contributor Author

abrokenjester commented Mar 31, 2020

Unfortunately there is very little we can do to really automate any of this. The idea I had earlier to use branch protection to ensure a branch is rebased won't work, and will in fact hinder more than it helps (since it will require it for all PRs, even the ones where we would simply use squash and merge anyway).

I still want to make it clear that the intent is to have as few merge commits as possible. We are striving for a clean linear history here. I've iterated the reasons before, but here's a good blog post from a different team about the advantages of the strategy:

https://blog.dnsimple.com/2019/01/two-years-of-squash-merge/

How I will formalize/communicate our strategy is as follows:

  1. I will disable the 'rebase and merge' option.
  2. I will add a note to the pull request template to indicate that all pull requests will be merged using squash-and-merge, and a link to a page explaining our merge strategy in more detail.
  3. on that merge strategy page, I will make the following points:
    • we value a clean linear history over a detailed, accurate history
    • we define a self-contained change as a change that addresses only a single issue, addresses it completely, and does not also address other issues. We expect every pull request to be a self-contained change (note that does not mean a single commit - a PR can have several commits that together form a self-contained change).
    • we merge every pull request using squash and merge. Since the PR is self-contained, the resulting squashed merge will also be self-contained.
    • the only exception is the pull request that keeps master and develop in sync: that will always be merged via a merge commit.

I will address some common concerns, such as your concern that sometimes it helps to see the original individual commits. This is partly countered by the 'self-contained' argument, but also: the individual commits are still retained in the pull request page, which is preserved on github even after the PR is closed, and is linked both in the commit message after merging, and on the original github issue as well. In the rare cases where this level of detail is relevant, you can therefore still find it, even after things have been squashed.

I will also add an explanation that in very rare cases, by exception, we allow a pull request to be merged using a merge commit. This will only be done if the following conditions are all met:

  • the author explicitly comments on the pull request that this is necessary (and why), and;
  • the author can show that the PR has been rebased (not merged!) so that the result of the merge will be near-linear, and;
  • the project lead has given explicit approval of the intent to use a merge commit. The project lead reserves the right to overrule and use squash-and-merge in the interest of preserving a clean history.

Of course, this explanation is really only of relevance to the people with merge privileges. PRs by third party contributors will, as a rule, always be merged using squash and merge.

abrokenjester added a commit to eclipse-rdf4j/rdf4j-doc that referenced this issue Mar 31, 2020
- added links between workflow and merge strategy page
- updated workflow in several places to conform to current understanding
abrokenjester added a commit that referenced this issue Mar 31, 2020
- added checklist to help authors
- added links to more detailed explanation of merge strategy on website
abrokenjester added a commit to eclipse-rdf4j/rdf4j-doc that referenced this issue Apr 1, 2020
- added links between workflow and merge strategy page
- updated workflow in several places to conform to current understanding
@abrokenjester
Copy link
Contributor Author

PRs up for a change to the PR template, and updates of the rdf4j developer documentation.

@abrokenjester
Copy link
Contributor Author

Oh there is of course one more exception to squash and merge, and that is the pull request that merges a release branch back into master after release (it's new that we're even using a release branch for instead of only a tag, see #1995 for details).

abrokenjester added a commit that referenced this issue Apr 1, 2020
This will not work with squash and merge strategy because PR description
does not end up in the commit message (which is what triggers
auto-close).
abrokenjester added a commit to eclipse-rdf4j/rdf4j-doc that referenced this issue Apr 2, 2020
)

* eclipse-rdf4j/rdf4j#2011 added merge strategy steps and motivation

- added links between workflow and merge strategy page
- updated workflow in several places to conform to current understanding

* GH-2011 release description updated to conform with squash and merge as default
abrokenjester added a commit that referenced this issue Apr 2, 2020
* GH-2011 explain merge strategy in PR template

- added checklist to help authors
- added links to more detailed explanation of merge strategy on website

* GH-2011 PR checklist and merge strategy added to template

* GH-2011 PR description no longer auto-closes issue
  This will not work with squash and merge strategy because PR description does not end up in the commit message (which is what triggers auto-close).
abrokenjester added a commit that referenced this issue Apr 2, 2020
- also added some markdown to make PR message prettier
abrokenjester added a commit that referenced this issue Apr 2, 2020
…2060)

- also added some markdown to make PR message prettier
@abrokenjester abrokenjester reopened this Oct 8, 2020
@abrokenjester
Copy link
Contributor Author

As also discussed on the rdf4j-dev list (see https://www.eclipse.org/lists/rdf4j-dev/msg01651.html) it turns out that squash-and-merge has a serious problem with attribution: it overwrites the author field, which, particularly for third party contributors (where we need the author field to correspond with the sign-off for ECA verification) causes IP provenance problems.

I am at this point resigned to give up on any and all "clever" schemes, and going back to option 1, merge commits, with the occasional request to PR authors to do a manual squash/rebase (to clean up the commit history and rewrite commit messages where necessary).

I will adapt our PR template and the workflow documentation accordingly, and will also ask the Eclipse webmaster to disable squash-and-merge as an option again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🔧 internal task a project workflow / setup task not directly impacting end users
Projects
None yet
2 participants