- Before you start
- Workflow
- Best Practices
- How-to and troubleshooting
- Submodules
- GitHub is the place where we keep our code
git
is the tool (program) for version control- We interact with GitHub via
git
- Use public key for authorization
- You can add a new public key here GH -> Personal settings -> SSH keys
- More details about what is public key you can find in all.ssh.how_to_guide.md
- Read at least the first 3 chapters of Git book
- Read about
Git Submodules
- We use Git submodules to compose and share code about repos
-
Run
git fetch
# Fetch all the data from origin. > git fetch # List all the branches. > git branch -r origin/HEAD -> origin/master origin/PTask274 ...
-
Checkout master and pull
- You want to branch from the latest version of master to avoid a merge:
# Checkout the `master` branch. > git checkout master # Make sure your local `master` is in sync with the remote. > git pull --rebase
- Alternatively, and especially if you have local changes to move to a new
branch, run
> git checkout master > i git_pull
- You want to branch from the latest version of master to avoid a merge:
-
Name a branch after its corresponding issue
- The canonical name for a new feature branch is obtained by running
i gh_issue_title
:> i gh_issue_title -i 274 INFO: > cmd='/Users/saggese/src/venv/amp.client_venv/bin/invoke gh_issue_title -i 274' report_memory_usage=False report_cpu_usage=False ## gh_issue_title: issue_id='274', repo_short_name='current' ## gh_login: 07:35:54 - INFO lib_tasks_gh.py gh_login:48 account='sorrentum' export GIT_SSH_COMMAND='ssh -i /Users/saggese/.ssh/id_rsa.sorrentum.github' gh auth login --with-token </Users/saggese/.ssh/github_pat.sorrentum.txt # Copied to system clipboard: CmTask274_Update_names: https://github.com/sorrentum/sorrentum/pull/274
- Before running verify that GitHub cli
gh
works> gh --version gh version 2.29.0 (2023-05-10) https://github.com/cli/cli/releases/tag/v2.29.0
- Before running verify that GitHub cli
- The name is
CmTask274_Update_names
- To use multiple branches for a given task, append a numeral to the name,
e.g.,
CmTask274_Update_names_v02
- The canonical name for a new feature branch is obtained by running
-
Create and checkout the branch
> git branch my_feature > git checkout my_feature
- Alternatively, you can create and checkout in one command with
> git checkout -b my_feature
- From this point on, you commit only in the branch and changes to master will not affect your branch
- If the branch already exists, check out the branch by executing
> git checkout my_feature
- Alternatively, you can create and checkout in one command with
-
Commit your work early and often
- Commits on your feature branch do not affect master. Checkpoint your work
regularly by committing:
> git status On branch my_feature > git add ... > git commit [my_feature 820b296] My feature is awesome!
- Commits stay local (not seen on GitHub) until you explicitly tell git to "upload" the commits through git push (see next)
- Commits on your feature branch do not affect master. Checkpoint your work
regularly by committing:
-
Push your feature branch changes upstream
- When you commit, commits are local (not seen on GitHub)
- When you want your code to be pushed to the server (e.g., to back up or to
share the changes with someone else), you need to push the branch upstream
> git push -u origin my_feature ... 30194fc..820b296 my_feature -> my_feature Branch 'my_feature' set up to track remote branch 'my_feature' from 'origin'.
- Note that
-u
tells git to set the upstream of this branch to origin - This operation is needed only the first time you create the branch and not
for each
git push
-
Merge
master
into your feature branch regularly- Merge
master
into your feature branch at least once a day, if the branch stays around that long:# Get the .git from the server > git fetch # Make your master up to origin/master. > git checkout master > git pull # Merge master into my_feature branch. > git checkout my_feature > git merge master
- A simpler flow which should be equivalent
- TODO(gp): Verify that
# Get the .git from the server > git fetch # Merge master into my_feature branch. > git checkout my_feature > git merge origin/master
- TODO(gp): Verify that
- Merge
-
Repeat Steps 4-7 as needed
-
Request a review of your work by making a pull request (PR)
- Verify that your work is ready for a review by going through this checklist:
- The PR is self-contained
- The latest
master
has been merged into the feature branch - All files in the PR have been linted with
linter.py
- All tests pass
- If your work is ready for review, make a pull request
- Use the GitHub UI (for now; we may replace with a script). Go to the branch on the web interface and push "Compare & pull request"
- Make sure that GP and Paul are assigned as reviewers, as well as anyone else who may be interested
- Make sure that GP and Paul are assigned as assignees
- Follow up on all comments and mark as resolved any requested changes that you resolve
- Verify that your work is ready for a review by going through this checklist:
- Avoid checking in large data files
- The reason is that large files bloat the repo
- Once a large file is checked in, it never goes away
- Therefore, DO NOT CHECK IN DATA FILES IN EXCESS OF 500K
- If in doubt (even on a branch), ask first!
- Sometimes is makes sense to check in some representative data for unit tests
- BUT, larger tests should obtain their data from s3 or MongoDB
- One of the advantages of working with Git is that branches are cheap
- In an ideal world
master
branch is sacred (see Platinum rule of Git)- Development should never be done directly on master
- Changes to master should only happen by pull-request or merge
- One should avoid working in master except in rare cases, e.g., a simple urgent bug-fix needed to unblock people
master
should be always never broken (all tests are passing and it is deployable)
- Generally it is best to be the sole contributor to your branch
- If you need to collaborate with somebody on a branch, remember that the golden rule of rebase still applies to this "public" branch: "do not rebase pushed commits"
- It is ok to open multiple branches for a given task if needed
- E.g., if you have multiple chunks of work or multiple people are working on orthogonal changes
- It might be that the task is too big and needs to be broken in smaller bugs
- All the rules that apply to
master
apply also to a branch- E.g., commit often, use meaningful commit messages.
- We are ok with a little looser attitude in your branch
- E.g., it might be ok to not run unit tests before each commit, but be careful!
- Use a branch even if working on a research notebook
- Try to avoid modifying notebooks in multiple branches simultaneously, since notebook merges can be painful
- Working in a branch in this case facilitates review
- Working in a branch protects the codebase from accidental pushes of code changes outside of the notebook (e.g., hacks to get the notebook working that need to be cleaned up)
- It is easier for you to keep work sane and separated
- Cons of multiple conceptual changes in the same branches
- You are testing / debugging all at once which might make your life more difficult
- Reviewing unrelated changes slows down the review process
- Packaging unrelated changes together that means no change gets merged until all of the changes are accepted
- Make sure your PR is coherent
- It may not need to do everything the Task requires, but the PR should be self-contained and not break anything
- If you absolutely need changes under review to keep going, create the new
branch from the old branch rather than from master (less ideal)
- Try to avoid branching from branches
- This creates also dependencies on the order of committing branches
- You end up with a spiderweb of branches
- Try to avoid branching from branches
- Frequent small PRs are easier to review
- You will also experience faster review turnaround
- Reviewers like working on smaller changes more than working on larger ones
- PR review time does not scale linearly with lines changed (may be more like exponential)
- Merging changes frequently means other people can more easily see how the code
is progressing earlier on in the process, and give you feedback
- E.g., "here it is a much simpler way of doing this", or even better "you don't need to write any code, just do <this_and_that>"
- Merged changes are tested in the Jenkins build
- You can run the script
dev_scripts/git/git_branch.sh
to get all the branches together with some information, e.g., last commit and creator - E.g., let's assume we believe that
PTask354_INFRA_Populate_S3_bucket
is obsolete and we want to delete it:- Get
master
up to date> git checkout master > git fetch > git pull
- Merge
master
into the target branch - Pull and merge
> git checkout PTask354_INFRA_Populate_S3_bucket > git pull > git merge master
- Resolve conflicts
> git commit > git pull
- Ask Git if the branch is merged
- One approach is to ask Git if all the changes in master are also in the
branch
> git branch `PTask354_INFRA_Populate_S3_bucket` --no-merged PTask354_INFRA_Populate_S3_bucket
- Note that Git is very strict here, e.g.,
PTask354_INFRA_Populate_S3_bucket
is not completely merged since I've moved code "manually" (not only throughgit cherry-pick, git merge
) - One approach is to just merge
PTask354_INFRA_Populate_S3_bucket
into master and rungit branch
again
- One approach is to ask Git if all the changes in master are also in the
branch
- Manually check if there is any textual difference
- Another approach is to check what the differences are between the branch
and
origin/master
> git log master..HEAD 6465b0c saggese, 25 seconds ago : Merge branch 'master' into PTask354_INFRA_Populate_S3_bucket (HEAD -> PTask354_INFRA_Populate_S3_bucket, origin/PTask354_INFRA_Populate_S3_bucket) > git log HEAD..master
- Here we see that there are no textual differences
- So we can either merge the branch into
master
or just kill directly
- Another approach is to check what the differences are between the branch
and
- Kill-kill-kill!
- To delete both the local and remote branch you can do
> git branch -d PTask354_INFRA_Populate_S3_bucket > git push origin --delete PTask354_INFRA_Populate_S3_bucket
- To delete both the local and remote branch you can do
- Get
- If you are working in a branch, before doing
git push
make sure the branch is not broken (e.g., from a mistake in merge / rebase mess) - A way to check that the branch is sane is the following:
- Make sure that you don't have extra commits in your branch:
- The difference between your branch and master
bash > git fetch > git checkout <BRANCH> > git log origin/master..HEAD
shows only commits made by you or, if you are not the only one working on the branch, only commits belonging to the branch with the samePTaskXYZ
- E.g., if George is working on
PTask275
and sees that something funny is going on:a379826 Ringo, 3 weeks ago : LemTask54: finish dataflow through cross-validation 33a46b2 George, 2 weeks ago : PTask275 Move class attributes docstrings to init, change logging
- The difference between your branch and master
- Make sure the files modified in your branch are only the file you expect to
be modified
> git fetch > git checkout <BRANCH> > git diff --name-only master..HEAD
- If you see that there is a problem, don't push upstream (because the branch will be broken for everybody) and ask a Git expert
- Make sure that you don't have extra commits in your branch:
- You can see the files modified in a given commit hash with:
> git show --name-only $HASH
- E.g.,
> git show --name-only 39a9e335298a3fe604896fa19296d20829801cf2 commit 39a9e335298a3fe604896fa19296d20829801cf2 Author: Julia <julia@...> Date: Fri Sep 27 11:43:41 2019 PTask274 lint vendors/cme/utils.py vendors/first_rate/utils.py
- To see the files in conflicts
git diff --name-only --diff-filter=U
- This is what the script
git_conflict_files.sh
does
> git checkout --theirs $FILES
> git add $FILES
- TODO(gp): Fix this ours and theirs. The first option represents the current
branch from which you executed the command before getting the conflicts, and
the second option refers to the branch where the changes are coming from.
> git show :1:README > git show :2:README > git show :3:README
- Stage #1 is the common ancestor of the files, stage #2 is the target-branch version, and stage #3 is the version you are merging from.
- If one screws up a branch:
- Rebase to master
- Resolve the conflicts
- E.g., pick the
master
version when needed:git checkout --theirs ...; git add ...
- E.g., pick the
- Diff the changes in the branch vs another client at
master
> diff_to_vimdiff.py --dir1 $DIR1/amp --dir2 $DIR2/amp --skip_vim Saving log to file '/Users/saggese/src/...2/amp/dev_scripts/diff_to_vimdiff.py.log' 10-06_15:22 INFO : _parse_diff_output:36 : Reading '/tmp/tmp.diff_to_vimdiff.txt' # DIFF: README.md # DIFF: core/dataflow.py # DIFF: core/dataflow_core.py # DIFF: core/test/test_core.py # DIFF: dev_scripts/diff_to_vimdiff.py # ONLY: diff_to_vimdiff.py.log in $DIR1/dev_scripts # DIFF: dev_scripts/grc # ONLY: code_style.txt in $DIR2/docs/notes ... # DIFF: vendors/test/test_vendors.py
- Diff / merge manually the files that are different
> diff_to_vimdiff.py --dir1 $DIR1/...2/amp --dir2 $DIR2/...3/amp --skip_vim > --only_diff_content # DIFF: README.md # DIFF: core/dataflow.py # DIFF: core/dataflow_core.py # DIFF: core/test/test_core.py ...
> git reset --soft HEAD~
- Look at all the branches available:
# Fetch all the data from origin. > git fetch # List all the branches. > git branch -r origin/HEAD -> origin/master origin/PTask274 ...
- Go to the branch:
> git checkout PTask274
- Check what are the commits that are in the current branch HEAD but not in
master
:> gll master..HEAD > git log master..HEAD eb12233 Julia PTask274 verify dataset integrity ( 13 hours ago) Sat Sep 28 18:55:12 2019 (HEAD -> PTask274, origin/PTask274) ... a637594 saggese PTask274: Add tag for review ( 3 days ago) Thu Sep 26 17:13:33 2019
- To see the actual changes in a branch you can't do (Bad)
> git diff master..HEAD
sincegit diff
compares two commits and not a range of commits likegit log
(yes, Git has a horrible API) - What you need to do is to get the first commit in the branch and the last from
git log
and compare them:> git difftool a637594..eb12233 > gd a637594..eb12233
- You can see what commits are in master but missing in your branch with:
> gll ..master de51a7c saggese Improve script to help diffing trees in case of difficult merges. Add notes from reviews ( 5 hours ago) Sat Oct 5 11:24:11 2019 (origin/master, origin/HEAD, master) 8acd60c saggese Add a basic end-to-end unit test for the linter ( 19 hours ago) Fri Oct 4 21:28:09 2019 …
- You want to
rebase
your feature branch ontomaster
- This is useful if we want to focus on changes on a single dir
> git ll master..PTask274 vendors/cme 39a9e33 Julia PTask274 lint ( 2 days ago) Fri Sep 27 11:43:41 2019 c8e7e1a Julia PTask268 modify according to review16 ( 2 days ago) Fri Sep 27 11:41:47 2019 a637594 saggese PTask274: Add tag for review ( 3 days ago) Thu Sep 26 17:13:33 2019
> git diff --name-only a637594..33a46b2 -- vendors helpers helpers/csv.py vendors/cme/utils.py vendors/first_rate/Task274_verify_datasets.ipynb vendors/first_rate/Task274_verify_datasets.py vendors/first_rate/reader.py vendors/first_rate/utils.py vendors/test/test_vendors.py
- If your branch lives long, you want to apply changes made on master to show on your branch
- Merge flow
- Assume your branch is clean
- E.g., everything is committed, or stashed
- Pull changes from
master
on the remote repo> git checkout master > git pull
- Checkout your feature branch
> git checkout my_feature
- Merge stuff from
master
tomy_feature
> git merge master --no-ff ... editor will open a message for the merge commit ...
- In few informal words, the
--no-ff
option means that commits are not "inlined" (similar to rebase) even if possible, but a merge commit is always used- The problem is that if the commits are "inlined" then you can't revert the change in one shot like we would do for a merge commit, but you need to revert all the inlined changes
- For now, we suggest avoiding the rebase flow
- The reason is that rebase makes things cleaner when used properly, but can get you into deep trouble if not used properly
- You can rebase onto
master
, i.e., you re-apply your changes tomaster
- Not the other way around: that would be a disaster!
> git checkout my_feature # See that you have that master doesn't have. > git ll origin/master.. # See that master has that you don't have. > git ll ..origin/master > git rebase master > git ll ..origin/master # Now you see that there is nothing in master you don't have > git ll origin/master.. # You can see that you are ahead of master
- The procedure for manual merges is as follows
- Do not merge yourself unless explicitly requested by a reviewer
- Pull changes from remote
master
branch> git checkout master > git pull
- Merge your branch into
master
without fast-forward> git merge --no-ff my_feature
- Push the newly merged master
> git push
- Delete the branch, if you are done with it:
> git branch -d my_feature
- Following the instructions in https://git-scm.com/book/en/v2/Git-Tools-Submodules
- When you work in a submodule, the flow should be like:
- Create a branch in a submodule
- Do your job
- Push the submodule branch
- Create a PR in the submodule when you are done
- After the submodule PR is merged:
- Checkout the submodule in the master branch and do
git pull
- In the main repo, create a branch like
PTask1234_update_submodule
- From the new branch do
git add <submodule_name>
, e.g.,git add amp
- Commit changes, push
- Create a PR
- Checkout the submodule in the master branch and do
- Run the script:
> dev_scripts/git/git_submodules_are_updated.sh
- Run the script:
> dev_scripts/git/git_submodules_roll_fwd.sh
> git submodule foreach git clean -fd
- This is useful when merging
master
in a different branch and we don't want to checkout master just to pull> git fetch origin master:master
- Run the script
> dev_scripts/git/git_submodules_pull.sh
or> git submodule update --init --recursive` > git submodule foreach git pull --autostash