SoC 2012 Ideas
This page contains project ideas for Google Summer of Code 2012 from the Git user and development community.
If you're active in the git community and have an idea that you would feel comfortable mentoring, feel free to add it. If you have an idea but need to find a suitable mentor, please bring it up on the git mailing list; others can help you develop the idea and may volunteer to mentor.
If you're a prospective GSoC student, read through the ideas and see if any interest you. But note that these are ideas, not project proposals; they may need details filled in or expanded to make a good project. Find an area that interests you and start a discussion on the git mailing list, even if it's just by asking more about the topic. A good proposal will be clear about the problem to be solved, the history of work in that area, and the specifics of the approach that the GSoC project will take. You can find some of those answers by reading the code and searching the list archives, but discussing the idea with interested developers is a great way for both the student and mentor to reach an understanding of exactly what is to be accomplished.
See also the SoC 2012 Student Application Template, which discusses some elements of a good proposal.
When showing the history of a subset of a project (e.g., git log -- foo.c), git shows only the history of changes that affect that single pathname. Because of this, changes made to the content currently at foo.c while it was still called bar.c will not be shown. We have the --follow option, which switches the path being followed to bar.c by following renames, but it has some deficiencies. For example, it follows only a single path, and the path it follows is global, which means that you cannot follow multiple lines of development in which the file was renamed on one side and not the other. Also, it does not interact well with git's usual history simplification (which displays a connected subgraph of the history that pertains to foo.c).
Major topics:
- Expand --follow to handle arbitrary pathspecs.
- Design and implement a new architecture for --follow that will allow it to mark uninteresting commits as part of the usual history simplification process. Note that care must be taken not to impact the performance of non-follow codepaths.
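To make the "global path" limitation concrete, here is a toy model in Python in which the path being followed is carried along each parent edge instead of being stored once for the whole traversal. It is only a sketch: the Commit structure, the per-edge rename maps and the follow function are hypothetical and do not correspond to git's revision-walking code.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    name: str
    # Paths whose content this commit changed (named as of this commit).
    touched: frozenset = frozenset()
    # Parent edges as (parent, renames); renames maps {new_path: old_path}
    # for the diff against that particular parent.
    parents: list = field(default_factory=list)


def follow(tip, path):
    """Return names of commits touching `path`, tracking renames per edge."""
    interesting, seen = [], set()
    stack = [(tip, path)]
    while stack:
        commit, current = stack.pop()
        if (commit.name, current) in seen:
            continue
        seen.add((commit.name, current))
        if current in commit.touched and commit.name not in interesting:
            interesting.append(commit.name)
        for parent, renames in commit.parents:
            # Each side of a merge gets its own idea of the file's old name.
            stack.append((parent, renames.get(current, current)))
    return interesting


# bar.c is renamed to foo.c on one side and edited as bar.c on the other.
root = Commit("root", frozenset({"bar.c"}))
a1 = Commit("a1", frozenset({"foo.c"}), [(root, {"foo.c": "bar.c"})])
b1 = Commit("b1", frozenset({"bar.c"}), [(root, {})])
merge = Commit("merge", frozenset(), [(a1, {}), (b1, {"foo.c": "bar.c"})])

found = follow(merge, "foo.c")
```

With a single global path, the walk would keep looking for foo.c on the side that never renamed the file and would miss b1; here both sides are reported.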
Proposed mentor: Jeff King
Git generally assumes that the content being stored in it is source code, or some other form of text approximately the same size. While git can handle arbitrary-sized binary content, its base assumptions sometimes mean some operations are slow or unnecessary space is consumed for large binary files (e.g., videos or other media).
There has been some work in the last year on streaming files into and out of the object database when possible. However, there is still more work to be done:
- Some code paths still load large files into memory. Some other code paths may convert in-pack large files into loose format. These code paths need to be analyzed and fixed.
- Some code paths, such as diff, require files in memory. When files are not text (as most large files are not), we can often skip this step. However, we sometimes load the whole file just to determine that it is binary, defeating the purpose.
- Many large files (like media) do not delta very well. However, some do (like VM disk images). Git could split large objects into smaller chunks, similar to bup, and find deltas between these much more manageable chunks. There are some preliminary patches in this direction, but they are in need of review and expansion.
- Git copies and stores every object from a remote repository when cloning. For large objects, this can consume a lot of bandwidth and disk space, especially for older versions of large objects which are unlikely to be accessed. Git could learn a new alternate repository format where these seldom-used objects are stored on a remote server and only accessed on demand.
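The chunk-splitting idea in the third point can be illustrated with a small content-defined chunker. The sketch below is in Python and uses a plain rolling byte sum; the window size, mask and minimum chunk size are arbitrary assumptions chosen for readability, not the parameters used by bup or by any git patch series.

```python
import hashlib

WINDOW = 48            # bytes covered by the rolling sum (assumed value)
MASK = (1 << 11) - 1   # cut where the low 11 bits of the sum are all ones
MIN_SIZE = 256         # never emit chunks shorter than this, except the last


def split_chunks(data, window=WINDOW, mask=MASK, min_size=MIN_SIZE):
    """Split `data` at positions chosen by its content, not by fixed offsets."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            # Drop the byte that just left the window.
            rolling -= data[i - window]
        if i - start + 1 >= min_size and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_ids(data):
    """Name each chunk by its SHA-1, so identical chunks are stored once."""
    return [hashlib.sha1(chunk).hexdigest() for chunk in split_chunks(data)]
```

Because a cut depends only on the bytes just before it, an edit in one part of a large file tends to leave the chunks elsewhere unchanged, which is what would let two versions of a disk image share most of their storage.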
Proposed mentor: Jeff King
Git is mostly written single-threaded, with a few commands having bolted-on extensions to support parallel operation (notably git-grep, git-pack-objects and the core.preloadIndex feature).
We have recently looked into some of these areas and made a few optimizations, but a big roadblock is that pack access is entirely single-threaded. The project would consist of the following steps:
- In preparation (the half-step): identify commands that could benefit from parallelism. git grep --cached and git grep COMMIT come to mind, but most likely also git diff and git log -p. You can probably find more.
- Rework the pack access mechanisms to allow the maximum possible parallel access.
- Rework the commands found in the first step to use parallel pack access if possible. Along the way, document the improvements with performance tests.
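The shape of the work in the last step can be sketched as a grep whose per-blob searches are handed to a pool of workers. This is an illustration in Python only: the real project is in C with pthreads, the blobs here live in a plain dictionary rather than in packs, and the function names are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def grep_blob(pattern, path, content):
    """Return (path, line_number, line) for each matching line of one blob."""
    return [(path, number, line)
            for number, line in enumerate(content.splitlines(), start=1)
            if pattern.search(line)]


def parallel_grep(regex, blobs, workers=4):
    """Search every blob in `blobs` ({path: text}) using worker threads."""
    pattern = re.compile(regex)
    paths = sorted(blobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the results in path order regardless of finish order.
        per_blob = pool.map(lambda p: grep_blob(pattern, p, blobs[p]), paths)
        return [hit for hits in per_blob for hit in hits]


hits = parallel_grep(r"x", {"a.c": "int x;\nreturn x;\n", "b.c": "void f();\n"})
```

Each worker needs to read object contents independently; in git that read goes through pack access, which is why the single-threaded pack code is the roadblock named above.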
The actual programming must be done in C using pthreads for obvious reasons. At the very least you should not be scared of low-level programming. Prior experience and access to one or more multi-core computers is a plus.
Proposed by: Thomas Rast
Possible mentor(s): Thomas Rast
Important: this is a very difficult project; see the longer explanations posted to the mailing list.
Git is pretty slow when managing repositories that have a huge number of files in any given tree, as it needs to rewrite the index (in full) on pretty much every operation. For example, even though logically git add already_tracked_file only changes a single blob SHA-1 in the index, Git will verify index correctness during loading and recompute the new hash during writing over the whole index. It thus ends up spending a large amount of time simply on hashing the index.
A carefully designed index format could help in several ways. (For the complexity estimates below, let n be the number of index entries or the size of the index, which is roughly the same.)
- The work needed for something as simple as entering a new blob into the index, which is possibly the most common operation in git (think git add -p etc.), should be at most log(n).
- The work needed for a more complex operation that changes the number of index entries will have to be larger unless we get into database land. However, the amount of data that we SHA-1 over should still be log(n).
- It may be possible to store the cache-tree data directly as part of the index, always keeping it valid, and using that to validate index consistency throughout. If so, this would be a big boost to other git operations that currently suffer from frequent cache-tree invalidation.
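One structure that gives the log(n) hashing bound is a hash tree over the entries: changing one entry only requires rehashing the nodes on the path from that leaf to the root. The Python sketch below illustrates that bound only; the HashTree class and its layout are assumptions made for this example and are not a proposed index format.

```python
import hashlib


def sha1(data):
    return hashlib.sha1(data).digest()


class HashTree:
    """Binary hash tree stored in an array; node i has children 2i and 2i+1."""

    def __init__(self, entries):
        size = 1
        while size < len(entries):
            size *= 2
        self.size = size
        self.nodes = [b""] * (2 * size)
        for i in range(size):
            leaf = entries[i] if i < len(entries) else b""
            self.nodes[size + i] = sha1(leaf)
        for i in range(size - 1, 0, -1):
            self.nodes[i] = sha1(self.nodes[2 * i] + self.nodes[2 * i + 1])

    def root(self):
        return self.nodes[1]

    def update(self, index, entry):
        """Replace one entry; return how many hashes had to be recomputed."""
        pos = self.size + index
        self.nodes[pos] = sha1(entry)
        hashes = 1
        pos //= 2
        while pos >= 1:
            self.nodes[pos] = sha1(self.nodes[2 * pos] + self.nodes[2 * pos + 1])
            hashes += 1
            pos //= 2
        return hashes


entries = [b"entry %d" % i for i in range(1000)]
tree = HashTree(entries)
recomputed = tree.update(42, b"new blob id")  # 11 hashes for 1024 leaves
```

A flat checksum over the same 1000 entries would have to read all of them again after the same change.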
Note that there are other criteria than speed: the format should also be as easy to parse as possible, so as to simplify work for the other .git-reading programs (such as jgit and libgit2). For the same reason, you will also have to show a significant speed boost as otherwise the break in compatibility is not worth the fallout.
The programming work will be in C, as it replaces a core part of git.
Proposed by: Thomas Rast
Possible mentor(s): Thomas Rast
The interface behind git {add|commit|stash|reset} {-p|-i} is shared and called git-add--interactive.perl. This project would mostly focus on the --patch side, as that seems to be much more widely used; however, improvements to --interactive would probably also be welcome.
The --patch interface suffers from some design flaws caused largely by how the script grew:
- Application is not atomic: hitting Ctrl-C midway through patching may still touch files.
- The terminal/line-based interface becomes a problem if diff hunks are too long to fit in your terminal.
- Cannot go back and forth between files.
- Cannot reverse the direction of the patch.
- Cannot look at the diff in word-diff mode (and apply it normally).
Due to the current design it is also pretty hard to add these features without adding to the mess. Thus the project consists of:
- Come up with more ideas for features/improvements and discuss them with users.
- Cleanly redesign the main interface loop to allow for the above features.
- Implement the new features.
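The atomicity problem suggests separating the loop that collects decisions from the step that applies them. The following Python sketch shows one way to structure that, under simplifying assumptions: hunks are modeled as (start, old_lines, new_lines) tuples acting on a plain file rather than on the index, and both function names are hypothetical. The project itself would be done in Perl.

```python
import os
import tempfile


def apply_hunks(lines, hunks):
    """Return a new list of lines with every accepted hunk applied."""
    result = list(lines)
    # Apply from the bottom up so earlier line offsets stay valid.
    for start, old, new in sorted(hunks, key=lambda h: h[0], reverse=True):
        if result[start:start + len(old)] != old:
            raise ValueError(f"hunk at line {start} does not apply")
        result[start:start + len(old)] = new
    return result


def write_atomically(path, lines):
    """Write `lines` to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as out:
            out.writelines(lines)
        os.replace(tmp, path)
    except BaseException:
        # Also runs on Ctrl-C, so a partial result is never left behind.
        os.unlink(tmp)
        raise
```

The interactive loop would only append to the list of accepted hunks; nothing is touched until the user finishes, and interrupting earlier leaves the file as it was.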
As the existing code is written in Perl, that is what you will use for this project.
Proposed by: Thomas Rast
Possible mentor(s): Thomas Rast, Christian Couder
Write a remote helper for Subversion. While a lot of the underlying infrastructure work was completed last year, the remote helper itself is essentially incomplete. Major work includes:
- Understanding revision mapping and building a revision-commit mapper.
- Working through transport and fast-import related plumbing, changing whatever is necessary.
- Getting a Git-to-SVN converter merged.
- Building the remote helper itself.
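At its simplest, the revision-commit mapper in the first item is a lookup that works in both directions. The Python sketch below assumes a one-to-one relation between SVN revisions and git commits, which is a simplification (a single SVN revision can touch several branches); the RevisionMap class is hypothetical, and the actual helper would be written in C.

```python
class RevisionMap:
    """Bidirectional map between SVN revision numbers and git commit ids."""

    def __init__(self):
        self._by_revision = {}
        self._by_commit = {}

    def record(self, revision, commit):
        if revision in self._by_revision or commit in self._by_commit:
            raise ValueError("revision or commit is already mapped")
        self._by_revision[revision] = commit
        self._by_commit[commit] = revision

    def commit_for(self, revision):
        return self._by_revision.get(revision)

    def revision_for(self, commit):
        return self._by_commit.get(commit)

    def latest_revision(self):
        # Where an incremental fetch would resume from; 0 if nothing is mapped.
        return max(self._by_revision, default=0)
```

Fetching needs the revision-to-commit direction to find parents for newly imported revisions, while pushing needs the reverse direction to know which SVN revision a git commit was built on.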
Goal: Build a full-featured bi-directional git-remote-svn and get it merged into upstream Git.
Language: C
See: A note on SVN history, svnrdump.
Proposed by: David Barr
Possible mentors: Jonathan Nieder, Sverre Rabbelier, David Barr
Git.pm was created in 2006 to make it easy to call git commands from Perl scripts safely, in a portable way (including workarounds required for ActiveState Perl on MS Windows). Its error handling via exceptions also comes from that year.
The Git.pm module uses Error (and Error::Simple) for its exception handling. Unfortunately, while it might have looked like a good choice in 2006, the Error module is now deprecated:
WARNING
Using the "Error" module is no longer recommended due to the black-magical nature of its syntactic sugar, which often tends to break. Its maintainers have stopped actively writing code that uses it, and discourage people from doing so. See the "SEE ALSO" section below for better recommendations.
Nowadays the recommended solutions for exception handling in Perl are Try::Tiny (or TryCatch, but I don't think Git.pm would need this more heavyweight module) for capturing and handling exceptions, and Exception::Class (or Throwable, but that requires the heavyweight Moose object system) for throwing OO exceptions.
The major goal would be to update Git.pm to modern Perl conventions, among others moving from Error / Error::Simple to Try::Tiny and Exception::Class while preserving backwards compatibility, but perhaps also adding a better interface and using it in git commands implemented in Perl.
Other optional goals would be to extend Git.pm, for example by adding a Git::Config module which would read the git configuration once, like gitweb does, or a Git::Commit module for parsing commit objects, etc.
Programming language: Perl
Proposed by: Jakub Narębski
Possible mentor(s): Jakub Narębski (?)
Gitweb (git web interface) includes some optional client-side scripting using JavaScript. This includes checking whether JavaScript is available and remembering this information so gitweb can choose the JavaScript-only version of a view (javascript-detection), selecting a common timezone to use when showing dates (adjust-timezone), and an AJAX-y incremental blame view (blame_incremental).
Currently all this is done using hand-written JavaScript. This means that gitweb scripting includes handling cookies, formatting output, processing dates, and smoothing out incompatibilities between browsers (e.g. XMLHttpRequest creation).
This is redoing work that JavaScript libraries such as jQuery, MooTools or YUI have already done. Moreover, if we want to add new features (e.g. a table sorted using JavaScript) or improve existing ones, we would have to re-implement code that already exists in those libraries. Also, our hand-crafted code is not as well tested as widely used JavaScript libraries.
The goal of this project is to move gitweb's client-side scripting to a JavaScript library or framework.
The project would consist of the following steps:
- Add support for configuring and loading an external JavaScript library to gitweb/gitweb.perl and gitweb/Makefile. It would be nice (though not necessary) to be able to use a local version of the library, and to have such a feature well documented.
- Remove gitweb's JavaScript mini-library in gitweb/static/js/lib and replace it part by part with the appropriate JavaScript library functions (methods).
- Replace DOM selectors by the library version, if applicable.
- Optional: emulate 'onprogress' in XMLHttpRequest using the native JavaScript library mechanism (creating a class, or whatever).
- Optional: better deferring of repainting in incremental blame.
Note that we require client-side scripting in gitweb to follow a progressive enhancement strategy: gitweb should work correctly, perhaps with reduced functionality, even if JavaScript is turned off or the external JavaScript library cannot be loaded.
Programming language: JavaScript
Proposed by: Jakub Narębski
Possible mentor(s): Jakub Narębski
git-instaweb is a tool for browsing a repository (in gitweb) with a web browser. To use it, simply run git instaweb inside a repository. It sets up gitweb and a web server, and by default also runs a web browser. The web interface is then available at http://localhost:1234.
For informal, ad-hoc sharing it would be nice if there was an option to git instaweb that would also allow remote machines to pull from you (via HTTP), similar to the hg serve command in the Mercurial version control system.
git-instaweb supports many web servers. Currently apache2, lighttpd, mongoose, plackup and webrick are supported. It is not necessary to implement support for --serve in all of them.
The project would consist of the following steps:
- Configure the web server to run the git-http-backend CGI program to serve git repositories over HTTP. By default it should probably allow only read-only use.
- Configure gitweb to show the fetch URL on the 'summary' page.
- If possible for the given web server (and with the available modules), configure it so that the "smart" HTTP server and gitweb share a common URL (are available under the same URL).
- If possible, use a mechanism native to the web server used, for example Plack::App::GitSmartHttp for the 'plackup' web server, or grack with WEBrick.
- Optionally: add support for pushing (disabled by default).
- Optionally: add documentation about using git instaweb to the "Git User's Manual" (or one of the tutorials), similar to the corresponding chapter in hgbook, i.e. "Mercurial: The Definitive Guide".
The minimum would be to implement pull support under a different URL than the web interface, and only for one web server.
Programming language: shell script
Proposed by: Jakub Narębski
Possible mentor(s): Jakub Narębski, Eric Wong, ...
The library currently has support for fetching code over the http and git protocols. The aim of this project is to finish the support for the remaining network operations:
Major goals:
- Fetch over SSH, including a sane API for managing the user's SSH keys.
- Push over http, git, and SSH, with pack-objects as a prerequisite.
This is not as terribly complicated as it looks, because the existing networking code is well modularized and easy to extend: most of the required sockets functionality is already in place.
Programming language: C89
Possible mentor(s): Vicent Marti, Russell Belfer
The "-3" option that "git am" understands is useful only when you are applying a full format-patch submission. Teaching the three-way fallback to the underlying "git apply" would make the feature available in more use cases, and could later help simplify the implementation of "git am -3".
Programming language: C89
Suggested by: Junio C Hamano
Possible mentors: ???, Christian Couder
Early in the Git development history, Linus envisioned an "ultimate content tracking tool" in a message. Starting from the current codebase, it would dig deeper in the history to answer "where did this line come from?" and even find out:
"oops, that line didn't even exist in the previous version, BUT I FOUND FIVE PLACES that matched almost perfectly in the same diff, and here they are"
We already have the part to dig through the history to find the commit that introduced the line in question as "git rev-list -S", but we do not have any tool to help the "BUT I FOUND FIVE PLACES that matched almost perfectly" part.
Write a tool that can be used for the task, and optionally wrap an interactive UI around it.
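The "matched almost perfectly" part comes down to scoring how close a candidate line is to the line being traced. A minimal sketch in Python using the standard difflib module follows; the similar_lines function and the 0.8 threshold are assumptions made for this example, and a real tool would need a faster matcher to search the lines of a large diff.

```python
from difflib import SequenceMatcher


def similar_lines(target, candidates, threshold=0.8):
    """Return (score, line_number, line) for near matches, best first."""
    hits = []
    for number, line in enumerate(candidates, start=1):
        # ratio() is 1.0 for identical strings and falls towards 0.0.
        score = SequenceMatcher(None, target.strip(), line.strip()).ratio()
        if score >= threshold:
            hits.append((score, number, line))
    return sorted(hits, key=lambda hit: -hit[0])


removed_lines = ["int count = 1;", "return x;", "int count = 0;"]
places = similar_lines("int count = 0;", removed_lines)
```

Here places lists line 3 (an exact match) ahead of line 1 (one character different) and leaves out line 2.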
Programming language: Any
Possible mentors: ???, Christian Couder
git-gui is a portable Tcl/Tk based graphical interface to Git, focused on commit generation and single file annotation. It is part of Git, even though it is developed in a separate repository.
git-gui can show differences as syntax-highlighted unified diff. The goal of this project would be to add graphical side-by-side diff. One possibility is to make use of TkDiff code (a graphical diff and merge tool), which is also GPL licensed.
The project would consist of the following steps (not all must be implemented during Google Summer of Code):
- Add a "git gui diff" subcommand, which would show a single-file side-by-side graphical compare and/or graphical diff.
  The graphical part and the code for side-by-side compare (showing the full contents of both files) can be taken from TkDiff, while graphical side-by-side diff (showing changes plus context) can be translated from gitweb's side-by-side diff code.
- Integrate graphical diff with the main git-gui application (including switching between side-by-side and unified diff).
- Add highlighting of changes in diff both to side-by-side (new code) and to unified diff (current code).
  It can be done using code from TkDiff (perhaps with Git performing word diff instead of doing it in Tcl), and/or using the algorithm from gitweb and diff-highlight in contrib. The difference is that one uses word diff or character diff to highlight changes, while the other just skips the common prefix and suffix.
- Add graphical merge / graphical 3-way diff support. Both are to be used in case a file has textual conflicts; graphical merge also includes resolving a merge by taking 'our' side, 'their' side (or optionally also 'ancestor' side), or by editing the merge result.
- Graphical side-by-side tree-level diff, or side-by-side directory listing with differences highlighted.
  The UI can be taken from two-panel file managers (like MC or Total Commander), or from synchronization tools (e.g. Unison).
  The difficulty can be in showing rename and copy detection results, and in showing type (filemode) changes.
The minimum would be to implement side-by-side diff or side-by-side compare of two files, without highlighting changes (diff refinement highlighting), in the form of a separate "git gui diff <file>" command.
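The core of a side-by-side view is pairing lines from the two versions into rows. The sketch below shows that pairing in Python with the standard difflib module; it is only meant to illustrate the row layout, since the project itself is in Tcl/Tk, and the side_by_side function is hypothetical.

```python
from difflib import SequenceMatcher


def side_by_side(old, new):
    """Pair lines of `old` and `new` into (left, marker, right) rows."""
    rows = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = old[i1:i2], new[j1:j2]
        marker = " " if tag == "equal" else "|"
        # Pad the shorter side with None so both columns stay aligned.
        for k in range(max(len(left), len(right))):
            rows.append((left[k] if k < len(left) else None,
                         marker,
                         right[k] if k < len(right) else None))
    return rows


rows = side_by_side(["a", "b", "c"], ["a", "x", "c", "d"])
```

A GUI would render each row as two text cells, leaving a cell blank where the value is None and coloring rows whose marker is "|".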
Programming language: Tcl/Tk
Proposed by: Jakub Narębski
Possible mentor(s): Pat Thoyts, Paul Mackerras (?)
Git includes protection against rewriting published history on the receive side, with a fast-forward check by default (which can be overridden) and various receive.deny* configuration variables, including receive.denyNonFastForwards.
Nevertheless, git users have requested (among others in the "Git User's Survey") more help on the creation side, namely preventing the rewriting of parts of history that were already made public (or at least warning that one is about to rewrite published history). This requires knowing whether a commit was published or not.
There is a related feature of preventing publishing commits that are "secret" (or simply not ready to be published).
The problem and inspiration for it is described in more detail in this thread (unfortunately split in two in GMane interface; you can use MARC archive instead). You can even find a prototype implementation for parts of this in git mailing list archives.
The project would consist of the following elements (not all must be implemented during Google Summer of Code; the scope of the project would have to be discussed on git mailing list prior to application):
- Some way of querying a commit's "published" and/or "secret" status.
  Perhaps at later stages, adding a new pretty format to git to be able to show the "published" and "secret" traits in git show and git log output.
- Some way of marking a commit as "secret" (e.g. using notes).
- Preventing a history-changing operation when it is to act on a published part of history, or just warning about it. This includes but is not limited to rebase, interactive rebase, amending a commit and rewinding a branch (using git reset).
  Note that some commands would require extending git to be able to prevent or warn about rewriting published history, e.g. by adding a pre-rewrite hook.
- Adding support for less common ways of publishing: bundle, send-email, fetch / pull-request for the "published" trait.
- Making those traits local to a repository. For example, we might want to consider commits changeable if they were published to a group repository, or make it possible to share "secret" commits in a group repository.
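One possible reading of "published" is "reachable from a remote-tracking branch". Under that assumption, the check a history-changing command would run can be sketched in Python as below. The commit graph is a plain dictionary and both function names are hypothetical; a real implementation would query git's own reachability machinery.

```python
def reachable(parents, tips):
    """Return every commit reachable from `tips` in a {commit: [parents]} graph."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def rewritten_published(parents, remote_tips, old_tip, new_tip):
    """Return published commits that moving old_tip to new_tip would discard."""
    published = reachable(parents, remote_tips)
    discarded = reachable(parents, [old_tip]) - reachable(parents, [new_tip])
    return discarded & published


# Linear history A <- B <- C <- D, with C already pushed; D2 is an amended D.
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "D2": ["C"]}
after_reset = rewritten_published(history, ["C"], "D", "B")   # rewinds past C
after_amend = rewritten_published(history, ["C"], "D", "D2")  # only D replaced
```

A non-empty result is the case where the command would refuse or warn; amending the unpublished tip D yields an empty set and would proceed silently.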
Programming language: Any
Proposed by: Jakub Narębski
Possible mentor(s): ???
- Previous year's SoC ideas: SoC2011Ideas, SoC2010Ideas
- Git users survey
- Small project ideas (probably too small for a SoC)