What should we teach about provenance? #429
Comments
I haven't actually used it, but there is a nice-looking package called Sumatra that is designed to do provenance tracking: it records information from the VCS in a database when you run a program, script, etc. through its interface. I'm not advocating that we teach or recommend this particular implementation, but in case others haven't seen it, it may be useful. However, taking a step back, even if Git added revision info to the actual file, like SVN, we would really need to talk about some simple form of logging. If the print statements are just going to stdout, then even if you have access to that information, there is no record tying the results to the run. Given revision/version information, one option is to talk about actually recording provenance information in the results file. That's great if you have control of the results format and file type. What happens if you're using Python or bash to script calling some 3rd-party program or module that writes its own files? Then you need to build your own logging infrastructure. Maybe we can't teach an actual implementation, but we can speak generally about strategies and things to consider when developing an analysis or data generation pipeline. |
A related note: I’m working on an R package called https://github.com/ropensci/git2r. If you're one of the cool kids writing in markdown, you can throw this in the footnotes or the acknowledgements (
> markdown_link()
[1] "[58c507](https://github.com/ropensci/git2r/commit/58c5075082d35e63312cf27425283e8e203b69c6)" |
It's also possible to get the > cat .git/HEAD
ref: refs/heads/master
> cat .git/refs/heads/master
29c9fc8cfd5dc151dacf7c1e769d3b23eda549b5 That may not be something we want to try to explain to our learners, but it could lead to some more language-independent ideas. |
With git you can call Integrated in a python script, you can print out your own revision using Example of a script that prints its own revision:
Example use (invoked from outside the git working tree):
You can simplify the git-call if you assume that you run within the git tree, no need to do the directory change then. So you can tutorial wise begin on the command line, then introduce subprocess, then add the directory handling If you use Python 2.7 or later you can make a function that use |
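The code from this comment did not survive, so here is a minimal sketch of the idea it describes, not the original script. It assumes Python 3 and a `git` executable on the PATH; the function name `git_revision` is made up for illustration. It runs `git rev-parse HEAD` in the directory that holds the script, so it works even when invoked from outside the working tree.

```python
import os
import subprocess
import sys


def git_revision(path):
    """Return the commit hash of the repository containing `path`, or None.

    `path` is a file inside the working tree (typically the script itself).
    None is returned if git is missing or the directory is not in a repository.
    """
    directory = os.path.dirname(os.path.abspath(path))
    try:
        output = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=directory,              # run git where the script lives
            stderr=subprocess.DEVNULL,  # hide "not a git repository" noise
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return output.decode("ascii").strip()


if __name__ == "__main__":
    # sys.argv[0] is the path the script was started with.
    print("Revision:", git_revision(sys.argv[0]))
```

If the script is always run from inside the working tree, the `cwd` handling can be dropped, which matches the simplification suggested above.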
For a project I'm working on we create calibration reference files which are distributed to the community. When we switched from SVN to git we started creating git tags to tag a given commit with the reference file name and when it was made available to the community - we also put this tag in the header of the file. |
+1 to teaching logging instead of embedding the revision number in the output. This information always seemed to me to be a characteristic of the entire output of a command (documents + aux files + figures etc.), not of a particular file/document, and thus it seems that it should live in a log file associated with all of the output. I can see at least two practical advantages as well -
Either way, I think extracting the git commit hash is the right way to go, but we need to include a flag that indicates (at a minimum) whether the repo was clean at the time of the command. Otherwise the output won't match the file state at the recorded commit hash if there have been changes since the last commit. |
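As a rough sketch of what such a flag could look like (assuming Python 3 and `git` on the PATH; `repo_state` and `format_log_line` are hypothetical names), the hash and the clean/dirty state can be collected together and written to the log as one line:

```python
import subprocess


def repo_state(repo_dir="."):
    """Return (commit_hash, is_clean) for the repository at repo_dir."""
    def run(*args):
        output = subprocess.check_output(("git",) + args, cwd=repo_dir)
        return output.decode().strip()

    commit = run("rev-parse", "HEAD")
    # `git status --porcelain` prints one line per modified, staged, or
    # untracked file, so empty output means the working tree is clean.
    is_clean = run("status", "--porcelain") == ""
    return commit, is_clean


def format_log_line(commit, is_clean):
    """Format the state as a single line suitable for a log file."""
    suffix = "" if is_clean else " (dirty: uncommitted changes present)"
    return "git commit: {}{}".format(commit, suffix)
```

A script would call `format_log_line(*repo_state())` once at start-up and write the result to the log that accompanies its outputs. Whether untracked files should count as "dirty" is a judgment call; this sketch counts them.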
@gvwilson Please correct me if I misunderstand your question. You want the following behavior:
I believe that you can accomplish this behavior using a git hook. I can't write a proof of concept for it right now. |
@jkitzes Nice, thanks for the heads up. I will modify the function to add a flag (or warning) in case the repo is not clean. |
@karthik happy to help. On that note, here's another issue that I came across when I was trying to solve this same problem a few years ago (which I never did, as I couldn't think of a good solution to the issues below). This problem arises if there is an actual substitution occurring in a file that's under version control itself (like a LaTeX source file), which is then compiled into some output that's supposed to have the version number/hash. In this case, the commit has to happen before the substitution (so that the hash is available), but once the substitution happens, the repo will be dirty again, with the only change being to the revision text (since the substitution happened after the commit). This means that you effectively "re-dirty" your repo with every commit. Question: how does Subversion deal with this (perhaps it calculates the revision number first, updates the files, then commits everything including the updated numbers)? AFAIK there's no way around this given how the git hashes are generated. |
Excellent question, @jkitzes |
Our solutions so far work for code by retrieving the repo state at runtime and embedding it in products (which may or may not be tracked). To do similar for something like LaTeX one would probably need to turn to a tool like dexy that allows you to run code and embed the output in your document source as part of the compilation process. |
@jiffyclub, so when the repo state is embedded in a tracked product during a run, doesn't this dirty the repo again? |
What you want to avoid is having repo state embedded in the product generators so that it's impossible to log the state of the source. So long as the repo state is only embedded in end products I think you avoid the paradox. (But that raises the question of why you're versioning end products in the first place.) |
@stain You can call git rev-parse from outside the repository by specifying the --git-dir option. For example: git --git-dir=/path/to/repository/.git rev-parse HEAD Of course, this requires knowing the location of the .git directory ahead of time. If the file in question is in the top-level directory of the repository, you can do: git_directory = os.path.join(os.path.dirname(sys.argv[0]), '.git') If the file happens to be in a sub-directory though, this won't work. Offhand, I don't know how to get the location of the .git directory for a file at an arbitrary level in a repository, short of walking up the directory tree yourself. This wouldn't be too difficult, but I'm not sure it's necessarily a better solution than just cd'ing to the directory... FWIW, when I tackled this issue switching from svn to git, I found it a lot easier to log the current commit hash (plus clean or dirty state) of the repository in the program output than to store it directly in the program itself. |
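Walking up the directory tree yourself could look roughly like the sketch below. It assumes Python 3; `find_git_dir` is a made-up name. It checks with `os.path.exists` rather than `os.path.isdir` because `.git` can also be a plain file (for example in submodules).

```python
import os


def find_git_dir(start_path):
    """Walk up from start_path and return the first '.git' entry found.

    start_path may be a file or a directory. Returns None if the
    filesystem root is reached without finding a '.git' entry.
    """
    current = os.path.abspath(start_path)
    if not os.path.isdir(current):
        # Start from the directory that contains the file.
        current = os.path.dirname(current)
    while True:
        candidate = os.path.join(current, ".git")
        if os.path.exists(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent
```

The returned path can then be passed to git through the `--git-dir` option shown above.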
Every time I run a script, a time stamp like the following is either added to the 'history' in the global attributes of the output file (if I'm dealing with a self-describing format like netCDF) or to a '.met' file in the case of a figure or an output that isn't self-describing. This is generated as follows:
The only thing I'm not sure about with this solution is how to get python to tell me which installation of python I used on my machine (i.e. instead of just writing 'python' in the time stamp, I'd like it to say if I used |
I just found the script I used to call at the top of all my bash jobs. I got out of the habit of using this - should start again...
#!/bin/bash
# Report the commit hash and clean/dirty state of each repository given as an argument.
# Uses bash because of the [[ ... ]] tests below.
echo "Checking git repositories"
echo `alias`
for dir in "$@"
do
    # Is the branch ahead of or behind its upstream, according to `git status`?
    branch_sync=`git --git-dir "$dir/.git" --work-tree="$dir" status | grep -P -o "(?<=# Your branch is )(ahead|behind).*by [0-9]* (commits|commit)"`
    # Abbreviated hash of the most recent commit
    hash=`git --git-dir "$dir/.git" log --pretty=format:'%h' -n 1`
    if [[ `git --git-dir "$dir/.git" --work-tree="$dir" diff HEAD --cached --abbrev=40 --full-index --raw` != "" ]];
    then
        echo "$dir is dirty, changes staged for commit";
    elif [[ `git --git-dir "$dir/.git" --work-tree="$dir" diff` != "" ]];
    then
        echo "$dir is dirty, changes not staged for commit";
    elif [[ `git --git-dir "$dir/.git" --work-tree="$dir" ls-files --other --exclude-standard` != "" ]];
    then
        echo "$dir is dirty, untracked files present";
    else
        echo "--> $dir - $hash - CLEAN"
    fi
    if [[ $branch_sync != "" ]];
    then
        echo " ($branch_sync)"
    fi
    echo ""
done
It's invoked with a list of repositories (in my case, they all live in Output looks like this:
|
@DamienIrving I like the idea of doing this for python scripts - might copy a bit of your code for myself! The path to the current python executable is in sys.executable, and the python version is in sys.version - hope that helps! |
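To make that concrete, here is a small sketch of a history entry that records the interpreter path and version alongside the command line. The exact format of the original time stamp was lost from this thread, so the layout and the name `provenance_record` are assumptions.

```python
import datetime
import sys


def provenance_record(argv=None, commit=None):
    """Build a one-line history entry describing the current run.

    argv defaults to the real command line; commit is an optional
    git hash obtained elsewhere.
    """
    argv = sys.argv if argv is None else argv
    timestamp = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    python_version = sys.version.split()[0]  # e.g. "3.11.4"
    entry = "{}: {} {} (Python {})".format(
        timestamp, sys.executable, " ".join(argv), python_version
    )
    if commit is not None:
        entry += " (git commit {})".format(commit)
    return entry


if __name__ == "__main__":
    print(provenance_record())
```

The resulting string can be appended to a netCDF `history` attribute or written to a sidecar file, as described earlier in the thread.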
My recommendation would be to include the git revision number in the executable or distribution during a build step, through a template file. So I would teach this when teaching scons/cmake/setuptools. Actually changing the source-files themselves on every commit has turned out to cause merge headaches, a quick bit of research suggests, so isn't recommended any more, now that people are branching and merging more casually. |
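A build step along these lines might look like the sketch below. This is only an illustration under assumptions: the generated module, its `__revision__` variable, and the function names are hypothetical, and a real project would hook `write_version_module` into its scons/cmake/setuptools configuration. The generated file should be left out of version control so that it never dirties the repository.

```python
import string
import subprocess

# Template for the generated module; $revision is filled in at build time.
VERSION_TEMPLATE = string.Template(
    "# Generated at build time; do not edit or commit.\n"
    '__revision__ = "$revision"\n'
)


def current_revision(repo_dir="."):
    """Ask git for a description of the current commit, or 'unknown'."""
    try:
        output = subprocess.check_output(
            ["git", "describe", "--always", "--dirty"],
            cwd=repo_dir,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return output.decode().strip()


def render_version_module(revision):
    """Return the text of the generated module for a given revision."""
    return VERSION_TEMPLATE.substitute(revision=revision)


def write_version_module(path, repo_dir="."):
    """Write the generated module to path (called from the build script)."""
    with open(path, "w") as handle:
        handle.write(render_version_module(current_revision(repo_dir)))
```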
|
From http://stackoverflow.com/questions/7016300/git-revision-number-in-source-code-documentation
The answer referenced above is http://stackoverflow.com/questions/645008/what-are-the-basic-clearcase-concepts-every-developer-should-know/645424#645424
|
It's not at all clear to me why it's useful to have a revision count inside a source file. The revision count is just a bad headache from Version control should be cleanly separated from code imho - let The only exception is the version number for the whole program, which is not the same as the sub-minor version revision. Users should only be seeing tagged releases, not using development code. I think it's especially inadvisable to use a git hook or script to insert the metadata. The hook or script is very unlikely to follow the code everywhere it goes, which will lead to the metadata in the code becoming disconnected from the true provenance. It's easy to see what modifications were made to a file with |
Here is my simple solution for recording the provenance of my data analyses in R. I think it could be included in an R bootcamp as long as literate programming with knitr is also covered (though I'm not sure it will work for Windows). I organize my project so that I have separate subdirectories for the code and data, which are separate git directories. I run all my scripts from within the code subdirectory. At the top of each R Markdown file, I have the following lines:
This prints the commit hash for the code and the data at the top of each analysis, which is similar to what others have suggested above. I post the rendered html of the analysis to my electronic science notebook and/or distribute it to collaborators. And if I or someone else wants an old version of a figure created, I can easily roll back both repositories. Also, I report the version of R and any loaded packages used in the session at the end of each file using This doesn't solve the more complicated problems of determining if the repo is clean or finding the path to the script's git directory if it isn't already known, but it is a start that should be accessible to learners after two days of learning R, shell, and git. |
+1 for @jdblischak's solution. The thing that is useful is stamping the report with the revision, not stamping the code with the revision. |
Is the scope of this discussion just git-related provenance tracking? In terms of teaching provenance in a swcarpentry setting, I would also just try to give an overview of the concept. This discussion, while very informative, seems fairly narrow in the broader scope of tracking provenance. I imagine that in several situations people's workflows have components that are not easily under their control or not in git. Potentially a good reference for an overview and some exposure to the broader issue of provenance is this http://www.w3.org/TR/2013/NOTE-prov-primer-20130430/#intuitive-overview-of-prov |
At https://github.com/ged-lab/khmer I use https://github.com/warner/python-versioneer + git tags to report an accurate version, either "v1.0" for a release or "v0.8-105-g259a2a5" for the 105th commit after v0.8, with the abbreviated hash 259a2a5 (the leading "g" stands for git) |
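Version strings of this shape follow the `git describe` layout of tag, commit count, and `g` plus abbreviated hash. As a small sketch (the function name and the returned keys are made up, and this is not how python-versioneer itself is implemented), such a string can be split back into its parts:

```python
import re

# <tag>-<commits since tag>-g<abbreviated hash>
DESCRIBE_PATTERN = re.compile(
    r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<hash>[0-9a-f]+)$"
)


def parse_describe(version):
    """Split a `git describe`-style string into tag, commit count, and hash.

    A string without the count and hash suffix is treated as an exact tag.
    """
    match = DESCRIBE_PATTERN.match(version)
    if match is None:
        return {"tag": version, "commits_since_tag": 0, "hash": None}
    return {
        "tag": match.group("tag"),
        "commits_since_tag": int(match.group("count")),
        "hash": match.group("hash"),
    }
```

For example, `parse_describe("v0.8-105-g259a2a5")` gives the tag `v0.8`, a count of 105, and the hash `259a2a5`, while `parse_describe("v1.0")` reports an exact release.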
@rbeagrie Thanks! That makes for a pretty nice solution in Python...
|
On 10 Apr 2014 16:50, "Kyle Cranmer" notifications@github.com wrote:
I'm glad someone found it useful! Here are some slides I have about provenance: http://www.slideshare.net/soilandreyes/20130321-what-can-provenance-do-for-me (see the pptx version for animations). See http://practicalprovenance.wordpress.com for more. Agreed, showing the general principles of Provenance is a good thing. Tools |
It used to be easy: when we taught version control with Subversion, we told people that if they put:
in a file, and set the file's properties correctly, Subversion would automatically update that string every time the file was changed so that it read:
(or whatever the revision number was). This worked in pretty much any text file, so they could get the version control system to keep track of files' provenance for them. In particular, you could do this in a program (I'll show it in Python, but it works in any language):
But now we're using Git, and that doesn't work, because Git identifies files using hashes of their contents, and if you modify a string in a file, its hash changes, and if that happens during a commit, it can rupture the spacetime continuum. @jiffyclub wrote a blog post a while back about a workaround, but it's Python-specific, and a bit clumsy compared to the old SVN way of doing things. What can/should we teach people about using the version control system to do these kinds of things?
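To illustrate why content hashing rules out SVN-style substitution, the sketch below computes the identifier Git assigns to a file's contents (a "blob"). It assumes the default SHA-1 object format; `git_blob_hash` is a made-up name. Any edit to the file, including writing a revision string into it, yields a different identifier.

```python
import hashlib


def git_blob_hash(content):
    """Return the SHA-1 id Git assigns to a blob with these bytes.

    Git hashes the header "blob <size>" plus a NUL byte, followed by
    the file contents.
    """
    header = "blob {}\0".format(len(content)).encode("ascii")
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    before = b'version = "unknown"\n'
    after = b'version = "some revision"\n'
    print(git_blob_hash(before))
    print(git_blob_hash(after))  # differs: the stamp changed the content
```

Commit hashes are in turn derived from the blobs they contain, so a file cannot hold the hash of the commit that includes it.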