
Emphasize provenance capture #294

Merged: 8 commits merged into master on Dec 10, 2019
Conversation

@adswa (Contributor) commented Dec 5, 2019

Hey @pvavra, here's a sketch of my take on fixing #293. I introduce datalad download-url in the second chapter, and elaborate on its provenance capture benefits in the chapter on collaboration, in the context of git annex whereis being introduced. I think it fits well into the common theme of "let's start with a sensible thought process and then improve upon it", without sacrificing the Basics chapter structure of "local version control", "reproducible execution", "git annex", "collaboration", "configuration", "data analysis principles" -- yet it still shows the command early on and much more prominently. I think this solution comes close to the end-of-chapter section you preferred, and I'd appreciate your take on it. :)
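To make the contrast concrete, here is a minimal sketch of what I mean (the URL and file name are hypothetical placeholders):

```bash
# download and save in one step; the source URL is recorded along with the file
datalad download-url -m "add input data" https://example.com/iris.csv

# git-annex now knows the recorded URL as one location of the file's content
git annex whereis iris.csv
# the output should list a "web" remote carrying the original URL
```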

I agree that overwhelming readers should be avoided. However, not teaching "bad practices" (like failing to capture the origin of the data during the populating step) is, IMHO, an even more important aspect to consider.

Just because I stumbled across this again, here are three quick things I want to note (not as a rebuttal, but to capture my thoughts after thinking about it):

  • provenance capture is not relevant for absolutely every use case (DataLad can be used as just a powerful local version control system instead of a full data management multitool).
  • The first chapter serves to teach "local version control"; later chapters can much better elaborate and demonstrate how provenance capture is beneficial once the basics of DataLad are known. (Therefore, it is infeasible, IMHO, to replace the initial wget -- it is indispensable for having something to explain datalad save on.)
  • The Basics part of the book is more course-like and should be a learning experience, which IMHO works very well if readers are faced with sensible first attempts and can subsequently realize the added benefit of more advanced commands (e.g., the chapter on datalad run is written entirely like this; see the sketch after this list). I think adding the download-url command here fits in well. The use case part of the book, however, shows fully thought-through workflows, e.g., on provenance capture.
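As an illustration of that "first attempt, then improvement" progression, a minimal sketch along the lines of the datalad run chapter (script path and messages are hypothetical):

```bash
# sensible first attempt: run the analysis, then save the result manually
python code/plot.py
datalad save -m "add figure"

# improvement: let DataLad execute and record the command itself, so the
# result carries machine-readable provenance in its commit
datalad run -m "plot the results" "python code/plot.py"

# a recorded command can later be re-executed verbatim
datalad rerun
```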

Todo

  • add to summary
  • general language tweaks

@adswa changed the title from "Test" to "Earlier introduction of datalad download-url" on Dec 5, 2019
@pvavra commented Dec 5, 2019

I like it.

Overall, I agree with the "let's start with a sensible thought process and then improve upon it" principle.

However, let me offer a small rebuttal on two of your thoughts ;)
(It does not change anything about this pull request, so maybe we should have this conversation somewhere else...)

provenance capture is not relevant for absolutely every use case (DataLad can be used as just a powerful local version control system instead of a full data management multitool).

This contradicts the "make each program do one thing well" principle (unix philosophy), which, I think, is implicitly part of datalad by building upon git and git-annex. I guess the question is whether one sees datalad as an "easy git annex wrapper" or as "the only way to do provenance tracking properly". Personally, I tend to think of it as the latter, because git and git-annex take care of version control.

This brings me to the second point:

The first chapter serves to teach "local version control"; later chapters can much better elaborate and demonstrate how provenance capture is beneficial once the basics of DataLad are known.

Based on the above line of reasoning, I would say that "local version control" should be replaced with "local provenance control". If one is interested in only tracking versions of files (large or small), one does not need datalad (albeit it is more convenient).

The "populate" chapter could be restructured to introduce two different ways of adding files:

  1. typing them (i.e., coming from "outside the computer") --> introduce save
  2. adding files from somewhere else (and then logging their source) --> introduce download-url

and, as a logical progression, the next chapter on installing datasets can then be framed as:

  3. adding pre-existing datalad-datasets

Note that for step 2, one could start with wget and then explain why download-url is superior (i.e., provenance); see the sketch below.
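A minimal sketch of that step-2 progression (URL and file name hypothetical):

```bash
# plain download: the file ends up in the dataset, but its origin is
# recorded nowhere except in your memory
wget https://example.com/data.csv
datalad save -m "add data.csv"

# download-url: one step, and the source URL is stored as a retrievable
# location of the file's content
datalad download-url -m "add data.csv from its original source" \
  https://example.com/data.csv
```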

Based on my somewhat limited experience in trying to explain the "why" of version control to people who are not used to tracking anything (beyond some changes in Word, maybe), I guess that convincing people (a) that version control is a sensible thing to do, or (b) to skip the "pure" version control step and go directly for full provenance capture, will not be much different. The jump in practices and thought patterns from "nothing" to "version control" is substantially larger than the difference between "version control" and "provenance capture". In addition, the version control step is to me only a means to an end, namely provenance capture.

In a sense, I'm saying that teaching these concepts does not have to follow their developmental chronology (git -> git-annex -> datalad). Instead, teaching should follow their logical structure. And here, I think that provenance capture should be the start, as version control is a tool for achieving that. But again, this depends on my preceding point that I view datalad's purpose to be provenance capture, not a "fancy" git-annex wrapper.

@mih (Collaborator) commented Dec 7, 2019

Thanks a lot for your thoughts @pvavra ! I deeply appreciate this kind of conceptual reflection!

I will shed a little more light on the development process of both DataLad itself and the handbook.

(It does not change anything about this pull request, so maybe we should have this conversation somewhere else...)

I cannot think of a better place right now, so let's keep it here for now.

provenance capture is not relevant for absolutely every use case (DataLad can be used as just a powerful local version control system instead of a full data management multitool).

This contradicts the "make each program do one thing well" principle (unix philosophy), which, I think, is implicitly part of datalad by building upon git and git-annex. I guess the question is whether one sees datalad as an "easy git annex wrapper" or as "the only way to do provenance tracking properly". Personally, I tend to think of it as the latter, because git and git-annex take care of version control.

Being responsible for some of the status quo of datalad, I can confirm that it is riding on the edge of feature creep. I cannot say that I want it to just "do one thing well", simply because I lack the creativity to see how we could do just one thing and still achieve the desired functionality (which is a one-stop shop for research data management).

DataLad is in a consolidation period that is trying to shrink the API and define desirable workflows more concretely. Particular functionality is removed or moved into extensions in order to get closer to "one piece one function". The handbook is a reflection of the conceptual side and current state of this effort.

The first chapter serves to teach "local version control"; later chapters can much better elaborate and demonstrate how provenance capture is beneficial once the basics of DataLad are known.

Based on the above line of reasoning, I would say that "local version control" should be replaced with "local provenance control". If one is interested in only tracking versions of files (large or small), one does not need datalad (albeit it is more convenient).

Ignoring the specifics of your proposal (for now), let me say this: It is a radical approach. I like it. I like radical ;-)

Why is it radical? It is because the community we are operating in largely believes that "first you learn Git". We are trying hard to avoid that (within the limits of what current datalad has to offer). We are trying to avoid that, because I have personally seen this approach fail again and again when applied to an audience that isn't really interested in "learning version control", because they don't acknowledge the advantages or even the need for it, and because their exposure time to DataLad and the workflows that it enables is too short to make the difference.

The idea is to start with something that is the smallest possible step (in terms of impact on existing workflows, and conceptual complexity) that teleports people into the datalad universe and sets them up for a manageable and incremental learning experience when they are ready for it -- without having to introduce substantial post-hoc changes to their data structures.

Going straight for local provenance capture would indeed take the shortest path to the holy grail of features (reproducible computation), but it is conceptually much more complex than communicating: at the start, run datalad create, and then, whenever you are at a meaningful next step, run datalad save.
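That entire message fits into a few lines (a minimal sketch; the dataset name is a placeholder):

```bash
datalad create my-project    # once, at the very start
cd my-project
# ... create and edit files as you always have ...
datalad save -m "completed a meaningful next step"
```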

Because reproducibility is a juicy reward, our datalad course starts with http://handbook.datalad.org/en/latest/usecases/reproducible-paper.html as a teaser. But I am not convinced that changing the book to start with something like this would be a net positive.

Based on my somewhat limited experience in trying to explain the "why" of version control to people who are not used to tracking anything (beyond some changes in Word, maybe), I guess that convincing people (a) that version control is a sensible thing to do, or (b) to skip the "pure" version control step and go directly for full provenance capture, will not be much different. The jump in practices and thought patterns from "nothing" to "version control" is substantially larger than the difference between "version control" and "provenance capture". In addition, the version control step is to me only a means to an end, namely provenance capture.

I see your point (again "radical" and tempting), but I cannot agree based on what I see happening in reality. I think it is suboptimal to focus on "convincing". What are we trying to convince people of? "use version control"? "use datalad"? "your research must use full provenance capture"? People who are trying out datalad are either forced to, or have already convinced themselves that they need something like it, and "only" need to learn how it works. Convincing the former is extremely hard (and IMHO they are better left to convince themselves over time). Convincing the latter is not necessary.

Instead, we are trying to focus on "practicality", i.e., here is how little you need to understand/change in order to be ready for more, when you are ready for more. In my experience, the majority of people want to do things well (or better, or right, ...), most people are willing to invest some time, some people can actually afford to invest a sufficient amount of time, and few people manage to adopt new technology in full. It is a fact, IMHO, that the surrounding environment (incentives, PI demands, etc.) is not pushing students to use this kind of technology. It is the students who want to adopt better workflows. Our approach tries to acknowledge this discrepancy and make an (admittedly less radical) move in the right direction. Taking this approach is a deliberate choice, of course.

In a sense, I'm saying that teaching these concepts does not have to follow their developmental chronology (git -> git-annex -> datalad). Instead, teaching should follow their logical structure. And here, I think that provenance capture should be the start, as version control is a tool for achieving that. But again, this depends on my preceding point that I view datalad's purpose to be provenance capture, not a "fancy" git-annex wrapper.

I, again, agree and disagree. 100% agreement on the chronology not being important. If you see that being implemented, we have an issue, and it would be good to understand why you feel this way. DataLad enables VCS workflows that are not possible/easy with git/git-annex (dataset nesting, transparent operation on dataset hierarchies). These are key features without which full provenance capture would not be possible. The exposure to git/git-annex in those chapters is intended to be limited to understanding the resulting data structures. The amount of information on git/git-annex in them is informed by empirical observations on how people fail when switching from plain project directories to datalad datasets as the basis of their work. This information should be reduced over time (I very much agree), but datalad is not "done", and we have to acknowledge the realities of its present feature set.
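A minimal sketch of the dataset nesting and hierarchy operations mentioned above (paths and the message are placeholders):

```bash
datalad create parent                  # a top-level dataset
datalad create -d parent parent/sub    # a nested dataset, registered in the parent
datalad save -d parent -r -m "update"  # operate transparently across the hierarchy
```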

@adswa (Contributor, Author) commented Dec 7, 2019

Thanks a lot for your thoughts, @pvavra!

You're raising interesting points. I can see where they come from, and I'll keep them in the back of my head to monitor whether others' independent feedback arrives at similar suggestions or conclusions.
As I see it, one core aspect of this discussion is "what is DataLad". You're arriving at the conclusion that DataLad is "the only way to do provenance tracking properly" instead of an "easy git annex wrapper", and I think that this is certainly one sensible conclusion to arrive at (but one of many¹). And, just to make it clear, I think this is also a conclusion one could arrive at after having read the book (we're not hiding the provenance aspect; there are full chapters on it). I appreciate this perspective a lot, and I think it is beneficial if I emphasize this in the book a bit more, where applicable. Personally, though, I think simply emphasizing existing contents and adding small-ish notes/additions into existing structures (like this PR does) would go a long way, and I'm reluctant to restructure the first chapters around a separate/parallel "hard focus" on provenance. This is based on two things: one, my own learning experience with DataLad, and the other, the experiences we have made so far with teaching the first chapter.

Before I elaborate on what I mean: fundamentally, I think both reasons strongly depend on the starting point in skill set and prior knowledge (about computers in general, Git or VCS in general, FAIR data, and whether there is a concrete use case for DataLad or one just wants to learn how it can be used) an average reader of the Basics would have. One of the declared objectives of the handbook is to be accessible regardless of background and skill level, and for this I draw a lot from my own learning experiences (having started without any knowledge at all -- not knowing anything about VCS, provenance, programming, ...). In this, I'm seconding @mih's note on "practicality". Speaking from my own experience of learning it as a complete novice, there is a lot to unwrap and understand. Personally, it took me a very long time to understand what "provenance" encompasses, and many experiences of my own to see how and why it is useful. Although I was introduced to it very, very early, it took a lot to understand and appreciate it (datalad run was the first command @mih ever taught me, and he showed and told me that everything was captured and recorded so I wouldn't need to take notes -- I continued to frantically take notes, because I had no understanding of what this would mean). I therefore think that the simplicity of datalad create + datalad save is the gentlest possible introduction to this complex tool, with immediate "rewards" and demonstrations of usefulness in a domain that is known to everyone (get/create files, save them while keeping a record), and it's possible to wrap them up in a stand-alone usage scenario (local VCS). Explaining what provenance is on top of that is hard: it takes quite a bit of theoretical background; a number of (git-annex) commands to see how and that it works (e.g., git annex whereis and datalad get -- but before this, datalad drop, and before that, a rudimentary understanding of the difference between files in Git and files in the object tree, to not be completely confusing; see the sketch below); and a concrete use case to understand its usefulness. It adds a dimension in complexity, potentially for complete novices, and a dimension in usage/application. But again, I agree that adding download-url here and giving a sneak peek of what it can do is very beneficial and primes what is to come -- I'm just not convinced that making a very strong point on provenance so early eases understanding.
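For reference, the command chain alluded to above, as a minimal sketch (the file name is a placeholder):

```bash
datalad drop inputs/raw.csv       # remove the local content, keep the record
datalad get inputs/raw.csv        # re-obtain the content from a known source
git annex whereis inputs/raw.csv  # list the locations that hold the content
```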
The second reason is less anecdotal and based on the workshops and lectures/tutorials I have given so far (as @mih says, we start with a fully reproducible paper, then basics of datasets, then reproducible execution, ...). The first chapter is easy to motivate, not complex to teach, and elevates the workflows of attendees from "I have 3 project directories with 2 different copies of the data but the same names, and my code is in 4 copies of the same file in different development phases" to "I have one directory, and my data and files are version controlled (even though I have never heard of Git or VCS before)".
It's fast, reinforcing, and has so far been clear to every single person I have taught. Once this is understood, and there is a general motivation for learning DataLad, it's way easier to improve and extend DataLad skills. In a way, I'd rather build up a base slowly, with high chances of taking most people with me, and then build on it, than try to fit in too much too early and too comprehensively to give everyone a chance to follow.

I will extend this PR with more emphasis on provenance aspects where applicable within the existing structure of the book, for now. I'll certainly keep your take on it in mind, and will seek input from others on it.

¹ Tbh, in the beginning, when writing, I had the strong urge to also dive into the VCS capabilities in datasets and elaborate on how one can work with the history (because, personally, that was my take on what's a major cool feature I wanted to show here), but in the end I exiled this to the end of the book.

@adswa (Contributor, Author) commented Dec 7, 2019

hey @all-contributors, please add @pvavra for ideas and userTesting

@allcontributors (Contributor) commented:

@adswa I've put up a pull request to add @pvavra! 🎉

@adswa changed the title from "Earlier introduction of datalad download-url" to "[WIp] Earlier introduction of datalad download-url" on Dec 9, 2019
@adswa changed the title from "[WIp] Earlier introduction of datalad download-url" to "[WIP] Earlier introduction of datalad download-url" on Dec 9, 2019
@adswa changed the title from "[WIP] Earlier introduction of datalad download-url" to "Emphasize provenance capture" on Dec 10, 2019
@adswa (Contributor, Author) commented Dec 10, 2019

I gave the Basics a full read (possible in a single train ride!) and emphasized or added where appropriate. The conflicts are resolved now, as well. Will merge this. :)

@adswa adswa merged commit bb2b5c7 into master Dec 10, 2019
@adswa adswa deleted the test branch December 10, 2019 09:21
Successfully merging this pull request may close these issues.

When populating a dataset, how to log where data came from when source not a 'dataset'