Emphasize provenance capture #294
Conversation
I like it. Overall, I agree with the "let's start with a sensible thought process and then improve upon it" principle. However, let me offer a small rebuttal on two of your thoughts ;)
This contradicts the "make each program do one thing well" principle (Unix philosophy), which, I think, is implicitly part of datalad. This brings me to the second point:
Based on the above line of reasoning, I would say that "local version control" should be replaced with "local provenance control". If one is interested in only tracking versions of files (large or small), one does not need datalad (although it is more convenient). The "populate" chapter could be restructured to introduce two different ways of adding files — with and without captured provenance (a sketch follows below). As a logical progression, the next chapter on installing datasets can then be framed as […]. Note that for step 2., one could start with […].

Based on my somewhat limited experience in trying to explain the "why" of version control to people who are not used to tracking anything (beyond some changes in Word, maybe), I guess that convincing people that (a) version control is a sensible thing to do, or (b) skipping the "pure" version control step and going directly for full provenance capture will not be much different. The jump in practices and thought-patterns from "nothing" to "version control" is substantially larger than the difference between "version control" and "provenance capture". In addition, the version control step is to me only a means to an end, namely provenance capture.

In a sense, I'm saying that teaching these concepts does not have to follow their developmental chronology (git -> git-annex -> datalad). Instead, teaching should follow their logical structure. And here, I think that provenance capture should be the start, as version control is a tool for achieving it. But again, this depends on my preceding point that I view datalad's purpose to be provenance capture, not a "fancy" git-annex wrapper.
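To make the proposed distinction concrete, here is a minimal sketch of the two flavors of adding a file; the URL and file name are placeholders, and the commands assume the standard datalad CLI options. The first variant only versions the file, the second also captures where it came from:

```bash
# Variant 1: plain local version control -- the file is tracked,
# but nothing about its origin is recorded.
wget https://example.com/data.csv
datalad save -m "add data.csv" data.csv

# Variant 2: local provenance capture -- the origin of the file
# is recorded in the dataset along with its content.
datalad download-url -m "add data.csv from the web" \
    https://example.com/data.csv
```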
Thanks a lot for your thoughts, @pvavra! I deeply appreciate this kind of conceptual reflection! I will shed a little more light on the development process of both DataLad itself and the handbook.
I cannot think of a better place right now, so let's keep it here for now.
Being responsible for some of the status quo of datalad, I can confirm that it is riding on the edge of feature creep. I cannot say that I want it to just "do one thing well", simply because I lack the creativity to see how we could do just one thing and still achieve the desired functionality (which is a one-stop shop for research data management). DataLad is in a consolidation period that is trying to shrink the API and define desirable workflows more concretely. Particular functionality is removed or moved into extensions in order to get closer to "one piece, one function". The handbook is a reflection of the conceptual side and current state of this effort.
Ignoring the specifics of your proposal (for now), let me say this: it is a radical approach. I like it. I like radical ;-) Why is it radical? Because the community we are operating in largely believes that "first you learn Git". We are trying hard to avoid that (within the limits of what current datalad has to offer). We are trying to avoid it because I have personally seen this approach fail again and again when applied to an audience that isn't really interested in "learning version control": they don't acknowledge the advantages or even the need for it, and their exposure time to DataLad and the workflows it enables is too short to make the difference.

The idea is to start with the smallest possible step (in terms of impact on existing workflows, and of conceptual complexity) that teleports people into the datalad universe and sets them up for a manageable and incremental learning experience when they are ready for it -- without having to introduce substantial post-hoc changes to their data structures. Going straight for local provenance capture would indeed take the shortest path to the holy grail of features (reproducible computation), but conceptually it is much more complex than communicating "at the start, run […]".

Because reproducibility is a juicy reward, our datalad course starts with http://handbook.datalad.org/en/latest/usecases/reproducible-paper.html as a teaser. But I am not convinced that changing the book to start with something like this would be a net positive.
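For reference, the "reproducible computation" referred to here is what `datalad run` provides; a minimal sketch, with placeholder script and file names:

```bash
# datalad run executes a command and records it, together with its
# declared inputs and outputs, as a re-executable entry in the
# dataset history.
datalad run -m "compute summary statistics" \
    --input "data/raw.csv" \
    --output "results/summary.csv" \
    "python code/summarize.py data/raw.csv results/summary.csv"

# The captured command can later be re-executed from the history:
datalad rerun
```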
I see your point (again "radical" and tempting), but I cannot agree based on what I see happening in reality. I think it is suboptimal to focus on "convincing". What are we trying to convince people of? "Use version control"? "Use datalad"? "Your research must use full provenance capture"? People who are trying out datalad are either forced to, or have already convinced themselves that they need something like it, and "only" need to learn how it works. Convincing the former is extremely hard (and IMHO they are better left to convince themselves over time). Convincing the latter is not necessary. Instead, we are trying to focus on "practicality", i.e. here is how little you need to understand/change in order to be ready for more, when you are ready for more.

In my experience, the majority of people want to do things well (or better, or right, ...), most people are willing to invest some time, some people can actually afford to invest a sufficient amount of time, and few people manage to adopt new technology in full. It is a fact IMHO that the surrounding environment (incentives, PI demands, etc.) is not pushing students to use this kind of technology. It is the students who want to adopt better workflows. Our approach tries to acknowledge this discrepancy and make an (admittedly less radical) move in the right direction. Taking this approach is a deliberate choice, of course.
I, again, agree and disagree. 100% agreement on the chronology not being important. If you see that being implemented in the book, we have an issue, and it would be good to understand why you feel this way. DataLad enables VCS workflows that are not possible/easy with git/git-annex (dataset nesting, transparent operation on dataset hierarchies). These are key features without which full provenance capture would not be possible. The exposure to git/git-annex in those chapters is intended to be limited to understanding the resulting data structures. The amount of information on git/git-annex in them is informed by empirical observations of how people fail when switching from plain project directories to datalad datasets as the basis of their work. This information should be reduced over time (I very much agree), but datalad is not "done", and we have to acknowledge the realities of its present feature set.
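A minimal sketch of the nesting capability mentioned above, with placeholder dataset names (the commands follow the datalad CLI; `-d .` registers the new dataset as a subdataset of the one in the current directory):

```bash
# Create a superdataset and a subdataset registered within it.
datalad create my-analysis
cd my-analysis
datalad create -d . inputs/raw-data

# Operate transparently on the whole dataset hierarchy.
datalad save -m "snapshot the full hierarchy" --recursive
datalad status --recursive
```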
Thanks a lot for your thoughts, @pvavra! You're raising interesting points. I can see where they come from, and I'll keep them in the back of my head to monitor whether others' independent feedback arrives at similar suggestions or conclusions. Before I elaborate on what I mean: fundamentally, I think both reasons strongly depend on the starting point in skill set and prior knowledge an average reader of the Basics would have (general computer knowledge, Git or VCS in general, FAIR data, and whether there is a concrete use case for DataLad or whether one just wants to learn how it can be used). One of the declared objectives of the handbook is to be accessible regardless of background and skill level, and for this I draw a lot on my own learning experience (having started without any knowledge at all - not knowing anything about VCS, provenance, programming, ...). In this, I'm seconding @mih's note on "practicality".

Speaking from my own experience of learning it as a complete novice, there is a lot to unwrap and understand. Personally, it took me a very long time to understand what "provenance" encompasses, and many experiences of my own to see how and why it is useful. Although I was introduced to it very, very early, it took a lot to understand and appreciate it.¹

I will extend this PR with more emphasis on provenance aspects where applicable within the existing structure of the book, for now. I will certainly keep your take on it in mind, and will seek input from others on it.

¹ Tbh, in the beginning, when writing, I had the strong urge to also dive into the VCS capabilities in datasets and elaborate on how one can work with the history (because, personally, that was my take on what's a major cool feature I want to show here), but in the end I exiled this to the end of the book.
hey @all-contributors, please add @pvavra for ideas and userTesting
I've put up a pull request to add @pvavra! 🎉
I gave the Basics a full read (possible in a single train ride!) and emphasized or added where appropriate. The conflicts are resolved now, as well. Will merge this. :) |
Hey @pvavra, here's a sketch for my take to fix #293. I'm introducing the `datalad download-url` command in the second chapter, and elaborate on its provenance capture benefits in the chapter on collaboration, in the context of `git annex whereis` being introduced. I think it fits in well with the common theme of "let's start with a sensible thought process and then improve upon it", without sacrificing the Basics chapter structure ("local version control", "reproducible execution", "git annex", "collaboration", "configuration", "data analysis principles"), yet it still shows the command early on and much more prominently. I think this solution comes close to the end-of-chapter section you preferred, and I'd appreciate your take on this. :)

Just because I stumbled across this again, maybe three quick things I want to note (not as rebuttal, but to capture my thoughts after thinking about it):

- […] `wget` […] this is indispensable to have something to explain `datalad save` on.)
- […] `datalad run` is written entirely like this), and I think adding the `download-url` command here fits in well. The usecase part of the book, however, shows fully thought-through workflows, e.g. on provenance capture.