Replies: 5 comments
-
A couple of thoughts:
-
I can't see how the failure of ML teams to embrace best practices is DVC's issue. You must manage code and data. Period. DVC has literally made this as easy as possible (given their treatment as separate but associated 'things'). I just can't see how it could get simpler.
-
I agree if we are talking about the entire process. But in some specific cases, we can simplify the experience.
-
It might be related to the data/remote types feature #4040, since a self-contained DVC-file is kind of its own data/remote type.
-
Motivation
Right now DVC works only within Git repositories, which forces users to follow best engineering practices. As we know, some data scientists are not ready to invest resources in a formal code versioning methodology and to use Git properly. This limits them not only in data versioning but also cuts them off from other huge DVC benefits, such as data transfer.
DVC could provide a holistic way of codifying and transferring data without any connection to Git or other source code versioning systems.
References
The proposal is based on the idea of self-contained DVC-files that we discussed with some DVC core team members.
This approach has some similarities with dvc-metrics without DVC. It is a requirement from CML-like scenarios: #4446 (comment)
Feedback from users' discussions. Some DVC users mentioned that their team members are not fluent with Git, which prevents the team from fully moving to DVC. Other users say that the duality of DVC and Git makes the tool a bit too complicated: commands are duplicated (git hooks are a good workaround but not a complete solution) and the concept of versioning becomes harder to follow.
This is related to our discussions about datasets and storage improvements - #1487. Some of the functionality from the dataset discussion was not implemented because we found that Git offers a better and more holistic approach. However, that conclusion rests on the assumption that users are comfortable using Git. We challenge this assumption in the context of some ML teams that are not ready to embrace the best engineering practices.
Experiment logging tools offer simple model versioning as a Python API. While general feedback about this method is mixed and usage is low (compared to the metrics logging functionality of these tools), some teams like this simple approach to ML model versioning.
Vision
Does it mean that DVC Lite can replace DVC? Absolutely not. DVC Lite is a lightweight workaround for a few pinpoint problems. It should be a good fit for ML teams that are not ready to fully embrace the best engineering practices but still need basic data and model versioning functionality. As their ML processes mature, more and more teams will move to the proper Git-based flow for model and data versioning. That flow has great benefits and strong support on the infrastructure side: GitHub/GitLab/Bitbucket with all their features.
Ideas about the implementation
The data transfer part can be borrowed wholesale from core DVC. However, we need to come up with a metafile format that describes all the information needed for data transfer: file names, versions, the data cache dir (if it exists), and data remotes.
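One possible shape for such a self-contained metafile, sketched as YAML (every field name below is a hypothetical assumption for illustration, not an existing DVC schema):

```yaml
# Hypothetical self-contained metafile (e.g. segm/model.h5.dvc).
# All field names are illustrative, not part of DVC's actual format.
path: segm/model.h5
versions:            # newest last; each entry is one tracked revision
  - md5: 3863d0e317dee0a55c4e59d2ec0eef33
    tag: v1
  - md5: a304afb96060aad90176268345e10355
    tag: v2
cache: .dvc/cache                  # local data cache dir, if it exists
remote: s3://mybucket/dvc-storage  # where versions can be pushed/pulled
```

Keeping the whole version list inside the metafile is what would remove the need for Git history.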
The codification (metafiles) part can be implemented with DVC-file-like metafiles (ideally compatible with the original DVC files). The absence of Git requires introducing some new concepts for tracking and navigating among versions. Some ideas: `production` and `staging` models, as in MLflow.
Examples
Regarding the API: it might be a separate set of commands under `dvclite`, or it might be hidden under the `dvc` umbrella.
Example 1: Basic model versioning
A user simply does not use Git but still needs to track file versions and transfer files between machines or storages.
When a file changes:
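The step could look like the following hypothetical CLI call (`dvclite` and its subcommands are assumptions for illustration, not implemented commands):

```
# Hypothetical -- neither `dvclite` nor this behavior exists today.
$ dvclite add segm/model.h5   # hash the file and append a new version entry
```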
The metafile (`segm/model.h5.dvc`) should contain all the versions.
Get the old model:
Get back to the most recent one:
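The mechanics of such a multi-version metafile can be sketched in Python (a minimal illustration under the assumptions above, not DVC's implementation; the class and method names are made up):

```python
import hashlib
import json
import os
import shutil

def file_md5(path):
    """Hash a file's contents in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

class LiteRepo:
    """Hypothetical tracker: one self-contained metafile per tracked file."""

    def __init__(self, cache_dir):
        self.cache = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def meta_path(self, path):
        return path + ".dvc"

    def add(self, path):
        # Hash the file, copy it into the cache, append a version entry.
        md5 = file_md5(path)
        shutil.copy(path, os.path.join(self.cache, md5))
        meta = {"path": path, "versions": []}
        if os.path.exists(self.meta_path(path)):
            with open(self.meta_path(path)) as f:
                meta = json.load(f)
        if md5 not in meta["versions"]:
            meta["versions"].append(md5)
        with open(self.meta_path(path), "w") as f:
            json.dump(meta, f)
        return md5

    def checkout(self, path, md5=None):
        # Restore a given version; default is the most recent one.
        with open(self.meta_path(path)) as f:
            meta = json.load(f)
        md5 = md5 or meta["versions"][-1]
        shutil.copy(os.path.join(self.cache, md5), path)
```

Here `checkout(path, old_md5)` corresponds to getting the old model, and a bare `checkout(path)` to getting back the most recent one.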
Remote in metafiles:
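A remote reference embedded in the metafile might look like this (a hypothetical fragment, not an existing schema):

```
# Hypothetical fragment of segm/model.h5.dvc
remote: s3://mybucket/dvc-storage
```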
Data transferring:
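The transfer step could mirror today's `dvc push`/`dvc pull`, reading the remote URL straight from the metafile; the commands below are hypothetical illustrations, not implemented DVC commands:

```
# Hypothetical commands -- names and behavior are assumptions.
$ dvclite push segm/model.h5.dvc   # upload all cached versions to the remote
# On another machine, with only the metafile at hand:
$ dvclite pull segm/model.h5.dvc   # fetch and restore the latest version
```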
Example 2: Cache in the cloud (no local cache)
Save all versions in the data remote by default.
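One way to express this "cache in the cloud" mode in the metafile (field names are assumptions for illustration):

```
# Hypothetical: no local cache; every added version goes straight to the remote.
cache: null
remote: s3://mybucket/dvc-storage
push_on_add: true   # assumed flag: upload each new version immediately
```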
RFC
Please provide your comments and feedback. Any comments and suggestions are welcome.
@iterative/engineering