Proposal for managing test baseline images using data version control (dvc) #5724
Comments
Based on the discussion at the last community meeting, I will start the migration of the baseline images to dvc using DAGsHub for storage.
Cool, let me know if you need any help 😀 Just a note on storage limits. According to https://dagshub.com/plans, DAGsHub provides up to 10 GB of free space. So I think <200 MB from GMT is ok for now (PyGMT probably has <15 MB on DAGsHub), but just something to keep in mind when uploading those large PS and video files.
Good, thanks.
As an update, I have learned that dvc works best by tracking directories rather than individual files when large numbers of files need to be added (we currently have 779 .ps files). I am going to try out restructuring the tests so that rather than having .ps files paired with the .sh files in
Since the DAGsHub interface supports viewing PNG files, I am also going to research whether using .png files rather than .ps files will impact performance.
Does tracking a directory mean computing a hash for the whole directory? A bit concerned about what this means if different people are trying to modify different PS files on multiple branches. Edit: Looking at https://dvc.org/doc/command-reference/add#example-directory, it seems that running
@PaulWessel, for PyGMT we bundle up the test images at release time and include that as an asset for the GitHub and Zenodo releases. Do you think this is desirable for GMT as well?
Benefits
Downsides
There are 104 MB of files in
Yes. Not to go in the source tarballs, bundles, or Windows installers. Just zip that up when we do a release and archive it somewhere.
Right, backups are never a bad idea. The PS files should compress significantly.
I agree, good thing to do for self-preservation.
Tracking directories has caused a lot of trouble for us recently. For example, all the PS files of the 52 examples are DVC-tracked in a single DVC file (i.e.,
The troubles are:
So, tracking directories is not a good choice for us.
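One consequence of directory tracking, raised earlier in this thread as a concern, is that edits to different images on different branches all rewrite the same single pointer. The sketch below is a simplified model of that behavior, not DVC's actual implementation: the manifest layout, the helper names, and the file names are assumptions made for illustration.

```python
import hashlib
import json
from typing import Dict, Set


def file_md5(content: bytes) -> str:
    """Hex md5 digest of one file's content."""
    return hashlib.md5(content).hexdigest()


def directory_pointer(files: Dict[str, bytes]) -> str:
    """One hash for the whole directory: md5 of a sorted manifest of file hashes."""
    manifest = [
        {"md5": file_md5(data), "relpath": name}
        for name, data in sorted(files.items())
    ]
    return hashlib.md5(json.dumps(manifest, sort_keys=True).encode()).hexdigest()


def changed_files(base: Dict[str, bytes], new: Dict[str, bytes]) -> Set[str]:
    """Names whose per-file hash differs between two versions."""
    return {name for name in base if file_md5(base[name]) != file_md5(new[name])}


# Hypothetical example images and two branches that edit different files.
base = {"ex01.ps": b"plot 1", "ex02.ps": b"plot 2", "ex03.ps": b"plot 3"}
branch_a = {**base, "ex01.ps": b"plot 1 (updated)"}
branch_b = {**base, "ex03.ps": b"plot 3 (updated)"}

# Directory tracking: base and both branches each yield a different single pointer,
# so the two branches modify the same pointer in incompatible ways.
directory_hashes = {directory_pointer(v) for v in (base, branch_a, branch_b)}
both_branches_touch_same_pointer = len(directory_hashes) == 3

# Per-file tracking: each branch touches only the pointer of the file it edited.
changed_a = changed_files(base, branch_a)
changed_b = changed_files(base, branch_b)
overlap = changed_a & changed_b

print("directory pointer differs on every version:", both_branches_touch_same_pointer)
print("per-file pointers changed on branch A:", sorted(changed_a))
print("per-file pointers changed on branch B:", sorted(changed_b))
print("pointers changed on both branches:", sorted(overlap))
```

Under this model, per-file pointers keep unrelated image updates from colliding in git, at the cost of one small pointer file per image.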
As the .dvc files are so small and we can always purge images from the DAGsHub repo, the only real risk here seems to be the amount of time it would take to try this and go back if necessary. I would guess it would take a couple of hours of work to go from the current structure to tracking individual files, and likely about the same to go back if it turns out to be more of a headache. Seems worth trying IMO given the recent frustrations.
Proposal for managing test baseline images using data version control (dvc)
This issue proposes a solution to #3470 and a partial solution to #2681 by using data version control to manage the baseline images for testing. @weiji14 led an effort to move PyGMT's tests from git version control to data version control with remotes stored on DAGsHub in GenericMappingTools/pygmt#1036; most of the information here is from Wei Ji's posts for PyGMT (thanks! 🙏 🎉 ).
Motivation for migrating baseline images to dvc
Here's the current breakdown for the GMT repository:
.git: ~1.1 GB (up from ~720 MB on Feb. 06 2020)
test: ~115 MB (101 MB from PS files) (up from ~113 MB on Feb. 06 2020)
doc: ~68 MB (51 MB from PS files; 33 MB from PS in doc/examples; 18 MB from PS in doc/scripts) (down from ~70 MB on Feb. 06 2020)
share: ~13.5 MB
src: ~16 MB
The fact that the overall repository size increased by 50% over the past 1.5 years while individual directories have remained the same size supports past developer comments that the repository growth rate due to rewriting PS files is unsustainable.
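The growth figure can be checked with a few lines of arithmetic. This sketch uses only the sizes quoted above and assumes 1 GB = 1000 MB; the variable names are illustrative.

```python
# Sizes quoted above, in MB (assumption: 1 GB = 1000 MB).
git_then_mb = 720.0   # .git on Feb. 06 2020
git_now_mb = 1100.0   # .git at the time of this issue

# Relative growth of the .git directory.
growth_pct = (git_now_mb - git_then_mb) / git_then_mb * 100

# Working-tree directories listed above, in MB.
tree_mb = {"test": 115.0, "doc": 68.0, "share": 13.5, "src": 16.0}
tree_total_mb = sum(tree_mb.values())

print(f".git growth: {growth_pct:.0f}%")
print(f"listed directories: {tree_total_mb:.1f} MB")
print(f".git is {git_now_mb / tree_total_mb:.1f}x the listed directories")
```

The history is roughly five times larger than the listed working-tree directories, which is consistent with the size coming from repeatedly rewritten files rather than from the current content.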
What is data version control
Data version control (dvc) is an open source tool for managing and versioning datasets and models. It works alongside Git and uses very similar command syntax. Rather than storing bulky images in the repository, the repository stores small .dvc files that contain metadata, including the md5 hash of each data file. This allows versioning of data files that are stored in a remote location. Options for remote storage include Amazon S3, Google Cloud Storage, Azure, an SSH server, and DAGsHub (PyGMT uses DAGsHub).
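To make the pointer idea concrete, here is a minimal sketch of how a small text record carrying an md5 hash can stand in for a large file. It is an illustration only: the record layout approximates what a .dvc file holds, the helper names and the example file are hypothetical, and dvc's own hashing may differ in details.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pointer_text(path: Path) -> str:
    """Build a small text record describing the file (illustrative layout)."""
    return (
        "outs:\n"
        f"- md5: {md5_of_file(path)}\n"
        f"  size: {path.stat().st_size}\n"
        f"  path: {path.name}\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "example.ps"                 # hypothetical baseline image
    image.write_bytes(b"%!PS-Adobe-3.0\n" * 50_000)  # stand-in for a bulky PS file
    pointer = pointer_text(image)
    image_size = image.stat().st_size

print(pointer)
print(f"image: {image_size} bytes, pointer: {len(pointer)} bytes")
```

Only the short record would be committed to git; the image itself would live in the remote storage and be looked up by its hash.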
Steps required
(Based on PyGMT, may need some updating)
Initial setup (only needs to be done once for the repository)
Installing DVC for developing GMT
BUILDING.md.
Initialize dvc
Setup DVC remote
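As a rough sketch of the one-time setup, the helper below only builds and prints the commands rather than running them. The remote URL is a placeholder (the actual DAGsHub address for GMT is not given here), and the remote name and commit message are assumptions.

```python
import shlex
from typing import List

# Placeholder remote URL; substitute the real DAGsHub address.
REMOTE_URL = "https://dagshub.com/<owner>/<repo>.dvc"
REMOTE_NAME = "origin"  # assumed remote name


def initial_setup_commands(remote_url: str, remote_name: str = REMOTE_NAME) -> List[List[str]]:
    """Return the one-time commands to initialize dvc and register a default remote."""
    return [
        # Creates the .dvc/ directory and .dvcignore in the git repository.
        ["dvc", "init"],
        # Registers the remote and marks it as the default for push/pull.
        ["dvc", "remote", "add", "--default", remote_name, remote_url],
        # Commits the dvc configuration so every clone uses the same remote.
        ["git", "add", ".dvc/config", ".dvc/.gitignore", ".dvcignore"],
        ["git", "commit", "-m", "Initialize dvc and add default remote"],
    ]


setup = initial_setup_commands(REMOTE_URL)
for command in setup:
    print(shlex.join(command))
```

Printing the commands first makes it easy to review them before anything touches the repository.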
Migrating tests
(based on PyGMT steps, may need updating)
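A possible per-file migration sequence is sketched below, again as commands that are built and printed rather than executed. The image paths and commit message are hypothetical. The sketch assumes the images are currently committed to git, so they are first removed from the git index (the working copies stay in place) before dvc takes over tracking them.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def migration_commands(images: List[str]) -> List[List[str]]:
    """Return the commands to move `images` from git tracking to dvc tracking."""
    pointers = [f"{image}.dvc" for image in images]
    # dvc add writes a .gitignore entry next to each tracked file.
    ignores = sorted({str(PurePosixPath(image).parent / ".gitignore") for image in images})
    return [
        ["git", "rm", "-r", "--cached", *images],  # stop tracking the images in git
        ["dvc", "add", *images],                   # create one .dvc pointer per image
        ["git", "add", *pointers, *ignores],       # track the pointers in git instead
        ["git", "commit", "-m", "Track baseline images with dvc"],
        ["dvc", "push"],                           # upload the images to the remote
    ]


# Hypothetical image paths, for illustration only.
commands = migration_commands(["test/module/example_a.ps", "test/module/example_b.ps"])
for command in commands:
    print(shlex.join(command))
```

Passing each image to `dvc add` individually yields one pointer file per image, which matches the per-file tracking discussed in the comments.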
Pull images from DVC remote (for GitHub Actions CI and local testing)
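Fetching the images could look like the following sketch. The wrapper function and the example path are hypothetical; the only dvc behavior assumed is that `dvc pull` downloads everything tracked and that optional targets restrict the download. By default the helper only returns the command, so it can be inspected without dvc installed.

```python
import shutil
import subprocess
from typing import List, Optional


def pull_images(targets: Optional[List[str]] = None, execute: bool = False) -> List[str]:
    """Build (and optionally run) the command that downloads dvc-tracked images."""
    command = ["dvc", "pull", *(targets or [])]
    if execute:
        if shutil.which("dvc") is None:
            raise RuntimeError("dvc is not installed or not on PATH")
        # check=True raises if the download fails, so a CI job would stop here.
        subprocess.run(command, check=True)
    return command


print(" ".join(pull_images()))
print(" ".join(pull_images(["test/module/example_a.ps.dvc"])))  # hypothetical path
```

A CI workflow would run the equivalent command after checking out the repository and before running the tests.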
What about the images for documentation?
The test directory is currently much larger than the documentation directory, so migrating the tests will be a large first step that does not require an established solution for the documentation images. Regardless, my opinion is that we should host the examples/tutorials/animations in a separate repository (#5364 (comment)).
Are you willing to help implement and maintain this feature? Yes