Perform a "backup" of a nested dataset hierarchy on a crippled-fs harddrive #62

Open
2 of 4 tasks
mih opened this issue May 8, 2023 · 6 comments
Labels
support-tracker Track a support event that occurred elsewhere

Comments

@mih
Contributor

mih commented May 8, 2023

Origin: DataLad office hour chat 2023-05-08

Basically I need to work remotely so trying to clone my entire dataset onto a harddrive and from the harddrive onto my personal laptop. While doing so, I thought it'd be smart to do a "backup" by getting all the content on the harddrive.

While performing a recursive get of a superdataset clone (multiple subdatasets) onto a crippled-FS external harddrive, the user aborted the command and was left with modified dataset clones.

TODO (not necessarily to be performed in this order)

  • Inform OP/Add reference to this issue at origin
  • Clarifying Qs asked or not needed
  • Nature of the issue is understood
  • Inform OP about resolution

Capturing relevant pieces from my reply:

Instead of getting a nested hierarchy of a single version snapshot of your data, it would actually be a full backup (all data, all versions), and it would not suffer from the limitations of your hard-drive file system as much (unverified speculation).

The downside is that it won't look as pretty

But this is our standard solution for collaboration (push/pull) using a location that is not ready for git-annex

if you like papers more than online handbooks: https://doi.org/10.1038/s41597-022-01163-2

Roughly summarizing the difference between what you tried and what this different approach would mean:

  • you will not clone onto the harddrive, and get data onto it
  • instead you will (recursively) create dataset siblings on the (mounted) harddrive, and then push to it

This means you will work exclusively in your main dataset clone.

The resulting "RIA store" on the harddrive can be added to other existing clones as a remote, and they will be able to pull data from it. You would be able to continue to push data (new versions) onto the drive without having to replace/delete anything, until you run out of space (at which point you can detect and clean up versions you no longer need).

RIA stores also support compressed archives -- so your harddrive might last for quite a bit

CAUTION: I am not aware of anyone having actually tried putting a RIA store on an external harddrive with a non-POSIX filesystem. I expect this to work, but there is no hard evidence for this claim.

@mih mih added the support-tracker Track a support event that occurred elsewhere label May 8, 2023
@mih mih changed the title Interruped recursive get on a crippled FS Perform a "backup" of a nested dataset hierarchy on a crippled-fs harddrive May 8, 2023
@mslw
Contributor

mslw commented May 9, 2023

This had a follow up in the office hour chat and the office hour today. Out of multiple subdatasets, most were pushed to the RIA without an issue, but two did not:

Push to 'ria-backup':
CommandError: 'git -c diff.ignoreSubmodules=none annex copy --batch -z --to ria-backup-storage --fast --json --json-error-messages --json-progress -c annex.dotfiles=true' failed with exitcode 1 under /media/(--redacted--) [info keys: stdout_json]
> to ria-backup-storage...
  content changed while it was being sent
  This could have failed because --fast is enabled. [733 times]
git-annex: copy: 733 failed

There was an issue about the same error (datalad/datalad#5613), which was solved by upgrading git-annex (to 10.20220128). Since the current issue was reported using an older version, we need to wait and see if an update solves the problem.
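A minimal sketch of how one might check whether an installed git-annex predates the version mentioned above. It assumes version strings of the form `<major>.<YYYYMMDD>` with an optional suffix, and that `git annex version --raw` prints only the version string; the function names are hypothetical helpers, not part of DataLad or git-annex.

```python
import re
import subprocess

# Version reported in datalad/datalad#5613 as resolving the error.
FIXED_IN = "10.20220128"


def parse_annex_version(text):
    """Turn '10.20220128' (optionally with a suffix) into (10, 20220128)."""
    match = re.match(r"(\d+)\.(\d{8})", text.strip())
    if match is None:
        raise ValueError(f"unrecognized git-annex version: {text!r}")
    return int(match.group(1)), int(match.group(2))


def needs_upgrade(installed, minimum=FIXED_IN):
    """True if the installed version is older than the minimum."""
    return parse_annex_version(installed) < parse_annex_version(minimum)


def installed_annex_version():
    """Ask git-annex for its version (assumes `--raw` is supported)."""
    result = subprocess.run(
        ["git", "annex", "version", "--raw"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On a machine with git-annex installed, `needs_upgrade(installed_annex_version())` would indicate whether an upgrade is worth trying before debugging further.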

@jsheunis
Contributor

jsheunis commented May 10, 2023

IMO there are two issues here:

  1. Perform a "backup" of a nested dataset hierarchy on a crippled-fs harddrive (defined by the issue title). I think @mih's response could be the basis of a KBI on this topic, supported by some code examples and ideally a confirmation that this all works on a non-POSIX external harddrive.
  2. The "content changed while it was being sent" issue, which has a thorough writeup in datalad/datalad#5613 ("git-annex: content changed while it was being sent") and which ideally also solves this user's problem (UPDATE: the user has confirmed that upgrading to the latest version of git-annex solved the problem)

@jsheunis
Contributor

jsheunis commented May 10, 2023

Paraphrased steps followed by the user:

Creating the backup

  1. Remove any tried-but-failed clones (using methods other than RIA siblings) from the external hard-drive
  2. Create the RIA sibling (name: ria-backup, alias: ria-alias) on the external hard-drive (recursively, since the superdataset has nested subdatasets): datalad create-sibling-ria -s ria-backup --alias ria-alias --new-store-ok ria+file:///<path-to-location-on-external-hard-drive> -r
  3. Push content to the RIA sibling recursively: datalad push --to ria-backup -r
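The backup steps above can be sketched as a small script that assembles the two `datalad` calls and, by default, only prints them. The store path `/media/backup-drive/riastore` is a made-up example, the helper names are hypothetical, and the flags are copied from the steps above rather than verified on a non-POSIX drive.

```python
import shlex
import subprocess


def backup_commands(store_path, sibling="ria-backup", alias="ria-alias"):
    """Return the create-sibling-ria and push invocations as argv lists."""
    # "ria+file://" plus an absolute path yields three slashes in total.
    if not store_path.startswith("/"):
        raise ValueError("store_path must be an absolute path")
    store_url = "ria+file://" + store_path
    create = ["datalad", "create-sibling-ria", "-s", sibling,
              "--alias", alias, "--new-store-ok", store_url, "-r"]
    push = ["datalad", "push", "--to", sibling, "-r"]
    return [create, push]


def run_all(commands, dataset_dir, dry_run=True):
    """Print each command; execute it in dataset_dir unless dry_run is set."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, cwd=dataset_dir, check=True)


if __name__ == "__main__":
    # Hypothetical mount point; replace with the real location of the drive.
    run_all(backup_commands("/media/backup-drive/riastore"), ".", dry_run=True)
```

Both commands are meant to run from the root of the main superdataset clone, since the approach works exclusively from that clone.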

Cloning from the backup

  1. Connect the hard-drive to a machine on which to make a clone (e.g. pc, laptop)
  2. Clone from the RIA store: datalad clone ria+file:////riastore#~ria-alias (JSH comment: is this correctly formatted?)
  3. To install subdatasets, get them as per usual: datalad get <relative-location-to-subdataset>
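A sketch of the clone steps, under the assumption (based on the handbook's RIA URL convention, not verified against the store in this issue) that a file-based RIA URL is `ria+file://` followed by the absolute store path, then `#~` and the alias. The paths, target directory, and helper names below are hypothetical.

```python
def ria_clone_url(store_path, alias):
    """Build a file-based RIA URL that addresses a dataset by its alias."""
    if not store_path.startswith("/"):
        raise ValueError("store_path must be an absolute path")
    # Absolute path after "ria+file://" gives three slashes, e.g.
    # ria+file:///media/drive/riastore#~ria-alias
    return f"ria+file://{store_path}#~{alias}"


def clone_commands(store_path, alias, target, subdatasets=()):
    """Return the clone call and one `datalad get` call per subdataset."""
    clone = ["datalad", "clone", ria_clone_url(store_path, alias), target]
    # The get calls are meant to run from inside the new clone (target).
    gets = [["datalad", "get", sub] for sub in subdatasets]
    return clone, gets


if __name__ == "__main__":
    clone, gets = clone_commands(
        "/media/backup-drive/riastore", "ria-alias", "my-dataset",
        subdatasets=["sub-01"],
    )
    print(" ".join(clone))
    for argv in gets:
        print(" ".join(argv))
```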

Fetching updates

  1. If the clone on a pc or laptop grows with commits that need to go back to the data origin via the external hard-drive, they can be pushed to the hard-drive first: datalad push --to ria-backup
  2. Then connect the hard-drive to the original data source, and fetch/merge updates: datalad update --merge
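The round trip above, sketched as an ordered list of commands together with the machine each one runs on. The sibling name `ria-backup` is taken from the steps above; a clone made from the store may know the store under a different sibling name, which `datalad siblings` should reveal. The function name and location labels are hypothetical.

```python
import shlex


def update_round_trip(sibling="ria-backup"):
    """Commands for one round trip, paired with where each one runs."""
    return [
        # Step 1: on the pc/laptop clone, with the hard-drive connected.
        ("laptop clone", ["datalad", "push", "--to", sibling]),
        # Step 2: on the original dataset, after moving the hard-drive over.
        ("original dataset", ["datalad", "update", "--merge"]),
    ]


if __name__ == "__main__":
    for where, argv in update_round_trip():
        print(f"[{where}] {shlex.join(argv)}")
```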

@jsheunis
Contributor

Most of the above is explained in https://handbook.datalad.org/en/latest/beyond_basics/101-147-riastores.html, but I think this compact use case can still stand on its own as a KBI.

@christian-monch
Contributor

christian-monch commented May 11, 2023

More traffic on this issue today. datalad status generated the error: Unknown commit identifier: master. Asked follow-up questions to the OP, no answer yet.

I have a Windows machine and will spend some time looking into RIA stores on NTFS.

@jsheunis
Contributor

More traffic on this issue today. datalad status generated the error: Unknown commit identifier: master. Asked follow-up questions to the OP, no answer yet.

The user reported that this was no longer an issue for them (they no longer need that solution and will not spend more time debugging it). So for the purpose of solving the user's problem, this issue is not needed anymore. But for the purpose of writing a KBI, this issue can remain open, pending a test on a Windows system and the KBI writeup.
