Rolling back file versions with the same filename can break characterization and derivative jobs #5676

Closed
conorom opened this issue Jun 13, 2022 · 0 comments · Fixed by #5694

conorom commented Jun 13, 2022

Descriptive summary

This almost certainly affects all versions of Hyrax, and definitely 2.9.0 through 3.4.1 as well as the main branch.

Rationale

Rolling back to an older version (a.k.a. revision) of a FileSet is the only place that calls CharacterizeJob or CreateDerivativesJob without a filepath. That makes it the only place (outside of one occurrence in rake tasks) that causes Hyrax::WorkingDirectory to pull a copy of the file from the repository into a NOID-based pairtree folder inside working_path.

aside: The whole WorkingDirectory thing was slated for removal in a TODO left on this PR, which says to use JobIOWrapper instead. I may spin off another ticket for that after this one, as it seems to have been forgotten at this point.

The WorkingDirectory.find_or_retrieve() method relies on the filename alone to decide whether the version that has just been "rolled back to" is the one already cached in that directory. Any old version cached in working_path will be used if its name matches the current version's original_name. The system never clears these cached files out itself. Admittedly, we do delete uploaded files periodically in heliotrope, and perhaps this is recommended in a Hyrax setup Wiki somewhere. Not sure.
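
To make the failure mode concrete, here is a minimal sketch of a filename-keyed cache. It is not Hyrax's code: `NameKeyedCache`, `stale_cache_demo`, the `"abc123"` id and the ISBN-style filename are all hypothetical, and the NOID pairtree is flattened to a single folder per id. It only models the lookup behaviour described above.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for a working-directory cache keyed on filename.
# Assumption: this models the described behaviour only; it is not Hyrax code.
class NameKeyedCache
  def initialize(root)
    @root = root
  end

  # Returns a local path for the file. The "repository" content is copied
  # only when no file of that name is already cached under the given id.
  def find_or_retrieve(file_set_id, original_name, repository_content)
    path = File.join(@root, file_set_id, original_name)
    return path if File.exist?(path) # name matches, content is never compared

    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, repository_content)
    path
  end
end

# Requests two different versions that share one filename and returns
# the content each request actually resolved to.
def stale_cache_demo
  Dir.mktmpdir do |root|
    cache = NameKeyedCache.new(root)
    name = "9780000000000.jpg"
    first  = File.read(cache.find_or_retrieve("abc123", name, "version 1 bytes"))
    second = File.read(cache.find_or_retrieve("abc123", name, "version 2 bytes"))
    [first, second]
  end
end
```

In this sketch the second request is handed the first version's cached copy, because nothing but the name is checked before the cached path is returned.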

It may seem unlikely that two rollbacks would occur through the UI where both versions have the same filename. In practice it is very likely that the same name has to be reused when it is pertinent to the content (some sort of ID, or in our case a book ISBN). And, as mentioned, any other call to CharacterizeJob or CreateDerivativesJob with no filepath parameter will cause this problem too, for example a dev working in the console or triggering a rake task. The task linked above would cache a working_path file for every FileSet in the system.

Expected behavior

Nothing in the UI should cause CharacterizeJob or CreateDerivativesJob to run on a file that is not the FileSet's current version.
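
One way this expectation could be met, sketched under assumptions and not necessarily what the eventual fix does, is to make the cache key depend on the version's content rather than on the filename alone. `VersionAwareCache`, `version_aware_demo` and the use of a SHA1 digest as the key are hypothetical choices for illustration.

```ruby
require "digest/sha1"
require "tmpdir"
require "fileutils"

# Hypothetical cache that keys on the expected content digest as well as
# the filename, so two versions sharing a name cannot collide.
# Assumption: the caller knows the current version's digest up front.
class VersionAwareCache
  def initialize(root)
    @root = root
  end

  # The block is called only on a cache miss and must return the bytes
  # of the current version.
  def find_or_retrieve(file_set_id, original_name, expected_sha1)
    path = File.join(@root, file_set_id, expected_sha1, original_name)
    if File.exist?(path) && Digest::SHA1.file(path).hexdigest == expected_sha1
      return path
    end

    FileUtils.mkdir_p(File.dirname(path))
    File.write(path, yield)
    path
  end
end

# Requests version 1, then version 2, then version 1 again, all under the
# same filename, and returns the content each request resolved to.
def version_aware_demo
  Dir.mktmpdir do |root|
    cache = VersionAwareCache.new(root)
    v1 = "version 1 bytes"
    v2 = "version 2 bytes"
    [v1, v2, v1].map do |bytes|
      digest = Digest::SHA1.hexdigest(bytes)
      File.read(cache.find_or_retrieve("abc123", "9780000000000.jpg", digest) { bytes })
    end
  end
end
```

Passing an explicit filepath to the jobs, or clearing the cached copy before each run, would be alternative approaches. The digest key is used here only because it keeps the example self-contained.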

Actual behavior

CharacterizeJob or CreateDerivativesJob will run on a file that is not the FileSet's current version whenever you roll back between versions that share the same filename.

Steps to reproduce the behavior

  1. Upload an image FileSet to a Work. Let the jobs finish and note the thumbnail, file size and checksum in the UI
  2. Upload a new version to the FileSet, using the same filename but a different image and file size. Again, note the characterization metadata in the UI.
  3. In the versions tab, revert the FileSet to the first version. Allow jobs to finish. The thumbnail and metadata will be correct. Note that this is where the WorkingDirectory copy is made.
  4. Now revert to the second version. Allow jobs to finish. The thumbnail and characterization are generated from the wrong file, the one cached to disk in step 3.

Related work

TODO
