Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to disable checksums for external files and blobs #1025

Merged
merged 32 commits into from
Jun 13, 2022

Conversation

jverswijver
Copy link
Contributor

@jverswijver jverswijver commented May 17, 2022

Adds an option to disable checksums when calling fetch or fetch1.
This has big implications when fetching large(100gb+) files as then can take a long time to verify the checksum. I also added a warning/confirm message that indicates to the user that they are losing the guarantee that their data is unmodified. You can disable the warning by setting dj.config['safemode']=False

here is an example of the warning prompt and safemode usage:

I have implemented a solution to disable checksums when the filesize goes over a certain threshold.
The way to use this is to set filepath_checksum_size_limit in you DJ config to the size threshold that you want in bytes.
It defaults to None and will always use checksums when none.

@dimitri-yatsenko dimitri-yatsenko changed the title Add option to disable checksums Add option to disable checksums for external files and blobs May 17, 2022
Copy link
Member

@dimitri-yatsenko dimitri-yatsenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

work is needed

datajoint/fetch.py Outdated Show resolved Hide resolved
datajoint/fetch.py Outdated Show resolved Hide resolved
datajoint/fetch.py Outdated Show resolved Hide resolved
datajoint/fetch.py Outdated Show resolved Hide resolved
datajoint/fetch.py Outdated Show resolved Hide resolved
@dimitri-yatsenko
Copy link
Member

Perhaps a better way would be to switch checksums globally rather than in each fetch call.

@guzman-raphael
Copy link
Collaborator

@dimitri-yatsenko What if we added a dj.config setting that takes a list of tables to disable checksum verification on? That way the code can remain the same though the config will persist but users won't just naively apply it everywhere. We can start ignoring it after a certain release (with a warning initially). That way we have a plan to enforce (again) the checksums once it has been properly optimized.

The main reason we had included this at the scope of each fetch is so the user will do a value assessment to determine if a particular fetch could really use the benefit of disabling checksums as opposed to a blanket application everywhere. This could still protect many other external fetches if the data is <10GiB. We will eventually upgrade the efficiency of checksums; it will just take some time to get it right so this is a stopgap since urgent improvement is necessary.

@dimitri-yatsenko
Copy link
Member

I agree that a global dj.config setting separate from safemode would be appropriate for checksum verification on fetch.

@guzman-raphael
Copy link
Collaborator

How about to summarize:

  • We'll introduce a new dj.config key (disable_fetch_file_verification). It's value is a list of full table names i.e. [`schema`.`table`]
  • This will only affect filepath@store attributes. This is easier since attachment and filepath checksums are handled in 2 different places (fetch.py:_get, external.py:download_filepath) and it will get messy. Trying not to refactor since we are thinking of a new solution to supersede external storage anyway.
  • We'll need to add a new kwarg to download_filepath so that it knows the table name for the fetch that is triggering the download and we can handle it all in one place. Pass the table name associated with the file download to download_filepath so that it can perform the validation against the config.
  • If this key exists and the table associated with the file download is in the list, then disable the checksum validation and throw a warning during the download step.

@dimitri-yatsenko Thumbs up if you agree and let me know if you want to catch up on this.

@dimitri-yatsenko
Copy link
Member

dimitri-yatsenko commented May 17, 2022

@guzman-raphael

  1. Perhaps the setting should be global rather than doing this for each file. Or perhaps we could do this for specific schemas.
  2. Can we use a positive formulation, e.g. disable_fetch_file_verification -> verify_file_checksum_on_fetch. Then statements with double negatives such as if not disable_fetch_file_verification: become a positive if verify_file_checksum_on_fetch.
  3. I agree to only doing this for filepath. For attach, different files can have the same name, so checksum becomes a file identification mechanism, not just integrity check.

@guzman-raphael
Copy link
Collaborator

guzman-raphael commented May 18, 2022

@dimitri-yatsenko

  1. I disagree with this only because I would still discourage usage of disabling checksums. I'm normally in favor of simplicity but here it would cost us disabling this integrity check everywhere. There are probably only a handful of tables tracking such large raw data, right?

  2. This would be nice but don't think it would be compatible with 1 as it would be cumbersome to explicitly list all tables. The check could be if table in disable_fetch_file_verification or checksum().

  3. 👍

@guzman-raphael
Copy link
Collaborator

@dimitri-yatsenko and I met yesterday and discussed this. We converged on a simpler approach that could address the issue more directly.

To summarize:

  • Add a new global config to dj.config that will specify a file size limit where checksums will only be evaluated for fetch of filepath files less than that size.
  • Perhaps it could be called filepath_checksum_size_limit. It could take its value as an integer representing the bytes.
  • None (default) would indicate no limit i.e. perform the checksum for all filepath fetches.
  • 0 would be discouraged but could basically disable fetch checksums for all filepath.
  • Throw a warning if a checksum is skipped due to this setting.
  • We'd recommend that for users interested in using this stopgap (until further optimization is developed) to prevent verification, they should do this at their own risk and ideally tune it to a reasonable size that is affecting performance for their case.

Copy link
Collaborator

@guzman-raphael guzman-raphael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jverswijver ✨ Great work! This is coming together nicely. ✨

Suggesting minor points to simplify some sections.

Additionally, could you:

  • Add changelog/releaselog
  • Add a test

datajoint/external.py Outdated Show resolved Hide resolved
datajoint/external.py Outdated Show resolved Hide resolved
datajoint/external.py Outdated Show resolved Hide resolved
datajoint/external.py Outdated Show resolved Hide resolved
@jverswijver jverswijver marked this pull request as draft May 25, 2022 15:19
@jverswijver jverswijver marked this pull request as ready for review May 27, 2022 15:28
@jverswijver
Copy link
Contributor Author

Now relies on #1031 to be merged first. This is due to needing to use the new logger to test if a warning was thrown when disabling checksums.

Copy link
Collaborator

@guzman-raphael guzman-raphael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jverswijver 👏 Nicely done! 👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Speed up retrieval of files from stores Option to skip checksum when fetching filepath
5 participants