Support custom fingerprinting with Dataset.from_generator
#6194
Comments
I agree it should be easier to bypass the hashing mechanism in this instance, too. However, we should probably first address #5080 before solving this (e.g., maybe exposing |
Adding +1 here: if the generator needs to access external resources or state, it's not always straightforward to make it picklable. So I'd like to be able to override the default cache key derivation, which currently needs to pickle the generator (and of course, I'd accept responsibility for that part of cache consistency). This appears to be a recurring roadblock: #6118 #5963 #5819 #5750 #4983
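To make the pickling problem concrete, here is a minimal sketch using only the standard library. `LockedSource` and `is_picklable` are hypothetical names made up for this illustration, and the standard `pickle` module is only a stand-in: `datasets` hashes generators with its own dill-based machinery, so the exact error there may differ.

```python
import pickle
import threading


class LockedSource:
    """Hypothetical generator-like callable that holds an unpicklable resource."""

    def __init__(self, items):
        self.items = list(items)
        # The lock stands in for a DB handle, socket, open file, etc.
        self.lock = threading.Lock()

    def __call__(self):
        with self.lock:
            for item in self.items:
                yield {"value": item}


def is_picklable(obj):
    """Return True if the standard-library pickler can serialize obj."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


source = LockedSource([1, 2, 3])
rows = list(source())             # calling the generator works fine
picklable = is_picklable(source)  # serializing it fails because of the lock
```

The generator is perfectly usable, yet any cache key that depends on serializing it cannot be computed, which is the situation described above.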
Silly hack incoming:

```python
import uuid


class _DatasetGeneratorPickleHack:
    def __init__(self, generator, generator_id=None):
        self.generator = generator
        # Random ID by default; pass a stable one to get a stable pickle.
        self.generator_id = (
            generator_id if generator_id is not None else str(uuid.uuid4())
        )

    def __call__(self, *args, **kwargs):
        return self.generator(*args, **kwargs)

    def __reduce__(self):
        # Pickle only the ID, never the wrapped generator.
        return (_DatasetGeneratorPickleHack_raise, (self.generator_id,))


def _DatasetGeneratorPickleHack_raise(*args, **kwargs):
    raise AssertionError("cannot actually unpickle _DatasetGeneratorPickleHack!")
```

Now |
I'd like some way to do this too. I find that sometimes the hash doesn't cover enough, so the dataset is not regenerated even when the underlying data has changed. By supplying a custom fingerprint I could do a better job of controlling when my dataset is regenerated.
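One way such a custom fingerprint could be derived from the underlying data is sketched below, assuming the data lives in a directory of files. `directory_fingerprint` is a hypothetical helper, not part of `datasets`, and how the resulting string would be handed to `from_generator` is exactly what this issue leaves open.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_fingerprint(root):
    """Hash the relative path, size and mtime of every file under root."""
    digest = hashlib.sha256()
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        entry = f"{path.relative_to(root).as_posix()}|{stat.st_size}|{stat.st_mtime_ns}\n"
        digest.update(entry.encode("utf-8"))
    # Truncated only to keep the string short; the length is an arbitrary choice.
    return digest.hexdigest()[:16]


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.png").write_bytes(b"first image")
    before = directory_fingerprint(tmp)
    unchanged = directory_fingerprint(tmp)
    Path(tmp, "b.png").write_bytes(b"second image")
    after = directory_fingerprint(tmp)  # differs from `before`: a file was added
```

Hashing metadata rather than file contents keeps this cheap on large directories, at the cost of missing edits that preserve both size and modification time.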
I ran into the same thing: my actual generator reads from a disk source that might have new data (images) available at some point, and the cached dataset is reused without the generator being called again. Thanks for the hack @mlin 👋
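For a data source that changes over time, the hack can be given an explicit identifier that the caller bumps when the data changes. The sketch below re-states the wrapper under hypothetical names (`GeneratorWithId`, `make_reader`, `pickle_digest`) and uses the standard `pickle` module to show that the serialized form depends only on the identifier. It assumes, without verifying it here, that the dill-based hashing in `datasets` honors `__reduce__` in the same way.

```python
import hashlib
import pickle
import threading


def _raise_on_unpickle(*args, **kwargs):
    raise AssertionError("this wrapper is not meant to be unpickled")


class GeneratorWithId:
    """Pickle only an identifier, never the wrapped generator."""

    def __init__(self, generator, generator_id):
        self.generator = generator
        self.generator_id = generator_id

    def __call__(self, *args, **kwargs):
        return self.generator(*args, **kwargs)

    def __reduce__(self):
        return (_raise_on_unpickle, (self.generator_id,))


def make_reader(lock):
    """Build a generator function closing over a lock (not picklable by itself)."""
    def read_images():
        with lock:
            yield {"path": "img_0.png"}
    return read_images


def pickle_digest(obj):
    """Digest of the pickled bytes, standing in for a cache key."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


reader = make_reader(threading.Lock())
v1 = pickle_digest(GeneratorWithId(reader, "images-v1"))
v1_again = pickle_digest(GeneratorWithId(reader, "images-v1"))
v2 = pickle_digest(GeneratorWithId(reader, "images-v2"))  # new ID, new digest
```

With a fixed identifier the digest is stable across wrapper instances, and changing the identifier is what signals that the data should be regenerated.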
Feature request
When using `Dataset.from_generator`, the generator is hashed when building the fingerprint. Similar to `.map`, it would be interesting to let the user bypass this hashing by accepting a `fingerprint` argument to `.from_generator`.
Motivation
Using the `.from_generator` constructor with a non-picklable generator fails. By accepting a `fingerprint` argument to `.from_generator`, the user would have the opportunity to manually fingerprint the dataset and thus bypass the crash.
Your contribution
If validated, I can try to submit a PR for this.