-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add repo_id to DatasetInfo #6268
base: main
Are you sure you want to change the base?
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
In #4129 we want to track the origin of a dataset, e.g. if it comes from multiple datasets. I think it's out of scope of DatasetInfo alone, which has info for one dataset only. IMO if we want to track multiple origins we will need a new DatasetInfo that would have fields relevant to a mix of datasets (out of scope of this PR) cc @mariosasko I'd like your opinion on this |
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Really happy to see this! It could also be helpful to track some other metadata about how the dataset was built in the future. i.e. for the Stack loaded like this:
It could be helpful to have easy access to the |
Perhaps it is also interesting to track the revision? I suppose the version also kind of covers that. That said, this is looking great already! I'm quite excited about this. Losing the |
One other thought. Is it worth tracking if a The Hub ID for private datasets could in some cases contain information someone wouldn't want to make public i.e. Adding a bool like if ds.is_private and not push_hub_id_for_private_ds:
ds_name = None Potentially this is overkill but could be useful for downstream libraries who might use this information for creating automatic model cards. |
We should probably find a way to remove |
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
related to #4129
TODO: