-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
716 more Hugging Face datasets can be read by mlcroissant. #532
Conversation
- Readability improvement: we remove the complex function `last_operations`. - Correctness: the operation graph has less bugs (see u.a. "huggingface-c4"). - Performance improvement: we remove a Dijkstra+for-loop (= O(n^2)) for a hashmap storing the last operations (= O(1)). Example of dataset that used to timeout and is now usable: ```python import mlcroissant as mlc mlc.Dataset("https://datasets-server.huggingface.co/croissant?dataset=gcaillaut/citeseer") ```
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
continue | ||
return list(operations) | ||
"""Overwrites nx.DiGraph to keep track of last operations. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or "pointer to the chain of last operations"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
): | ||
"""Adds all operations for a node of type `Field`. | ||
"""Adds all operations for a node of type `RecordSet`. | ||
|
||
Operations are: | ||
|
||
- `Join` if the field comes from several sources. | ||
- `ReadFields` to specify how the fields are read. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add Data
operation here?
Also, change "if the field" with "if any of the fields related to the recordset"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the description because it doesn't bring anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's really useful, thanks!
NIT: I understand there is probably no space in this PR, but we could add a few tests for the new functions?
Thanks for the review! This is a refactoring of the existing code, so the current tests still hold. |
Readability improvement: we remove the function
last_operations
.Correctness: the operation graph has less bugs (see u.a. "huggingface-c4" or "coco204-mini"):
O(n^2)
) in profit of a hashmap storing the last operations (O(1)
).Example of dataset that used to timeout and is now usable:
More than 700 Hugging Face datasets were similarly impacted (see the announcement).
Fixes: #310 and #525.