716 more Hugging Face datasets can be read by mlcroissant. #532

Merged: 2 commits merged into main from feature/last-operations on Feb 20, 2024

@marcenacp marcenacp commented Feb 19, 2024

  • Readability improvement: we remove the function `last_operations`.

  • Correctness: the operation graph has fewer bugs (see, among others, "huggingface-c4" or "coco204-mini").

  • Performance improvement: we replace a Dijkstra+for-loop (O(n^2)) with a hashmap storing the last operations (O(1)).
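
The hashmap idea from the performance bullet can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `OperationGraph`, `add_operation` and `walk_to_last` are hypothetical names, the operation names are plain strings chosen for the example, and a small dict-based graph stands in for the `nx.DiGraph` subclass.

```python
class OperationGraph:
    """Hypothetical stand-in for an `nx.DiGraph` subclass that tracks last operations."""

    def __init__(self):
        # Adjacency list: operation -> operations that consume its output.
        self.successors = {}
        # Hashmap: node name -> last operation added to that node's chain.
        self.last_operation = {}

    def add_operation(self, node, operation):
        """Appends `operation` to the chain of `node` and records it as the last one."""
        self.successors.setdefault(operation, [])
        previous = self.last_operation.get(node)
        if previous is not None:
            self.successors[previous].append(operation)
        self.last_operation[node] = operation  # O(1) update

    def walk_to_last(self, first_operation):
        """Slow baseline: follows edges from the start of a chain to its end."""
        current = first_operation
        while self.successors[current]:
            current = self.successors[current][0]
        return current


graph = OperationGraph()
for operation in ["Download", "Extract", "ReadFields"]:
    graph.add_operation("my_file", operation)

# Both approaches agree, but the hashmap lookup does not traverse the graph.
assert graph.last_operation["my_file"] == "ReadFields"
assert graph.walk_to_last("Download") == "ReadFields"
```

The trade-off in this sketch is a little extra bookkeeping on every insertion in exchange for never having to search the graph when the next operation needs to be chained.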

Example of a dataset that used to time out and is now usable:

```python
import mlcroissant as mlc
mlc.Dataset("https://datasets-server.huggingface.co/croissant?dataset=gcaillaut/citeseer")
```

More than 700 Hugging Face datasets were similarly impacted (see the announcement).

Fixes: #310 and #525.

@marcenacp marcenacp requested a review from a team as a code owner February 19, 2024 21:39

github-actions bot commented Feb 19, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@marcenacp marcenacp changed the title from "716 more Hugging Face datasets" to "716 more Hugging Face datasets can be read by mlcroissant." Feb 19, 2024
continue
return list(operations)
"""Overwrites nx.DiGraph to keep track of last operations.

Contributor

operations?

Contributor

Or "pointer to the chain of last operations"?

Contributor Author

Done.

):
"""Adds all operations for a node of type `Field`.
"""Adds all operations for a node of type `RecordSet`.

Operations are:

- `Join` if the field comes from several sources.
- `ReadFields` to specify how the fields are read.
Contributor

Add Data operation here?

Also, replace "if the field" with "if any of the fields related to the RecordSet"?

Contributor Author

I removed the description because it doesn't add anything.

Contributor

@ccl-core ccl-core left a comment

That's really useful, thanks!

NIT: I understand there is probably no space in this PR, but we could add a few tests for the new functions?

@marcenacp
Contributor Author

> NIT: I understand there is probably no space in this PR, but we could add a few tests for the new functions?

Thanks for the review! This is a refactoring of the existing code, so the current tests still hold.

@marcenacp marcenacp merged commit 5f52227 into main Feb 20, 2024
14 checks passed
@marcenacp marcenacp deleted the feature/last-operations branch February 20, 2024 12:44
@github-actions github-actions bot locked and limited conversation to collaborators Feb 20, 2024
Development

Successfully merging this pull request may close these issues.

[Technical debt] When constructing operations, we could iterate on RecordSets instead of iterating on Fields.
2 participants