
Add generate_to_hf Method to FastData Class #9

Merged 7 commits into AnswerDotAI:main on Nov 21, 2024

Conversation

@AndreaFrancis (Contributor) commented Nov 9, 2024

Related Issue:
This PR addresses Issue #3.

Introduced a new method, generate_to_hf, in the FastData class. It integrates FastData with the Hugging Face Hub by generating datasets and uploading them directly, controlled by the following new parameters:

  • repo_id: The Hugging Face Hub dataset repository ID (e.g. username/dataset_name).
  • max_items_per_file: The maximum number of items to save in each file. Defaults to 100.
  • commit_every: The number of minutes between each commit. Defaults to 5.
  • private: Whether the repository is private. Defaults to False.
  • token: The token to use to commit to the repo. Defaults to the token saved on the machine.
  • delete_files_after: Whether to delete files after processing. Defaults to True.

Functionality:

  • Saves records in files with at most max_items_per_file records.
  • Automatically commits and pushes data to the Hugging Face Hub every commit_every minutes.
  • Generates a README.md (Dataset Card) with metadata, including:
      • A dataset description.
      • Details on the generation process.
      • The fastdata and synthetic tags, for easier discovery of datasets created using this library.
  • Optionally removes generated files after uploading.
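The file-splitting behavior described above can be sketched in plain Python. This is an illustrative stand-in, not the actual fastdata implementation; the function name save_in_chunks and the JSON-lines file layout are assumptions for the example:

```python
import json
from pathlib import Path


def save_in_chunks(records, out_dir, max_items_per_file=100):
    """Write records as JSON-lines files holding at most max_items_per_file rows each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(records), max_items_per_file):
        chunk = records[i:i + max_items_per_file]
        # e.g. train-00000.jsonl, train-00001.jsonl, ...
        path = out_dir / f"train-{i // max_items_per_file:05d}.jsonl"
        with open(path, "w") as f:
            for rec in chunk:
                f.write(json.dumps(rec) + "\n")
        paths.append(path)
    return paths
```

With 250 records and the default max_items_per_file=100, this produces three files of 100, 100, and 50 rows; a background scheduler can then pick up and commit whatever files exist at each interval.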

Users will find datasets tagged with fastdata on Hugging Face Datasets, making it easier to explore AI-generated data.

Sample Output:
Check out a sample dataset created using this method: FastData Example.

@Wauplin (Contributor) left a comment

Hey @AndreaFrancis, thanks for working on this! I've left a few comments regarding the huggingface_hub implementation. Let me know if you have any questions :)

@ncoop57 (Contributor) commented Nov 19, 2024

Sorry this sat for so long, @AndreaFrancis! Please @ me or request a review from me so I get notified; otherwise it risks getting lost in my inbox 😅. This is awesome, thank you so much for putting this together!! I'm very excited to use this myself, since we have a project that will be a lot easier with this added 🤓.

As it stands now, I believe that the order of the inputs will not be preserved in what gets saved to the Hub. I think it would be quite difficult to add that functionality, so just documenting the limitation is fine (the original generate function does preserve order), unless you know of an easy way of doing it while uploading to the Hub.

Could you add that note, and also the suggestions from @Wauplin, to your PR, making sure they're added to the .ipynb files instead of the .py files? After that, I think it is good to go 🤓

@AndreaFrancis (Contributor, Author)

Thank you for your feedback, @Wauplin and @ncoop57! I’ve addressed all your suggestions regarding CommitScheduler and included the order topic in the code.

@ncoop57, please let me know if I missed anything. Here are the steps I followed:

  • Modified the nbs/00_core.ipynb file to add the generate_to_hf method and a simple test for it.
  • Added huggingface_hub as a dependency in settings.ini.
  • Created a simple example of the generate_to_hf method in examples/push_to_hf.py.
  • Ran nbdev_prepare successfully, which generated the code for fastdata/core.py.

Let me know if any further adjustments are needed :)

@ncoop57 (Contributor) commented Nov 21, 2024

Hey @AndreaFrancis, looks good. However, I'm having issues running the code: it executes successfully, but the generated repo is empty. I double-checked my HF token and it is properly set.

I tried a fresh pip install of the repo and ran python push_to_hf.py, and it printed this to the terminal:

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00,  9.26it/s]
A new repository has been create on ncoop57/my_dataset
[I am going to the beach this weekend ➡ *Voy a la playa este fin de semana*, I am going to the gym after work ➡ *Voy al gimnasio después del trabajo*, I am going to the park with my kids ➡ *Voy al parque con mis hijos*, I am going to the movies with my friends ➡ *Voy al cine con mis amigos*, I am going to the store to buy some groceries ➡ *Voy a la tienda a comprar algunos comestibles*, I am going to the library to read some books ➡ *Voy a la biblioteca a leer algunos libros*, I am going to the zoo to see the animals ➡ *Voy al zoológico a ver a los animales*, I am going to the museum to see the art ➡ *Voy al museo a ver el arte*, I am going to the restaurant to eat some food ➡ *Voy al restaurante a comer algo de comida*]

I similarly ran the code cells in core.ipynb with no success. Here are the two repos it created:

  1. https://huggingface.co/datasets/ncoop57/my_dataset/tree/main
  2. https://huggingface.co/datasets/ncoop57/personas-translation-5ca1d7d5-b58a-463a-92f9-898a2aa2a015

I also checked the local file system and nothing was created. Any ideas for what I'm doing wrong?

AndreaFrancis and others added 2 commits November 21, 2024 07:55
Co-authored-by: Lucain <lucainp@gmail.com>
@AndreaFrancis (Contributor, Author)

Thank you for your guidance! I had initially expected this behavior too, but I mistakenly tried fixing it with scheduler._push_to_hub() and scheduler.stop() 😅. Now I see that using scheduler.trigger().result() is a much better approach.
I ran the test and added validations for both dataset creation and row counts after executing the generate_to_hf method. Additionally, I successfully created the following datasets:

This might be ready for another review (hopefully, the final one! 🤞)
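For context on why scheduler.trigger().result() is the safer final flush: in huggingface_hub, CommitScheduler.trigger() returns a Future, so calling .result() on it blocks until the pending commit has actually been pushed, whereas stopping the scheduler may leave the last batch unuploaded. The blocking behavior can be illustrated with a plain executor stand-in (CommitScheduler itself requires a Hub repo and token, so push_pending_files below is a simulated upload, not a real API call):

```python
from concurrent.futures import ThreadPoolExecutor
import time


def push_pending_files():
    """Simulated upload, standing in for the commit that trigger() schedules."""
    time.sleep(0.05)  # pretend this is the network round-trip
    return "pushed"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(push_pending_files)  # analogous to scheduler.trigger()
    status = future.result()                  # blocks until the push completes

# Only after .result() returns is it safe to delete the local files.
print(status)
```

The same pattern applies with the real scheduler: trigger the commit, wait on its future, and only then clean up local state.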

@Wauplin (Contributor) left a comment

Great! Looking good from a huggingface_hub side! 🤗

@ncoop57 (Contributor) commented Nov 21, 2024

Great, thanks so much @AndreaFrancis and @Wauplin !! Super excited to use this feature 🤓

@ncoop57 ncoop57 merged commit f6c40a7 into AnswerDotAI:main Nov 21, 2024
@davanstrien
Amazing work, this is a super nice feature!
