
Implementing Bloom Annotator #978

Open: wants to merge 20 commits into dev

Conversation

@ian-cho (Collaborator) commented Jan 27, 2025

Why are these changes needed?

Add an additional column to indicate whether the document is present in the pre-existing Bloom filter model. This is specifically intended for the GneissWeb release.
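The idea above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the real transform loads a pre-trained filter file (the model discussed below), whereas this toy builds a tiny pure-Python Bloom filter, and `TinyBloom`, `annotate`, and the `in_bloom` column name are all invented for the sketch.

```python
# Hypothetical sketch of the annotation idea, not the PR's code: the real
# transform loads a pre-trained filter file; a toy pure-Python Bloom filter
# stands in here. TinyBloom, annotate, and "in_bloom" are invented names.
import hashlib


class TinyBloom:
    """A minimal Bloom filter over strings (k hash positions in a bit array)."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k deterministic bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))


def annotate(docs, bloom):
    # Mirror the transform's behavior: append a boolean membership column
    # alongside the document column.
    return [{"contents": d, "in_bloom": d in bloom} for d in docs]


bf = TinyBloom()
bf.add("seen document")
rows = annotate(["seen document", "unseen document"], bf)
```

Since Bloom filters admit false positives but never false negatives, the added column means "possibly in the reference set" rather than a hard guarantee of membership.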

@shahrokhDaijavad (Member) commented:

@ian-cho I reviewed the README file and ran python local_python.py successfully on my laptop. Please add details to issue #981.
Also, as done by all transforms, please add a simple notebook for the Python run-time (and a link to it from the README file).

@ian-cho (Collaborator, Author) commented Jan 28, 2025

@shahrokhDaijavad @touma-I I added a notebook for the Python runtime and linked it from the README file: https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/universal/bloom/bloom_python.ipynb

@shahrokhDaijavad (Member) commented:

@ian-cho I haven't had time to test the notebook, but I just realized that you are using a bf.bloom model. Is it open source and available to download, so we don't need to include it in the repo?

@ian-cho (Collaborator, Author) commented Jan 29, 2025

@shahrokhDaijavad thanks for pointing this out. Currently, I am not aware of any small, downloadable model suitable for this demonstration, so I trained a small model locally. In the future, we can replace this local model path with a Hugging Face model once one is ready.

@shahrokhDaijavad (Member) commented:

Thank you, @ian-cho. The size of the model (240KB) is quite small, but for us to include it statically in an IBM-owned open repo, we need permission from IBM. I will bring this up with Bhatta, who may be able to get permission for us.

@shahrokhDaijavad (Member) commented:

@BishwaBhatta As you may know, @ian-cho is using a small model that he trained himself for this transform. This is fine for testing, but we cannot merge the code with that model included in our repo. So, if this small model will not be used in the end and an HF model will replace it after successful testing, should we take it out before merging the code?

@ian-cho (Collaborator, Author) commented Jan 31, 2025

@shahrokhDaijavad Hi, thanks for asking.

  • I tested the gneissweb.bloom model (28GB) that Yohei trained, and the current bloom transform code works with it. So, from my side, it is okay to remove that small model and move forward, but I am open to @BishwaBhatta's opinion.
  • @BishwaBhatta asked whether the code can process fineweb parquet files, which have more than 30 columns. Among other differences, the document column there is called text, not contents. I think we went through this naming issue for HAP, so shall we stick to the name contents instead of text, as in [test1.parquet](https://github.com/ian-cho/data-prep-kit/blob/dev/transforms/universal/bloom/test-data/input/test1.parquet)? Otherwise, I would like to upload a new test1.parquet.

Thank you!

@shahrokhDaijavad (Member) commented:

Thank you, @ian-cho, for testing with the large model. For consistency with everything else in DPK, I think we should stick with contents instead of text for the document column.
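Sticking with contents would mean fineweb-style inputs need their document column renamed before the transform runs. A minimal hypothetical sketch of that normalization step (plain-Python rows rather than parquet, with the helper name invented for illustration):

```python
# Hypothetical helper (name invented for illustration): normalize the
# document column to "contents" so fineweb-style rows, which use "text",
# match the DPK naming convention before the transform runs.
def normalize_doc_column(row: dict, target: str = "contents") -> dict:
    if target in row:
        return row  # already uses the DPK column name
    if "text" in row:
        out = dict(row)
        out[target] = out.pop("text")  # rename text -> contents
        return out
    raise KeyError("no document column ('text' or 'contents') found")


fineweb_row = {"text": "a web document", "url": "https://example.com"}
normalized = normalize_doc_column(fineweb_row)
```

In a real pipeline this renaming would happen on the parquet table rather than per-row dicts, but the mapping is the same.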

@ian-cho (Collaborator, Author) commented Jan 31, 2025

@shahrokhDaijavad ok then, can I comment out this line? The transform does not need text or contents. By doing this, we can stick to the contents naming, and users also do not have to rename the column in fineweb parquet files.

@shahrokhDaijavad (Member) commented:

@touma-I Do you see any problem with what @ian-cho is asking?

@shahrokhDaijavad (Member) commented:

@ian-cho Sorry that I did not test the notebook until today. I did a make venv and, in my venv, ran python local_python.py successfully. I had first deleted the local output folder (it should not have been put in the repo, as it gets created by the run, so please delete it). After the run, the output folder was created and contained the two metadata.json and test1.parquet files. However, when I run the notebook, I get the error below:
[screenshot: error when running the notebook]
What am I missing? Can you please look into this? Thanks.

@ian-cho (Collaborator, Author) commented Feb 3, 2025

Hi, @shahrokhDaijavad, thank you for trying!

> it should not have been put in the repo, as it gets created after the run, so please delete it

Sure, deleted.

> What am I missing? Can you please look into this?

It was a path issue on my end. In your screenshot, please uncomment the 18th line and comment out the 19th line to see if it works.
[screenshot: corrected notebook cell, Feb 3, 2025]

ian-cho and others added 2 commits February 3, 2025 13:03
Signed-off-by: SHAHROKH DAIJAVAD <shahrokh@us.ibm.com>
@shahrokhDaijavad (Member) commented:

Thanks, @ian-cho, for pointing out the problem. I pushed the same change in 2 other files, and now the notebook works well. Thanks for deleting the output folder.
Any idea why the workflow test fails in CI/CD? If not, I will discuss it with @touma-I tomorrow.

@ian-cho (Collaborator, Author) commented Feb 3, 2025

Hi, @shahrokhDaijavad, I have no idea why it failed. Your and @touma-I's help would be appreciated!
