
adding fine tune example with s3 as the dataset store #2006

Merged

merged 6 commits into kubeflow:master on Apr 12, 2024

Conversation

@deepanker13 (Contributor) commented Feb 19, 2024

What this PR does / why we need it:

  1. Modifying s3.py to download the dataset folder.
  2. Adding separate example files that use Hugging Face and S3 as dataset sources.

Minor change

  1. Pinning the peft library with == because a version mismatch between the Docker images and the local Python environment can cause issues.

Which issue(s) this PR fixes (will close the issue(s) when PR gets merged):
Fixes #2005

Checklist:

  • Docs included if any changes are user facing


@coveralls commented Feb 19, 2024

Pull Request Test Coverage Report for Build 8661146734

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 328 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.2%) to 35.144%

Files with Coverage Reduction | New Missed Lines | %
--- | --- | ---
cmd/training-operator.v1/main.go | 49 | 0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go | 63 | 63.25%
pkg/controller.v1/pytorch/pytorchjob_controller.go | 68 | 68.47%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 69 | 64.93%
pkg/controller.v1/tensorflow/tfjob_controller.go | 79 | 76.04%

Totals Coverage Status
Change from base Build 8585006097: -0.2%
Covered Lines: 4347
Relevant Lines: 12369

💛 - Coveralls

@deepanker13 (Contributor, Author) commented:

@johnugeorge @andreyvelich

@deepanker13 (Contributor, Author) commented Feb 26, 2024:

Hi @jinchihe @kuizhiqing, can you please review it?

@@ -2,7 +2,7 @@ einops>=0.6.1
 transformers_stream_generator==0.0.4
 boto3==1.33.9
 transformers>=4.20.0
-peft>=0.3.0
+peft==0.3.0
Member commented:

Why do we need to constrain to v0.3.0?

@deepanker13 (Contributor, Author) replied:

So that there is no version conflict between the user's virtual environment and the peft version inside the container. If there is a mismatch, it throws an error.

Comment on lines 51 to 55:

    response = s3_client.list_objects_v2(
        Bucket=self.config.bucket_name, Prefix=self.config.file_key
    )
    print(f"File downloaded to: {VOLUME_PATH_DATASET}")
    # Download the file
    for obj in response.get("Contents", []):
Member commented:

IIUC, this list_objects_v2 API is a heavy load. So, could you use the objects API and its filter function? https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/bucket/objects.html#filter

@deepanker13 (Contributor, Author) replied:

done
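For context, a minimal sketch of what the suggested change could look like, assuming the config fields and VOLUME_PATH_DATASET constant quoted in this thread; the download_dataset wrapper itself is a hypothetical helper:

    import os
    import boto3

    # Sketch of the reviewer's suggestion: iterate the bucket's objects
    # collection with filter() instead of calling list_objects_v2 directly.
    # Config field names follow the snippets quoted in this review thread.
    def download_dataset(config, volume_path_dataset):
        s3 = boto3.resource(
            "s3",
            aws_access_key_id=config.access_key,
            aws_secret_access_key=config.secret_key,
            endpoint_url=config.endpoint_url,
        )
        bucket = s3.Bucket(config.bucket_name)
        # Only objects under the dataset prefix are listed.
        for obj in bucket.objects.filter(Prefix=config.file_key):
            if obj.key.endswith("/"):  # skip folder placeholder keys
                continue
            target = os.path.join(volume_path_dataset, obj.key)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            bucket.download_file(obj.key, target)
        print(f"Files downloaded to: {volume_path_dataset}")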

@@ -172,8 +172,10 @@ def train(

    if isinstance(dataset_provider_parameters, S3DatasetParams):
        dp = "s3"
        dataset_name = dataset_provider_parameters.file_key.replace("_" * 3, "/")
Member commented:

Why do we need this replacement?

@deepanker13 (Contributor, Author) replied:

removed it

@@ -12,12 +12,13 @@
},
@tenzen-y (Member) commented Mar 2, 2024 (via ReviewNB):

Line #3.        name="huggingface-test",

Should we keep using a separate name so that we can concurrently deploy jobs with the S3 example?

@deepanker13 (Contributor, Author) replied:

created a separate file

@@ -12,12 +12,13 @@
},
@tenzen-y (Member) commented Mar 2, 2024 (via ReviewNB):

Line #3.        name="s3-test",

ditto

@@ -12,12 +12,13 @@
},
@tenzen-y (Member) commented Mar 2, 2024 (via ReviewNB):

Line #17.        dataset_provider_parameters=S3DatasetParams(

What are this endpoint and these credentials? Are you OK with exposing this information?

@@ -12,12 +12,13 @@
},
@tenzen-y (Member) commented Mar 2, 2024 (via ReviewNB):

Could you prepare dedicated calls for s3 and huggingface?

@rimolive (Member) commented Apr 1, 2024:

@deepanker13 Can you rebase this PR?

@deepanker13 (Contributor, Author) replied:

> @deepanker13 Can you rebase this PR?

sure, making other changes as well

@rimolive (Member) commented Apr 5, 2024:

Can you fix the merge conflicts?

@deepanker13 (Contributor, Author) replied:

> Can you fix the merge conflicts?

@rimolive @tenzen-y please review it again

@rimolive (Member) commented Apr 5, 2024:

You still need to sign off the commits.

@deepanker13 (Contributor, Author) replied:

> You still need to sign off the commits.

done

    aws_access_key_id=self.config.access_key,
    aws_secret_access_key=self.config.secret_key,
    endpoint_url=self.config.endpoint_url,
@tenzen-y (Member) commented:

What is the reason that we should pass the endpoint_url into the s3_client.resource?

Member replied:

@tenzen-y With endpoint URLs, it can work with any S3-protocol-compliant implementation.

@tenzen-y (Member) replied:

That makes sense.
Thanks!
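As a small illustration of that point, a sketch of pointing boto3 at a non-AWS store; the endpoint URL and credentials below are hypothetical placeholders for any S3-protocol-compliant service:

    import boto3

    # Passing endpoint_url lets the same boto3 code talk to any
    # S3-protocol-compliant store, not just AWS S3.
    s3 = boto3.resource(
        "s3",
        aws_access_key_id="<access-key>",
        aws_secret_access_key="<secret-key>",
        endpoint_url="http://s3-compatible.example.internal:9000",  # hypothetical endpoint
    )
    for bucket in s3.buckets.all():
        print(bucket.name)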

@@ -0,0 +1,153 @@
{
@tenzen-y (Member) commented Apr 10, 2024 (via ReviewNB):

Line #23.                "access_key": "qEMHyz8wNw",

The access_key and secret_key still remain.

@deepanker13 (Contributor, Author) replied:

they are invalid keys

@@ -0,0 +1,153 @@
{
@tenzen-y (Member) commented Apr 11, 2024 (via ReviewNB):

@deepanker13 Could you set placeholder keys instead of dummy keys, like this:

    # Need to set S3 ACCESS_KEY
    S3_ACCESS_KEY = ""

@deepanker13 (Contributor, Author) replied:

done

@tenzen-y (Member) commented Apr 11, 2024:

@deepanker13 I meant that we should define the vars in a dedicated block, like this:

    # Need to set S3 credentials
    s3_access_key = ""
    s3_secret_key = ""

And then we should use those vars here:

    # it is assumed that for text-related tasks, you have a 'text' column in the dataset.
    # for more info on how the dataset is loaded, check the load_and_preprocess_data function in sdk/python/kubeflow/trainer/hf_llm_training.py
    dataset_provider_parameters=S3DatasetParams(
        {
            "endpoint_url": "http://10.117.63.3",
            "bucket_name": "test",
            "file_key": "imdatta0___ultrachat_1k",
            "region_name": "us-east-1",
    +       "access_key": s3_access_key,
    +       "secret_key": s3_secret_key,
        }
    ),

@deepanker13 (Contributor, Author) replied:

oh ok, done
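Putting the pieces together, a condensed sketch of what the resulting example cell might look like. The import paths, resource values, model URI, and trainer parameters are assumptions based on the SDK referenced in this PR, not an exact copy of the merged notebook; the S3DatasetParams values are the ones quoted above:

    import transformers
    from peft import LoraConfig
    from kubeflow.training import TrainingClient
    from kubeflow.storage_initializer.s3 import S3DatasetParams
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )

    # Need to set S3 credentials
    s3_access_key = ""
    s3_secret_key = ""

    TrainingClient().train(
        name="s3-test",  # separate name so the HF and S3 examples can run concurrently
        num_workers=2,
        num_procs_per_worker=1,
        resources_per_worker={"gpu": 1, "cpu": 8, "memory": "16Gi"},
        # model_uri is illustrative; any Hugging Face model URI works here.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            transformer_type=transformers.AutoModelForCausalLM,
        ),
        # it is assumed that for text-related tasks, you have a 'text' column in the dataset.
        dataset_provider_parameters=S3DatasetParams(
            {
                "endpoint_url": "http://10.117.63.3",
                "bucket_name": "test",
                "file_key": "imdatta0___ultrachat_1k",
                "region_name": "us-east-1",
                "access_key": s3_access_key,
                "secret_key": s3_secret_key,
            }
        ),
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="test_trainer",
                save_strategy="no",
                num_train_epochs=1,
            ),
            lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
        ),
    )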

@deepanker13 (Contributor, Author) commented:

Hi @tenzen-y, can we merge this?

Member commented:

Thx!

/lgtm

/approve

@tenzen-y (Member) left a comment:

/lgtm
/approve

@google-oss-prow (bot) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepanker13, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 83ddd3b into kubeflow:master Apr 12, 2024
38 checks passed
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
* s3 as dataset source code review changes

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* fixing python black test

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* removing conflicts in example file

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* retriggering CI

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* removing dummy keys

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* code review change for adding s3 keys block

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

---------

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>