
adding fine tune example with s3 as the dataset store #2006

Merged

merged 6 commits into kubeflow:master on Apr 12, 2024

Conversation

@deepanker13 (Contributor) commented Feb 19, 2024

What this PR does / why we need it:

  1. Modifying s3.py to download the dataset folder.
  2. Adding separate example files that use Hugging Face and S3 as dataset sources.

Minor change

  1. Pinning the peft library with == because a version mismatch between the Docker images and the local Python environment can cause issues.

Which issue(s) this PR fixes (will close the issue(s) when PR gets merged):
Fixes #2005

Checklist:

  • Docs included if any changes are user facing


@coveralls commented Feb 19, 2024

Pull Request Test Coverage Report for Build 8661146734

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 328 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-0.2%) to 35.144%

Files with Coverage Reduction | New Missed Lines | %
--- | --- | ---
cmd/training-operator.v1/main.go | 49 | 0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go | 63 | 63.25%
pkg/controller.v1/pytorch/pytorchjob_controller.go | 68 | 68.47%
pkg/controller.v1/paddlepaddle/paddlepaddle_controller.go | 69 | 64.93%
pkg/controller.v1/tensorflow/tfjob_controller.go | 79 | 76.04%

Totals Coverage Status
Change from base Build 8585006097: -0.2%
Covered Lines: 4347
Relevant Lines: 12369

💛 - Coveralls

@deepanker13 (Contributor, Author) commented:

@johnugeorge @andreyvelich

@deepanker13 (Contributor, Author) commented Feb 26, 2024:

Hi @jinchihe @kuizhiqing, can you please review it?

@@ -2,7 +2,7 @@ einops>=0.6.1
 transformers_stream_generator==0.0.4
 boto3==1.33.9
 transformers>=4.20.0
-peft>=0.3.0
+peft==0.3.0
Member commented:

Why do we need to constrain to v0.3.0?

@deepanker13 (Contributor, Author) replied:

So that there is no version conflict between the user's virtual environment and the peft version inside the container. If there is a mismatch, it throws an error.

Comment on lines 51 to 55:

    response = s3_client.list_objects_v2(
        Bucket=self.config.bucket_name, Prefix=self.config.file_key
    )
    print(f"File downloaded to: {VOLUME_PATH_DATASET}")
    # Download the file
    for obj in response.get("Contents", []):
Member commented:

IIUC, this list_objects_v2 API is a heavy load. So, could you use the objects API and its filter function? https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/bucket/objects.html#filter

@deepanker13 (Contributor, Author) replied:

done
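For context, a minimal sketch of what the suggested change could look like, assuming the config fields and VOLUME_PATH_DATASET constant quoted in this thread; the download_dataset wrapper itself is a hypothetical helper:

    import os
    import boto3

    # Sketch of the reviewer's suggestion: iterate the bucket's objects
    # collection with filter() instead of calling list_objects_v2 directly.
    # Config field names follow the snippets quoted in this review thread.
    def download_dataset(config, volume_path_dataset):
        s3 = boto3.resource(
            "s3",
            aws_access_key_id=config.access_key,
            aws_secret_access_key=config.secret_key,
            endpoint_url=config.endpoint_url,
        )
        bucket = s3.Bucket(config.bucket_name)
        # Only objects under the dataset prefix are listed.
        for obj in bucket.objects.filter(Prefix=config.file_key):
            if obj.key.endswith("/"):  # skip folder placeholder keys
                continue
            target = os.path.join(volume_path_dataset, obj.key)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            bucket.download_file(obj.key, target)
        print(f"Files downloaded to: {volume_path_dataset}")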

@@ -172,8 +172,10 @@ def train(

    if isinstance(dataset_provider_parameters, S3DatasetParams):
        dp = "s3"
        dataset_name = dataset_provider_parameters.file_key.replace("_" * 3, "/")
Member commented:

Why do we need this replacement?

@deepanker13 (Contributor, Author) replied:

removed it

@@ -12,12 +12,13 @@
},
@tenzen-y (Member) commented Mar 2, 2024 (via ReviewNB):

Line #3.        name="huggingface-test",

Should we keep using a separate name so that we can concurrently deploy jobs with the S3 example?

@deepanker13 (Contributor, Author) replied:

created a separate file

@@ -12,12 +12,13 @@
},
@tenzen-y (Member) commented Mar 2, 2024 (via ReviewNB):

Line #3.        name="s3-test",

ditto

@@ -12,12 +12,13 @@
},
@tenzen-y (Member) commented Mar 2, 2024 (via ReviewNB):

Line #17.        dataset_provider_parameters=S3DatasetParams(

What are this endpoint and these credentials? Are you OK with exposing this information?

@@ -12,12 +12,13 @@
},
@tenzen-y (Member) commented Mar 2, 2024 (via ReviewNB):

Could you prepare dedicated calls for s3 and huggingface?

@rimolive (Member) commented Apr 1, 2024:

@deepanker13 Can you rebase this PR?

@deepanker13 (Contributor, Author) replied:

> @deepanker13 Can you rebase this PR?

sure, making other changes as well

@rimolive (Member) commented Apr 5, 2024:

Can you fix the merge conflicts?

@deepanker13 (Contributor, Author) replied:

> Can you fix the merge conflicts?

@rimolive @tenzen-y please review it again

@rimolive (Member) commented Apr 5, 2024:

You still need to sign off the commits.

@deepanker13 (Contributor, Author) replied:

> You still need to sign off the commits.

done

    aws_access_key_id=self.config.access_key,
    aws_secret_access_key=self.config.secret_key,
    endpoint_url=self.config.endpoint_url,
@tenzen-y (Member) commented:

What is the reason that we should pass the endpoint_url into the s3_client.resource?

Member replied:

@tenzen-y With endpoint URLs, it can work with any S3-protocol-compliant implementation.

@tenzen-y (Member) replied:

That makes sense.
Thanks!
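As a small illustration of that point, a sketch of pointing boto3 at a non-AWS store; the endpoint URL and credentials below are hypothetical placeholders for any S3-protocol-compliant service:

    import boto3

    # Passing endpoint_url lets the same boto3 code talk to any
    # S3-protocol-compliant store, not just AWS S3.
    s3 = boto3.resource(
        "s3",
        aws_access_key_id="<access-key>",
        aws_secret_access_key="<secret-key>",
        endpoint_url="http://s3-compatible.example.internal:9000",  # hypothetical endpoint
    )
    for bucket in s3.buckets.all():
        print(bucket.name)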

@@ -0,0 +1,153 @@
{
@tenzen-y (Member) commented Apr 10, 2024 (via ReviewNB):

Line #23.                "access_key": "qEMHyz8wNw",

The access_key and secret_key still remain.

@deepanker13 (Contributor, Author) replied:

they are invalid keys

@@ -0,0 +1,153 @@
{
@tenzen-y (Member) commented Apr 11, 2024 (via ReviewNB):

@deepanker13 Could you set placeholder keys instead of dummy keys, like this:

    # Need to set S3 ACCESS_KEY
    S3_ACCESS_KEY = ""

@deepanker13 (Contributor, Author) replied:

done

@tenzen-y (Member) commented Apr 11, 2024:

@deepanker13 I meant that we should define the vars in a dedicated block, like this:

    # Need to set S3 credentials
    s3_access_key = ""
    s3_secret_key = ""

And then we should use those vars here:

    # it is assumed that for text-related tasks, you have a 'text' column in the dataset.
    # for more info on how the dataset is loaded, check the load_and_preprocess_data function in sdk/python/kubeflow/trainer/hf_llm_training.py
    dataset_provider_parameters=S3DatasetParams(
        {
            "endpoint_url": "http://10.117.63.3",
            "bucket_name": "test",
            "file_key": "imdatta0___ultrachat_1k",
            "region_name": "us-east-1",
    +       "access_key": s3_access_key,
    +       "secret_key": s3_secret_key,
        }
    ),

@deepanker13 (Contributor, Author) replied:

oh ok, done
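Putting the pieces together, a condensed sketch of what the resulting example cell might look like. The import paths, resource values, model URI, and trainer parameters are assumptions based on the SDK referenced in this PR, not an exact copy of the merged notebook; the S3DatasetParams values are the ones quoted above:

    import transformers
    from peft import LoraConfig
    from kubeflow.training import TrainingClient
    from kubeflow.storage_initializer.s3 import S3DatasetParams
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )

    # Need to set S3 credentials
    s3_access_key = ""
    s3_secret_key = ""

    TrainingClient().train(
        name="s3-test",  # separate name so the HF and S3 examples can run concurrently
        num_workers=2,
        num_procs_per_worker=1,
        resources_per_worker={"gpu": 1, "cpu": 8, "memory": "16Gi"},
        # model_uri is illustrative; any Hugging Face model URI works here.
        model_provider_parameters=HuggingFaceModelParams(
            model_uri="hf://TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            transformer_type=transformers.AutoModelForCausalLM,
        ),
        # it is assumed that for text-related tasks, you have a 'text' column in the dataset.
        dataset_provider_parameters=S3DatasetParams(
            {
                "endpoint_url": "http://10.117.63.3",
                "bucket_name": "test",
                "file_key": "imdatta0___ultrachat_1k",
                "region_name": "us-east-1",
                "access_key": s3_access_key,
                "secret_key": s3_secret_key,
            }
        ),
        trainer_parameters=HuggingFaceTrainerParams(
            training_parameters=transformers.TrainingArguments(
                output_dir="test_trainer",
                save_strategy="no",
                num_train_epochs=1,
            ),
            lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
        ),
    )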

@deepanker13 (Contributor, Author) commented:

Hi @tenzen-y, can we merge this?

Member commented:

Thx!

/lgtm

/approve

@tenzen-y (Member) left a comment:

/lgtm
/approve

@google-oss-prow (bot) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepanker13, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 83ddd3b into kubeflow:master Apr 12, 2024
38 checks passed
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
* s3 as dataset source code review changes

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* fixing python black test

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* removing conflicts in example file

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* retriggering CI

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* removing dummy keys

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

* code review change for adding s3 keys block

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>

---------

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>