adding fine tune example with s3 as the dataset store #2006
Conversation
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB
Pull Request Test Coverage Report for Build 8661146734
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls
Hi @jinchihe @kuizhiqing, can you please review it?
@@ -2,7 +2,7 @@ einops>=0.6.1
 transformers_stream_generator==0.0.4
 boto3==1.33.9
 transformers>=4.20.0
-peft>=0.3.0
+peft==0.3.0
Why do we need to constrain to v0.3.0?
So that there is no version conflict between the user's virtual environment and the peft version inside the container. If there is a mismatch, it throws an error.
response = s3_client.list_objects_v2(
    Bucket=self.config.bucket_name, Prefix=self.config.file_key
)
print(f"File downloaded to: {VOLUME_PATH_DATASET}")
# Download the file
for obj in response.get("Contents", []):
IIUC, this list_objects_v2 API is a heavy load. So, could you use the objects API and filter function? https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/bucket/objects.html#filter
done
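For reference, a minimal sketch of the suggested objects/filter approach; the `config` object, `VOLUME_PATH_DATASET` value, and download loop below are assumptions based on the surrounding diff, not the merged implementation:

```python
import os

import boto3

VOLUME_PATH_DATASET = "/workspace/dataset"  # assumed path; matches the print above


def download_prefix(config) -> None:
    """Download every object under config.file_key into the dataset volume."""
    s3 = boto3.resource(
        "s3",
        aws_access_key_id=config.access_key,
        aws_secret_access_key=config.secret_key,
        endpoint_url=config.endpoint_url,
    )
    bucket = s3.Bucket(config.bucket_name)

    # objects.filter pages through matching keys lazily instead of issuing a
    # single list_objects_v2 call, so only keys under the prefix are touched.
    for obj in bucket.objects.filter(Prefix=config.file_key):
        target = os.path.join(VOLUME_PATH_DATASET, os.path.basename(obj.key))
        bucket.download_file(obj.key, target)
```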
@@ -172,8 +172,10 @@ def train(
if isinstance(dataset_provider_parameters, S3DatasetParams):
    dp = "s3"
    dataset_name = dataset_provider_parameters.file_key.replace("_" * 3, "/")
Why do we need this replacement?
removed it
examples/sdk/train_api.ipynb
Outdated
@@ -12,12 +12,13 @@
}, |
Line #3. name="huggingface-test",
Should we keep using a separate name so that we can concurrently deploy jobs with the S3 example?
created a separate file
examples/sdk/train_api.ipynb
Outdated
@@ -12,12 +12,13 @@
}, |
Line #17. dataset_provider_parameters=S3DatasetParams(
What are this endpoint and these credentials? Are you OK with exposing this information?
@deepanker13 Can you rebase this PR?
sure, making other changes as well
Can you fix the merge conflicts?
You still need to sign off the commits.
done
Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>
aws_access_key_id=self.config.access_key,
aws_secret_access_key=self.config.secret_key,
endpoint_url=self.config.endpoint_url,
What is the reason that we should pass the endpoint_url into the s3_client.resource?
@tenzen-y With endpoint URLs, it can work with any S3-protocol-compliant implementation.
That makes sense.
Thanks!
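As a side note, a hedged sketch of what passing endpoint_url buys: boto3 then talks to whatever S3-compatible service the URL points at. The endpoint and credentials below are placeholders, not values from this PR:

```python
import boto3

# Placeholder credentials and endpoint; MinIO, Ceph RGW, or any other
# S3-protocol-compliant service can be targeted the same way.
s3 = boto3.resource(
    "s3",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    endpoint_url="http://minio.example.internal:9000",
    region_name="us-east-1",
)

# Without endpoint_url, boto3 falls back to the default AWS S3 endpoints.
for bucket in s3.buckets.all():
    print(bucket.name)
```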
@@ -0,0 +1,153 @@
{ |
Line #23. "access_key": "qEMHyz8wNw",
The access_key and secret_key are still there.
they are invalid keys
@@ -0,0 +1,153 @@
{ |
@deepanker13 Could you set keys instead of dummy keys like this:
# Need to set S3 ACCESS_KEY
S3_ACCESS_KEY = ""
done
@deepanker13 I meant that we should define vars here or a dedicated block like this:
# Need to set S3 credentials
s3_access_key = ""
s3_secret_key = ""
And then, we should use those vars here:
# it is assumed for text related tasks, you have 'text' column in the dataset.
# for more info on how dataset is loaded check load_and_preprocess_data function in sdk/python/kubeflow/trainer/hf_llm_training.py
dataset_provider_parameters=S3DatasetParams(
    {
        "endpoint_url": "http://10.117.63.3",
        "bucket_name": "test",
        "file_key": "imdatta0___ultrachat_1k",
        "region_name": "us-east-1",
+       "access_key": s3_access_key,
+       "secret_key": s3_secret_key,
    }
),
oh ok, done
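Putting the two suggestions together, the notebook cell might look roughly like this sketch (the endpoint, bucket, and file key are the example values quoted above; the import path for S3DatasetParams is an assumption):

```python
# Assumed import path for the SDK class referenced in the diff above.
from kubeflow.storage_initializer.s3 import S3DatasetParams

# Need to set S3 credentials before running the notebook.
s3_access_key = ""
s3_secret_key = ""

# It is assumed for text-related tasks that the dataset has a 'text' column;
# see load_and_preprocess_data in sdk/python/kubeflow/trainer/hf_llm_training.py.
dataset_provider_parameters = S3DatasetParams(
    {
        "endpoint_url": "http://10.117.63.3",
        "bucket_name": "test",
        "file_key": "imdatta0___ultrachat_1k",
        "region_name": "us-east-1",
        "access_key": s3_access_key,
        "secret_key": s3_secret_key,
    }
)
```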
Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>
Hi @tenzen-y, can we merge this?
Thx! /lgtm /approve
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: deepanker13, tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
* s3 as dataset source code review changes
* fixing python black test
* removing conflicts in example file
* retriggering CI
* removing dummy keys
* code review change for adding s3 keys block

Signed-off-by: deepanker13 <deepanker.gupta@nutanix.com>
What this PR does / why we need it:
Minor change
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2005
Checklist: