Llama2 training on dataset downloaded from Huggingface. #3529

Closed
sudhir2016 opened this issue Aug 12, 2023 · 13 comments
@sudhir2016

I am trying to use Ludwig to finetune Llama2 7b in Colab with an Alpaca-type dataset downloaded from Huggingface, using the example notebook provided. I have tried various approaches without success and am getting the error: LLM with quantization requires the 'local' backend to be set in the config. Please advise.

@arnavgarg1 (Contributor)

Hi @sudhir2016!

Can you add this to the Ludwig config in the Colab notebook?

backend:
    type: local
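
For context, here's a minimal sketch of how that backend block might fit into a full fine-tuning config when driving Ludwig from Python. The base model, feature names, and quantization settings below are illustrative assumptions, not values from this thread:

# Sketch only: base_model, feature names, and quantization settings are assumed for illustration.
import logging
from ludwig.api import LudwigModel

config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",  # assumed model identifier
    "quantization": {"bits": 4},
    "adapter": {"type": "lora"},
    "input_features": [{"name": "instruction", "type": "text"}],
    "output_features": [{"name": "output", "type": "text"}],
    "backend": {"type": "local"},  # the setting suggested above
}

model = LudwigModel(config=config, logging_level=logging.INFO)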

@sudhir2016 (Author)

Thank you @arnavgarg1. I did what you asked, and now I am getting this error: RuntimeError: Caught exception during model preprocessing: template invalid for zero-shot prompt: Prompt template must contain either the 'sample' field or one of the columns from the dataset. This is what the dataset I loaded from Huggingface using load_dataset looks like:

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'prompt'],
        num_rows: 18612
    })
})

@arnavgarg1 (Contributor)

Thanks @sudhir2016.

Are you able to share your Colab notebook with me? It would help me work through the issues so I can let you know exactly what needs to change to get it working.

Based on your response and the error, I think there are two different things that need to be addressed:

  1. Ludwig isn't directly compatible with the Torch DataLoader class. When passing in a dataset, Ludwig expects a Pandas DataFrame, a Dask DataFrame, a dataset located in remote storage (such as S3), or the local path of the file.

  2. Are you able to share the way you're configuring your prompt? Typically prompt templates contain:

  • Reserved keywords: {task}, {sample}, and {context}. If no optional keywords are specified, the template must at the very least contain {sample}, which under the hood takes your row of data and inserts it into the prompt.
  • Optional keywords: {col1}, {col2}, etc., where col1 and col2 are names of features in your dataset. So in the instruction fine-tuning example, we used {input} to refer to the column called input in the alpaca dataset and {instruction} to refer to the column called instruction in the alpaca dataset (see the sketch after this list).
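
As a concrete illustration, here is a sketch of a prompt template that references dataset columns. It assumes the alpaca-style instruction/input/output columns mentioned earlier in this thread, not your exact config:

# Sketch only: assumes the dataset has 'instruction' and 'input' columns.
prompt_template = (
    "Below is an instruction that describes a task, paired with an input. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction: {instruction}\n\n"
    "### Input: {input}\n\n"
    "### Response:"
)

config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",  # assumed model identifier
    "prompt": {"template": prompt_template},
    "input_features": [{"name": "instruction", "type": "text"}],
    "output_features": [{"name": "output", "type": "text"}],
}

Because {instruction} and {input} match column names in the dataset, the template satisfies the requirement in the error message above.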

Hope this helps unblock you, but I'm also more than happy to look over your Colab notebook.

@tgaddair (Collaborator) commented Aug 12, 2023

Sounds like there may be more than one issue here, but the need to manually specify the local backend is not desirable, so I'm reverting the PR that caused this (#3531) and yanking v0.8.1. We'll follow up with 0.8.2 soon, but in the meantime I suggest downgrading to v0.8.

@sudhir2016 (Author)

Thank you @tgaddair. Looking forward to further guidance.

@tgaddair (Collaborator)

Hey @sudhir2016, thanks for sharing your notebook! Can you try converting the HF dataset into a Pandas DataFrame like this:

import logging

import pandas as pd
from datasets import load_dataset
from ludwig.api import LudwigModel

# Load the HF dataset and convert the training split into a Pandas DataFrame
data = load_dataset('iamtarun/python_code_instructions_18k_alpaca')
df = pd.DataFrame(data["train"])

# Train with the same config as before, passing the DataFrame directly
model = LudwigModel(config=config, logging_level=logging.INFO)
results = model.train(dataset=df)

@tgaddair (Collaborator)

@arnavgarg1 we should add an issue to allow users to provide HuggingFace DatasetDict objects as inputs; it should be easy to convert them to Pandas automatically.
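
Until that's in place, here is a minimal sketch of the kind of conversion being proposed, done by hand (the helper name is made up for illustration, not part of Ludwig):

import pandas as pd
from datasets import DatasetDict

def dataset_dict_to_pandas(data: DatasetDict) -> pd.DataFrame:
    # Flatten every split into one DataFrame, keeping the split name in a column.
    frames = []
    for split_name, split in data.items():
        frame = split.to_pandas()
        frame["split"] = split_name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)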

@sudhir2016 (Author)

Thank you @tgaddair. I was able to train. I then tried to upload to hf_hub, but I am getting this error: Exception: Model artifacts not found at /content/results/model/model_weights. It is possible that model at '/content/results' hasn't been trained yet, or something went wrong during training where the model's weights were not saved.

The model weights directory contains three files:

  • /content/results/api_experiment_run/model/model_weights/README.md
  • /content/results/api_experiment_run/model/model_weights/adapter_config.json
  • /content/results/api_experiment_run/model/model_weights/adapter_model.bin

Please suggest.

@sudhir2016 (Author)

I thought there was some issue with my training dataset, so I trained on the "ludwig://alpaca" dataset and got exactly the same error when I tried to upload to hf_hub: Exception: Model artifacts not found at /content/results/model/model_weights. It is possible that model at '/content/results' hasn't been trained yet, or something went wrong during training where the model's weights were not saved.

@arnavgarg1 (Contributor)

Hi @sudhir2016, sorry you're running into issues with uploading your model weights out of Colab and into HuggingFace Hub. Give me a few hours to test this flow myself and come back with a solution.

Are you able to share the exact command you're running to upload these model weights to HuggingFace Hub?

@arnavgarg1 self-assigned this Aug 14, 2023
@arnavgarg1 (Contributor) commented Aug 14, 2023

Hi @sudhir2016, I was able to take a look at this and it seems like things are working fine.

Here are some instructions on how to upload the model weights successfully (cc: @tgaddair let's update your colab notebook to also include these steps):

  1. Once your model finishes training, you should see something like this:
     [Screenshot of the end-of-training output]
  2. Specifically, the thing we want to make note of is where the model is saved, which in this case is /content/results/api_experiment_run (see the line that says INFO:ludwig.api:Saved to: /content/results/api_experiment_run in the screenshot).
     [Screenshot highlighting the INFO:ludwig.api:Saved to: line]
  3. In another cell, you can run the following command: !ludwig upload hf_hub --repo_id arnavgrg/alpaca_qlora_test --model_path /content/results/api_experiment_run. You should swap repo_id with a name that is custom to you, typically in the format <your huggingface username>/<model_name>. When you run this command, it will ask you to provide a HuggingFace token, which you can create and grab from [here](https://huggingface.co/settings/tokens). Finally, it will tell you exactly where the model was uploaded. From my run (trained only for 5 steps), you can see the uploaded artifacts here: arnavgrg/alpaca_qlora_test. (An alternative upload route is sketched right after this list.)
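
If the CLI route is inconvenient inside Colab, here is an alternative sketch that pushes the same adapter weights with the huggingface_hub client directly. The repo_id and folder path are placeholders mirroring the example above, not required values:

# Sketch only: repo_id and folder_path are placeholders; adjust them to your own run.
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # token from https://huggingface.co/settings/tokens
api.create_repo(repo_id="your-username/alpaca_qlora_test", exist_ok=True)
api.upload_folder(
    repo_id="your-username/alpaca_qlora_test",
    folder_path="/content/results/api_experiment_run/model/model_weights",
)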

Let me know if this works.

@sudhir2016 (Author)

Thank you @arnavgarg1 for your detailed feedback. I am traveling, so I will try this and get back to you after a few days.

@mhabedank closed this as not planned on Oct 18, 2024